legumeinfo / microservices

A collection of microservices developed and maintained by the Legume Information System
https://legumeinfo.org/
Apache License 2.0

How to load a GFF into Redis? Need end-to-end example after fresh install. #600

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago

Are there any examples of loading a GFF file into Redis starting from scratch? I'm mystified; I can't figure out how to do it. I'm one of those users who really needs an end-to-end example of what to do. Here's what I've tried. I also don't know what a "chromosome GFF" is: my chromosomes are in a multi-FASTA. :) I just used the annotation GFF3 that has the chromosomes as sequences, but I'm guessing I need to build a "chromosome GFF" somehow. :)

[Note: I'm approaching this as an unknown GitHub user would, hopefully to help flesh out the documentation a bit.]

I'm running redis_loader 1.2.3, schema 1.1.0.

[shokin@shokin-gcv gcv-docker-compose]$ export SPECIES=dumosus
[shokin@shokin-gcv gcv-docker-compose]$ export STRAIN=PI311196
[shokin@shokin-gcv gcv-docker-compose]$ export GENE_GFF_FILE=/falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gff3 
[shokin@shokin-gcv gcv-docker-compose]$ export STRAIN=PI311196.G19833
[shokin@shokin-gcv gcv-docker-compose]$ export CHROMOSOME_GFF_FILE=/falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gff3 
[shokin@shokin-gcv gcv-docker-compose]$ export GFA_FILE=/home/shokin/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gfa.tsv
[shokin@shokin-gcv gcv-docker-compose]$ sudo docker compose -f compose.yml -f compose.prod.yml run redis_loader --help
[+] Building 0.0s (0/0)                                                                                                                                                                                            
[+] Creating 1/0
 ✔ Container gcv-redis-1  Running                                                                                                                                                                             0.0s 
[+] Building 0.0s (0/0)                                                                                                                                                                                            
usage: redis_loader [-h] [--version] [--redis-db REDIS_DB] [--redis-password REDIS_PASSWORD] [--redis-host REDIS_HOST] [--redis-port REDIS_PORT] [--chunk-size CHUNK_SIZE] [--no-save]
                    [--load-type {new,reload,append}] [--sequence-types {chromosome,supercontig,chloroplast,mitochondrion} [{chromosome,supercontig,chloroplast,mitochondrion} ...]]
                    {chado,gff} ...

Loads data from a Chado (PostreSQL) database or GFF files into a RediSearch index for use by microservices.

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --redis-db REDIS_DB   The Redis database (can also be specified using the REDIS_DB environment variable). (default: 0)
  --redis-password REDIS_PASSWORD
                        The Redis password (can also be specified using the REDIS_PASSWORD environment variable). (default: )
  --redis-host REDIS_HOST
                        The Redis host (can also be specified using the REDIS_HOST environment variable). (default: redis)
  --redis-port REDIS_PORT
                        The Redis port (can also be specified using the REDIS_PORT environment variable). (default: 6379)
  --chunk-size CHUNK_SIZE
                        The chunk size to be used for Redis batch processing (can also be specified using the CHUNK_SIZE environment variable). (default: 100)
  --no-save             Don't save the Redis database to disk after loading. (default: False)
  --load-type {new,reload,append}
                        How the data should be loaded into Redis: new - Will only load indexes if they have to be created first. reload - Will remove existing indexes before loading data. append - Will add
                        data to an existing index or create a new index. (can also be specified using the LOAD_TYPE environment variable). (default: append)
  --sequence-types {chromosome,supercontig,chloroplast,mitochondrion} [{chromosome,supercontig,chloroplast,mitochondrion} ...]
                        What sequence types should be loaded into Redis: chromosome - full nuclear chromosomes supercontig - scaffolds and contigs chloroplast - chloroplast organelle mitochondrion -
                        mitochondrial organelle (can also be specified using the SEQUENCE_TYPES environment variable). (default: chromosome)

commands:
  {chado,gff}
    chado               Load data from a Chado (PostgreSQL) database.
    gff                 Load data GFF files.

OK, from that I discern I use a command like this:

[shokin@shokin-gcv gcv-docker-compose]$ sudo docker compose -f compose.yml -f compose.prod.yml run redis_loader --load-type=reload --sequence-types=chromosome gff
[+] Building 0.0s (0/0)                                                                                                                                                                                            
[+] Creating 1/0
 ✔ Container gcv-redis-1  Running                                                                                                                                                                             0.0s 
[+] Building 0.0s (0/0)                                                                                                                                                                                            
    "chromosomeIdx" already exists in RediSearch
    Data will be appended to index "chromosomeIdx"
    "geneIdx" already exists in RediSearch
    Data will be appended to index "geneIdx"
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 354, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 350, in main
    args.command(loader, args)
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 46, in gff
    args.genus,
    ^^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'genus'

adf-ncgr commented 1 year ago

Agreed that the behavior when no args are supplied is underwhelming, but try: docker compose -f compose.yml -f compose.prod.yml run redis_loader gff --help and see if that --helps at all.
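The traceback above suggests that the gff handler reads args.genus without the option having been supplied. As an illustration only, and not redis_loader's actual code, here is a minimal argparse sketch in which missing subcommand options produce a usage error instead of an AttributeError. The option names come from commands used in this thread; which of them are really mandatory is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: a `gff` subcommand whose inputs are required."""
    parser = argparse.ArgumentParser(prog="redis_loader")
    subparsers = parser.add_subparsers(dest="command", required=True)
    gff = subparsers.add_parser("gff", help="Load data from GFF files.")
    # Assumption: these options are mandatory for a GFF load.
    for flag in ("--genus", "--species", "--gene-gff", "--chromosome-gff", "--gfa"):
        gff.add_argument(flag, required=True)
    # Assumption: the strain is optional.
    gff.add_argument("--strain")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        [
            "gff",
            "--genus=Phaseolus",
            "--species=dumosus",
            "--gene-gff=genes.gff3",
            "--chromosome-gff=chromosomes.gff3",
            "--gfa=families.gfa.tsv",
        ]
    )
    # argparse converts dashes to underscores in attribute names.
    print(args.chromosome_gff)
```

With this arrangement, running the `gff` subcommand with no options exits with a usage message that names the missing arguments.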

sammyjava commented 1 year ago

Thanks, but I'm not so much interested in getting the job done as in getting the documentation improved. I know I can ask you, @adf-ncgr, for an example. :)

But head me off at the pass if you don't want me to file newbie issues.

adf-ncgr commented 1 year ago

Well, I was just intending for you to consider whether the additional usage for gff mode was sufficient for a newbie; probably not!

sammyjava commented 1 year ago

And, what IS a chromosome GFF? Just a GFF with the @chromosome records??? Is that a separate GFF? Seriously, I've never heard of a "chromosome GFF" before, but I'm no bioinformatician. I presume that's why I got this. But, bottom line: full example with example files is GOLD. Then I could see what "chromosome GFF" is, etc.

[shokin@shokin-gcv gcv-docker-compose]$ sudo docker compose -f compose.yml -f compose.prod.yml run redis_loader gff --genus=Phaseolus --species=dumosus --strain=PI311196.G19833 --gene-gff /falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gff3 
[+] Building 0.0s (0/0)                                                                                                                                                                                            
[+] Creating 1/0
 ✔ Container gcv-redis-1  Running                                                                                                                                                                             0.0s 
[+] Building 0.0s (0/0)                                                                                                                                                                                            
    "chromosomeIdx" already exists in RediSearch
    Data will be appended to index "chromosomeIdx"
    "geneIdx" already exists in RediSearch
    Data will be appended to index "geneIdx"
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 354, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 350, in main
    args.command(loader, args)
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 49, in gff
    args.chromosome_gff,
    ^^^^^^^^^^^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'chromosome_gff'

adf-ncgr commented 1 year ago

I'm not going to presume to nominate exemplar files for the github repo at this time, but you can have a look at: /falafel/legumeinfo/data/v2/Aeschynomene/evenia/genomes/CIAT22838.gnm1.XF73/aesev.CIAT22838.gnm1.XF73.genome_main.gff3.gz
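For readers without access to that path: judging from the genome_main.gff3 listing later in this thread, a "chromosome GFF" is a GFF3 file with one record per assembled sequence, where the type column says chromosome, supercontig, and so on. The sketch below reads such a file and keeps only the wanted sequence types, which is roughly what the --sequence-types option describes. It is not redis_loader's implementation (the traceback shows that uses gffutils); `SeqRecord` and `sequence_records` are hypothetical names. It assumes tab-separated columns, as GFF3 requires.

```python
from typing import Collection, Iterable, Iterator, NamedTuple


class SeqRecord(NamedTuple):
    seq_id: str
    seq_type: str
    length: int


def sequence_records(
    lines: Iterable[str], wanted: Collection[str] = ("chromosome",)
) -> Iterator[SeqRecord]:
    """Yield one SeqRecord per GFF3 line whose type column is in `wanted`."""
    for line in lines:
        # Skip blank lines, directives (##gff-version) and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        if cols[2] in wanted:
            # GFF3 coordinates are 1-based and inclusive.
            yield SeqRecord(cols[0], cols[2], int(cols[4]) - int(cols[3]) + 1)


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "demo.Chr01\t.\tchromosome\t1\t100\t.\t.\t.\tID=demo.Chr01;Name=demo.Chr01",
        "demo.sc1\t.\tsupercontig\t1\t50\t.\t.\t.\tID=demo.sc1;Name=demo.sc1",
    ]
    for record in sequence_records(example):
        print(record)
```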

sammyjava commented 1 year ago

Ahhhh the ol' /genomes/ GFF file, rings a bell! Thanks! So, of course, the next question: is there a standard script for building those from the genome FASTA? Like fasta2gff or something? (Asking, because if there is, that should be added to the docs here since I'm not sure everyone knows what a "chromosome GFF" is, but they likely have them in a multi-FASTA.)

adf-ncgr commented 1 year ago

I use: /falafel/adf/sw/hacks/lis_fasta2gff3.pl although this doesn't attempt to solve the "what is a chromosome" question. Looks like there's another approach here: https://github.com/legumeinfo/datastore-specifications/blob/main/scripts/chrlen_to_gff.sh

Since @alancleary graduated, I think I've been forbidden from adding perl scripts to the GCV repos...

sammyjava commented 1 year ago

[shokin@dal datastore-specifications]$ scripts/chrlen_to_gff.sh ~/Phaseolus/acutifolius/genomes/Tep23.gnm1/phaac.Tep23.gnm1.genome_main.fna phaac.Tep23.gnm1
##gff-version 3
scripts/chrlen_to_gff.sh: line 39: type: unbound variable

sammyjava commented 1 year ago

[shokin@dal Tep23.gnm1]$ cat phaac.Tep23.gnm1.genome_main.fna | /falafel/adf/sw/hacks/lis_fasta2gff3.pl -type=chromosome > phaac.Tep23.gnm1.genome_main.gff3

works fine.
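Since lis_fasta2gff3.pl lives on a private path, here is a minimal Python sketch of the same idea for readers who only have a multi-FASTA: measure each sequence and emit one GFF3 record per sequence, with ID and Name set to the sequence ID as in the /genomes/ GFF files. This is not the Perl script's logic. The function names are hypothetical, and `guess_type` is a naive, made-up heuristic for the "what is a chromosome" question that you would likely replace with your own rule. It expects an uncompressed FASTA.

```python
import sys
from typing import Callable, Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (sequence_id, length) for each record in a FASTA stream."""
    seq_id, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence ID")
            # The ID is the first whitespace-delimited token after '>'.
            seq_id, length = fields[0], 0
        elif seq_id is not None:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length


def guess_type(seq_id: str) -> str:
    """Hypothetical heuristic: '...Chr01' is a chromosome, anything else a supercontig."""
    last_part = seq_id.split(".")[-1].lower()
    return "chromosome" if last_part.startswith("chr") else "supercontig"


def fasta_to_gff3(
    lines: Iterable[str], type_of: Callable[[str], str] = guess_type
) -> Iterator[str]:
    """Yield GFF3 lines: a version directive, then one record per sequence."""
    yield "##gff-version 3"
    for seq_id, length in fasta_lengths(lines):
        cols = [seq_id, ".", type_of(seq_id), "1", str(length),
                ".", ".", ".", f"ID={seq_id};Name={seq_id}"]
        yield "\t".join(cols)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python fasta_to_gff3.py genome_main.fna > genome_main.gff3
    with open(sys.argv[1]) as handle:
        for record in fasta_to_gff3(handle):
            print(record)
```

To force every record to a single type, as the -type=chromosome flag does above, pass `type_of=lambda _id: "chromosome"`.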

sammyjava commented 1 year ago

I think I'm gonna quit this exercise.

[shokin@shokin-gcv gcv-docker-compose]$ sudo docker compose -f compose.yml -f compose.prod.yml run redis_loader gff --genus=Phaseolus --species=dumosus --strain=PI311196.G19833 --gene-gff=/falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gff3 --chromosome-gff=/falafel/gepts_lab/legumeinfo/Phaseolus/dumosus/genomes/PI311196.gnm1/phadu.PI311196.gnm1.genome_main.gff3 --gfa=/falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gfa.tsv 
[+] Building 0.0s (0/0)                                                                                                                                                                                            
[+] Creating 1/0
 ✔ Container gcv-redis-1  Running                                                                                                                                                                             0.0s 
[+] Building 0.0s (0/0)                                                                                                                                                                                            
    "chromosomeIdx" already exists in RediSearch
    Data will be appended to index "chromosomeIdx"
    "geneIdx" already exists in RediSearch
    Data will be appended to index "geneIdx"
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 354, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 350, in main
    args.command(loader, args)
  File "/usr/local/lib/python3.11/site-packages/redis_loader/__main__.py", line 44, in gff
    loadFromGFF(
  File "/usr/local/lib/python3.11/site-packages/redis_loader/loaders/gff.py", line 125, in loadFromGFF
    transferChromosomes(redisearch_loader, genus, species, chromosome_gff)
  File "/usr/local/lib/python3.11/site-packages/redis_loader/loaders/gff.py", line 29, in transferChromosomes
    gffutils.create_db(
  File "/usr/local/lib/python3.11/site-packages/gffutils/create.py", line 1359, in create_db
    iterator = iterators.DataIterator(**kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gffutils/iterators.py", line 314, in DataIterator
    raise ValueError(
ValueError: /falafel/gepts_lab/legumeinfo/Phaseolus/dumosus/genomes/PI311196.gnm1/phadu.PI311196.gnm1.genome_main.gff3 cannot be found and does not appear to be a URL

[shokin@shokin-gcv gcv-docker-compose]$ cat /falafel/gepts_lab/legumeinfo/Phaseolus/dumosus/genomes/PI311196.gnm1/phadu.PI311196.gnm1.genome_main.gff3
##gff-version 3
phadu.PI311196.gnm1.Chr01   .   chromosome  1   59123956    .   .   .   ID=phadu.PI311196.gnm1.Chr01;Name=phadu.PI311196.gnm1.Chr01
phadu.PI311196.gnm1.Chr02   .   chromosome  1   61340039    .   .   .   ID=phadu.PI311196.gnm1.Chr02;Name=phadu.PI311196.gnm1.Chr02
phadu.PI311196.gnm1.Chr03   .   chromosome  1   59791010    .   .   .   ID=phadu.PI311196.gnm1.Chr03;Name=phadu.PI311196.gnm1.Chr03
phadu.PI311196.gnm1.Chr04   .   chromosome  1   61329659    .   .   .   ID=phadu.PI311196.gnm1.Chr04;Name=phadu.PI311196.gnm1.Chr04
phadu.PI311196.gnm1.Chr05   .   chromosome  1   52062745    .   .   .   ID=phadu.PI311196.gnm1.Chr05;Name=phadu.PI311196.gnm1.Chr05
phadu.PI311196.gnm1.Chr06   .   chromosome  1   33783503    .   .   .   ID=phadu.PI311196.gnm1.Chr06;Name=phadu.PI311196.gnm1.Chr06
phadu.PI311196.gnm1.Chr07   .   chromosome  1   64926652    .   .   .   ID=phadu.PI311196.gnm1.Chr07;Name=phadu.PI311196.gnm1.Chr07
phadu.PI311196.gnm1.Chr08   .   chromosome  1   76278163    .   .   .   ID=phadu.PI311196.gnm1.Chr08;Name=phadu.PI311196.gnm1.Chr08
phadu.PI311196.gnm1.Chr09   .   chromosome  1   45873673    .   .   .   ID=phadu.PI311196.gnm1.Chr09;Name=phadu.PI311196.gnm1.Chr09
phadu.PI311196.gnm1.Chr10   .   chromosome  1   54228812    .   .   .   ID=phadu.PI311196.gnm1.Chr10;Name=phadu.PI311196.gnm1.Chr10
phadu.PI311196.gnm1.Chr11   .   chromosome  1   66020725    .   .   .   ID=phadu.PI311196.gnm1.Chr11;Name=phadu.PI311196.gnm1.Chr11
phadu.PI311196.gnm1.Super-Scaffold_27_32    .   supercontig 1   8582530 .   .   .   ID=phadu.PI311196.gnm1.Super-Scaffold_27_32;Name=phadu.PI311196.gnm1.Super-Scaffold_27_32
[shokin@shokin-gcv gcv-docker-compose]$

adf-ncgr commented 1 year ago

I think this is simply a matter of the script not being able to see paths outside the container. Try bind-mounting the directory, e.g. -v /falafel:/falafel on docker compose run, or something similar.
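One way to apply that suggestion is to mount each input file's directory at the same path inside the container, so the host paths passed to redis_loader stay valid. The sketch below only assembles the command line. It has not been run against the actual GCV stack, `compose_run_argv` is a hypothetical helper, and mounting each parent directory is an assumption about what the container needs. The compose file names and the gff options are the ones used earlier in this thread; -v/--volume is a standard option of docker compose run.

```python
import shlex
from pathlib import PurePosixPath
from typing import Dict, List


def compose_run_argv(gff_args: Dict[str, str], files: Dict[str, str]) -> List[str]:
    """Build a `docker compose run` argv that bind-mounts each input file's directory."""
    argv = ["docker", "compose", "-f", "compose.yml", "-f", "compose.prod.yml", "run"]
    # Mount every distinct parent directory at the same path in the container.
    for directory in sorted({str(PurePosixPath(p).parent) for p in files.values()}):
        argv += ["-v", f"{directory}:{directory}"]
    # Options for `run` must come before the service name.
    argv += ["redis_loader", "gff"]
    argv += [f"--{name}={value}" for name, value in gff_args.items()]
    argv += [f"--{name}={path}" for name, path in files.items()]
    return argv


if __name__ == "__main__":
    argv = compose_run_argv(
        {"genus": "Phaseolus", "species": "dumosus", "strain": "PI311196.G19833"},
        {
            "gene-gff": "/falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gff3",
            "chromosome-gff": "/falafel/gepts_lab/legumeinfo/Phaseolus/dumosus/genomes/PI311196.gnm1/phadu.PI311196.gnm1.genome_main.gff3",
            "gfa": "/falafel/shokin/ph-pangenome/liftoff/PI311196/G19833/phadu.PI311196.gnm1.phavu.G19833.gnm2.ann1.gfa.tsv",
        },
    )
    # Print a copy-pasteable command rather than executing it.
    print(shlex.join(argv))
```

Printing the command instead of running it keeps the sketch safe to try; prepend sudo if your Docker setup requires it, as in the sessions above.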

sammyjava commented 1 year ago

Whatever. I'm not really keen to learn how to build a GCV; I just thought I'd give it a quick shot. I think I'll close this issue and post a new one saying that a HOWTO for building from a GFF would be helpful. Then I'll review that.