legumeinfo / microservices

A collection of microservices developed and maintained by the Legume Information System
https://legumeinfo.org/
Apache License 2.0

Implement GFF loading script #10

Closed alancleary closed 3 years ago

alancleary commented 4 years ago

As part of the move away from Chado via microservices we need to implement a script that loads data into the GCV Redis database directly from GFF files. There is already a script that loads data from Chado. This will be maintained for migration and adoptability purposes. As such, I'll isolate the part of that script that actually puts data into Redis into its own module so both scripts can use the code. This will make updating the loading scripts when the Redis schema changes a matter of changing one file.
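
A rough sketch of that split (names hypothetical, not the actual module): the source-specific loaders produce plain records, and a single writer callable owns all Redis schema knowledge, so a schema change touches one file.

```python
# Hypothetical sketch of the shared-module split described above: both the
# Chado and GFF loaders parse their source into a common record shape, then
# hand the records to one writer that owns the Redis schema.

def transfer_chromosomes(records, writer):
    """Push chromosome records through the shared writer.

    records: iterable of dicts like {'name': ..., 'length': ...}
    writer:  callable that knows the Redis schema (the single place to
             update when the schema changes)
    """
    for record in records:
        writer('chromosome', record)

def collecting_writer(sink):
    """A stand-in writer for illustration; the real one would call redisearch."""
    def write(kind, record):
        sink.append((kind, record))
    return write
```

With this shape, a `transfer_genes` counterpart would reuse the same writer, and only the writer changes when the schema does.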

The GFF loader should be written in Python, use spaces instead of tabs, and conform to the PEP8 code style guide. Also, use of the loader should be documented in the README.md file of the scripts directory.

This issue is blocked by issue #9 and shouldn't be started until #9 is closed.

sammyjava commented 4 years ago

Ohhh I have to write it in Python. I'm a bad choice, my Python skills are next to zero. I thought I got to choose the language.

alancleary commented 4 years ago

If it were a microservice you were going to take ownership of then you would get to choose the language. However, the language of this project for all things not user interface or microservice related is Python.

Are you sure you don't want to take a stab at it? I'm pretty sure you could use a Biopython library for the GFF parsing and the Chado loader should show you how the data needs to be handled.
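
Whichever library does the parsing, the target record shape is simple; here is a minimal stdlib sketch of one GFF3 feature line (illustrative only; the real loader would lean on gffutils or Biopython for the edge cases):

```python
def parse_gff3_line(line):
    """Parse one tab-delimited GFF3 feature line into a dict.

    Returns None for comment/pragma lines (starting with '#').
    """
    if line.startswith('#'):
        return None
    cols = line.rstrip('\n').split('\t')
    if len(cols) != 9:
        raise ValueError('expected 9 columns, got %d' % len(cols))
    # Column 9 is semicolon-separated key=value attribute pairs.
    attributes = {}
    for pair in cols[8].split(';'):
        if pair and '=' in pair:
            key, _, value = pair.partition('=')
            attributes[key.strip()] = value
    return {
        'seqid': cols[0], 'source': cols[1], 'type': cols[2],
        'start': int(cols[3]), 'end': int(cols[4]),
        'strand': cols[6], 'attributes': attributes,
    }
```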

sammyjava commented 4 years ago

Sure, I'll take a stab at it in Python.

adf-ncgr commented 4 years ago

I feel like I'm bearing witness to the blossoming of a benevolent tyranny. I suspect the tyrant is having his benevolent revenge on certain perl scripts he was once forced to write in support of such tasks back in the day. PEP8 or PERISH!

sammyjava commented 4 years ago

Nah, no underlying tyrannical agenda, I just haven't written a ton of Python, and originally thought I was going to write it in Java. All grievances with Perl were put to rest by the Revolution (and subsequent Civil War).

alancleary commented 4 years ago

@sammyjava @adf-ncgr The Redis schema is good to go. We should discuss what structure both the Chado and GFF loaders can put data into for the Redis loading functions and then I'll encapsulate/update that code so it can be used by both loaders.

nathanweeks commented 4 years ago

Since the lack of a loader is blocking #11, if it's much more expedient for @sammyjava to bang out an initial loader in Java, it should be straightforward to containerize and add as a service (say, "load_gff") in the Docker Compose file that @alancleary added... and we can translate later if need be (IMHO)

sammyjava commented 4 years ago

I'm fine with Python, Java, whatever, if there's a reason to go with Python. If it absolutely does not matter, then my preference would be Java, but it sounds like Python makes more sense.

alancleary commented 4 years ago

As far as the master branch of the main GCV repo is concerned, the Docker Compose files and microservices still work. So unless @adf-ncgr or some other authority within the project is itching to get the microservices epic out the door, I think we should just do it in Python. That way if I decide to change the Redis schema in the near future only the Redis loader that the GFF and Chado loaders share will need to be modified, i.e. Sam doesn't have to learn the underlying Redis schema or take ownership of the GFF loader.

adf-ncgr commented 4 years ago

I am not itching yet, and the benevolence of your dictatorship still leads us on...

nathanweeks commented 4 years ago

Doing it the "right" way (in Python) the first time is fine by me. I just wasn't sure if we were doing the equivalent of asking James Gosling to go learn Python before contributing loader scripts.

sammyjava commented 3 years ago

I'm getting on this. Just to confirm, @alancleary : write gff_to_redisearch.py which mimics chado_to_redisearch.py by loading chromosomes (transferChromosomes) and genes (transferGenes) from a GFF, both run in transferData. You don't need sequences, so I don't need the genomic FASTA for the chromosomes.

adf-ncgr commented 3 years ago

We do need to know how long the chromosomes (or supercontigs) are, though. Most of our gffs don't encode that info, methinks. Could be gotten from the fai files.
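
If the .fai route is taken, the index format is simple enough to read directly. A minimal sketch, assuming a standard samtools faidx index where each tab-delimited line starts with the sequence name and its length:

```python
def lengths_from_fai(fai_path):
    """Read sequence lengths from a samtools faidx '.fai' index.

    Each .fai line is tab-delimited: name, length, byte offset,
    bases per line, bytes per line. Only the first two columns matter here.
    """
    lengths = {}
    with open(fai_path) as fai:
        for line in fai:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths
```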

alancleary commented 3 years ago

@sammyjava That's correct. I still intend to separate the code that puts data into Redis from the code that parses/loads the data from wherever. Ideally the former will be shared by the two loading scripts so maintenance is easy. For now let's just get GFF loading running and we'll worry about convergence later.

sammyjava commented 3 years ago

> We do need to know how long the chromosomes (or supercontigs) are, though. Most of our gffs don't encode that info, methinks. Could be gotten from the fai files.

Well, they should. I say we add the chromosomes to the GFFs. That's pretty commonly done, both through the headers and regular lines. I can handle that. To wit:

##gff-version 3
##sequence-region   1 1 307041717
##sequence-region   10 1 150982314
##sequence-region   2 1 244442276
##sequence-region   3 1 235667834
##sequence-region   4 1 246994605
##sequence-region   5 1 223902240
##sequence-region   6 1 174033170
##sequence-region   7 1 182381542
##sequence-region   8 1 181122637
##sequence-region   9 1 159769782
##sequence-region   B73V4_ctg1 1 50531
...
1   wareLab chromosome  1   307041717   .   .   .   ID=chromosome:1
10  wareLab chromosome  1   150982314   .   .   .   ID=chromosome:10
2   wareLab chromosome  1   244442276   .   .   .   ID=chromosome:2
3   wareLab chromosome  1   235667834   .   .   .   ID=chromosome:3
4   wareLab chromosome  1   246994605   .   .   .   ID=chromosome:4
5   wareLab chromosome  1   223902240   .   .   .   ID=chromosome:5
6   wareLab chromosome  1   174033170   .   .   .   ID=chromosome:6
7   wareLab chromosome  1   182381542   .   .   .   ID=chromosome:7
8   wareLab chromosome  1   181122637   .   .   .   ID=chromosome:8
9   wareLab chromosome  1   159769782   .   .   .   ID=chromosome:9
Mt  wareLab chromosome  1   569630  .   .   .   ID=chromosome:Mt;Alias=AY506529.1;Is_circular=true
Pt  wareLab chromosome  1   140384  .   .   .   ID=chromosome:Pt;Alias=X86563.2;Is_circular=true
B73V4_ctg1  wareLab contig  1   50531   .   .   .   ID=contig:B73V4_ctg1
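
A sketch of how a loader could pick the lengths out of those `##sequence-region` pragmas (illustrative only, not the actual loader code):

```python
def sequence_regions(gff_lines):
    """Collect lengths from '##sequence-region <seqid> <start> <end>' pragmas."""
    lengths = {}
    for line in gff_lines:
        if line.startswith('##sequence-region'):
            parts = line.split()
            if len(parts) == 4:
                # end coordinate of a full-length region == sequence length
                lengths[parts[1]] = int(parts[3])
    return lengths
```
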
sammyjava commented 3 years ago

I guess we should pick between the terms "contig" and "supercontig" in the GFF records, since we need to distinguish them from chromosomes. I'd lean toward "contig" since they can be any size, and that's what Doreen Ware does, and that's good enough for me.

StevenCannon-USDA commented 3 years ago

Agree: it is often useful to have the chromosome (and scaffold) sizes. We've been adding those in VCF files; might as well in the GFFs too. Regarding "contig" vs. "scaffold" or "supercontig": for most assemblies we deal with, "scaffold" is the most common descriptor (scaffold usually being composed of contigs). We occasionally see "contig" in small leftover stuff that scaffolders weren't able to pull into a larger unit. Unless it causes problems, I would try to stick with the terms we see in the assemblies - i.e. chromosome, scaffold, (occasionally) contig.

adf-ncgr commented 3 years ago

I think we were previously leaning towards putting per-sequence metadata like type and original names into a separate file ala NCBI's assembly reports. I don't think we ought to make new requirements/changes to our gffs without an RFO. For example, a browser that was feeding directly off said gffs might end up getting a big bar spanning each reference sequence (unless we did this as "headers" rather than features); likely no lives would be lost, but still... and in any case, our censors should be notified of the new requirements if they exist.

VCF spec requires the refseqs in the header, so it's a somewhat different kettle of beans, methinks.

sammyjava commented 3 years ago

It's a spec update either way. I'm fine with creating new GFFs, which is simpler; they should be in the assembly, not annotation, directories alongside the FASTAs. Shall I create said spec in the usual place? Here's an example: Phaseolus_vulgaris/G19833.gnm2.fC0g/phavu.G19833.gnm2.fC0g.genome_main.gff3

##gff-version   3
phavu.G19833.gnm2.Chr01 LIS chromosome  1   51433939    .   .   .   ID=phavu.G19833.gnm2.Chr01
phavu.G19833.gnm2.Chr02 LIS chromosome  1   49670989    .   .   .   ID=phavu.G19833.gnm2.Chr02
phavu.G19833.gnm2.Chr03 LIS chromosome  1   53438756    .   .   .   ID=phavu.G19833.gnm2.Chr03
phavu.G19833.gnm2.Chr04 LIS chromosome  1   48048378    .   .   .   ID=phavu.G19833.gnm2.Chr04
phavu.G19833.gnm2.Chr05 LIS chromosome  1   40923498    .   .   .   ID=phavu.G19833.gnm2.Chr05
phavu.G19833.gnm2.Chr06 LIS chromosome  1   31236378    .   .   .   ID=phavu.G19833.gnm2.Chr06
phavu.G19833.gnm2.Chr07 LIS chromosome  1   40041001    .   .   .   ID=phavu.G19833.gnm2.Chr07
phavu.G19833.gnm2.Chr08 LIS chromosome  1   63048260    .   .   .   ID=phavu.G19833.gnm2.Chr08
phavu.G19833.gnm2.Chr09 LIS chromosome  1   38250102    .   .   .   ID=phavu.G19833.gnm2.Chr09
phavu.G19833.gnm2.Chr10 LIS chromosome  1   44302882    .   .   .   ID=phavu.G19833.gnm2.Chr10
phavu.G19833.gnm2.Chr11 LIS chromosome  1   53580169    .   .   .   ID=phavu.G19833.gnm2.Chr11
phavu.G19833.gnm2.scaffold_12   LIS scaffold    1   1160169 .   .   .   ID=phavu.G19833.gnm2.scaffold_12    
phavu.G19833.gnm2.scaffold_13   LIS scaffold    1   1050217 .   .   .   ID=phavu.G19833.gnm2.scaffold_13    
phavu.G19833.gnm2.scaffold_14   LIS scaffold    1   837347  .   .   .   ID=phavu.G19833.gnm2.scaffold_14
...
adf-ncgr commented 3 years ago

works for me; coincidentally, that's exactly how I go about loading minimalist chadoi for GCV purposes.

StevenCannon-USDA commented 3 years ago

@adf-ncgr - hmm. Good points & reminder. I suppose yes: a significant new requirement for the GFFs shouldn't be undertaken lightly. For example, the extra stuff couldn't just be tacked on to the front, if we require them to be sorted. Looking through our current assembly files: I find no distinct GFFs for the assembly sizes - even though several applications use those (at least GBrowse and GCViT). Filename proposal: gensp.accsn.gnm#.KEY4.chr_scaff_sizes.gff3

I have put my little molecule-size-calculating script at /usr/local/www/data/about_the_data_store/scripts/chrlen_to_gff.sh

Usage: chrlen_to_gff.sh *genome_main.fna glyma.Lee.gnm1 > gensp.accsn.gnm#.KEY4.chr_scaff_sizes.gff3
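
The script itself isn't shown here; a hypothetical Python equivalent, computing molecule sizes straight from the FASTA and emitting feature lines like the examples in this thread, might look like:

```python
def fasta_lengths(fasta_path):
    """Compute sequence lengths from a FASTA file (pure stdlib)."""
    lengths = {}
    name = None
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.rstrip('\n')
            if line.startswith('>'):
                name = line[1:].split()[0]  # drop any description text
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths

def to_gff_lines(lengths, source='LIS', soterm='chromosome'):
    """Emit one 9-column GFF3 feature line per sequence."""
    return ['\t'.join([name, source, soterm, '1', str(n),
                       '.', '.', '.', 'ID=%s' % name])
            for name, n in lengths.items()]
```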

sammyjava commented 3 years ago

I looked at the SO Wiki re. landmark features, and no specific guidance is given, other than to skip the ##sequence-region pragma as being "poor practice": http://www.sequenceontology.org/so_wiki/index.php/Landmark_features

However, I will note that in LIS chado, we store them exclusively as "supercontig", which is also what they're called in the mines. So I recommend we call the non-chromosome feature "supercontig" for LIS consistency. We're using these GFFs for our own purposes, they don't need to reflect what the originators called them, which can be a variety of things, which gets confusing.

lis-chado=> select count(*) from feature where type_id =43174; *supercontig*
 count  
--------
 263674
(1 row)

lis-chado=> select count(*) from feature where type_id =42961; *contig*
 count 
-------
     0
(1 row)

lis-chado=> select count(*) from feature where type_id =48866; *scaffold*
 count 
-------
     0
sammyjava commented 3 years ago

@cann0010 why not simply gensp.accsn.gnm.KEY4.genome_main.gff3 ? They are GFFs derived from the corresponding FASTA, no added information. chr_scaff_sizes just adds an extra nugget that doesn't seem to add any information to me.

It shouldn't be confused with the annotation GFF because (1) it's in a different directory and (2) it doesn't contain ann.

StevenCannon-USDA commented 3 years ago

@sammyjava OK :-) gensp.accsn.gnm.KEY4.genome_main.gff3 ... and supercontig

adf-ncgr commented 3 years ago

I agree with @sammyjava's proposal for naming; as far as SO terms for types are concerned, I should note that I have recently started using mitochondrial_chromosome and chloroplast_chromosome. Because I aspire to biological semantic purity (but don't think scaffold vs contig vs whatever qualifies as biological...)

sammyjava commented 3 years ago

Hrm, @adf-ncgr usually loaders treat non-nuclear as "chromosome", and GFFs typically label them as such, to wit:

Mt  wareLab chromosome  1   569630  .   .   .   ID=chromosome:Mt;Alias=AY506529.1;Is_circular=true
Pt  wareLab chromosome  1   140384  .   .   .   ID=chromosome:Pt;Alias=X86563.2;Is_circular=true

or

Mt  TAIR10  chromosome  1   366924  .   .   .   ID=chromosome:Mt;Alias=Y08501.2,NC_001284.2;Is_circular=true
Pt  TAIR10  chromosome  1   154478  .   .   .   ID=chromosome:Pt;Alias=AP000423.1,NC_000932.1;Is_circular=true

which I guess may be an Ensembl standard since these are the same. Which is a good standard IMO. By using the child terms you may risk loaders dropping them, or having to code in two extra feature types to pull from GFFs. Although I applaud your ontological exactness.

chromosome (SO:0000340)
-- chloroplast_chromosome (SO:0000820)
-- mitochondrial_chromosome (SO:0000819)
adf-ncgr commented 3 years ago

@sammyjava I might argue that they are sufficiently different from nuclear chromosomes that some applications would want to ignore them. But, I don't feel super-strongly about this issue, so let's hear what @cann0010 thinks before coming to a final decision

sammyjava commented 3 years ago

I'll spec it anyway, which requires a papal blessing.

adf-ncgr commented 3 years ago

I didn't expect the Soybase Inquisition! ;)

StevenCannon-USDA commented 3 years ago

I don't think we have any cases of the mitochondrial or chloroplast genomes being included (intentionally) in genome_main. I think Phytozome usually splits them out into separate files. So, maybe moot - but I suppose could be spec'd "in case."

adf-ncgr commented 3 years ago

pretty sure we have a couple (though maybe not from Phytozome). Lupinus albus, and Medicago truncatula come to mind. NCBI adds them into the RefSeq representation even when excluded from the base assembly chosen to be the RefSeq representative version.

sammyjava commented 3 years ago

Yeah, I've spec'ed it to use "chromosome" for those until otherwise overridden. Here's an example that @adf-ncgr mentions:

>lupal.Amiga.gnm1.Lalb_Chr01 len=23521228
>lupal.Amiga.gnm1.Lalb_Chr02 len=17964110
...
>lupal.Amiga.gnm1.Lalb_Chr25 len=16405666
>lupal.Amiga.gnm1.Lalb_Chr00c01 len=1566442
...
>lupal.Amiga.gnm1.Lalb_Chr00c64 len=17842
>lupal.Amiga.gnm1.Lalb_chloroplast_genome 
>lupal.Amiga.gnm1.Lalb_mitochondrion
sammyjava commented 3 years ago

@alancleary I see that your Redis schema includes only name, length, genus, and species:

fields = [
    redisearch.TextField('name'),
    redisearch.NumericField('length'),
    redisearch.TextField('genus'),
    redisearch.TextField('species'),
]

while in LIS we have accessions/strains, assembly versions (gnm), and annotation versions (ann). I could see storing only a single assembly+annotation version per strain, but we certainly have many strains per species.

Do you want me to concatenate species_strain (vulgaris_G19833) as done in chado, or shall we begin storing strains as an additional field? And skip recording the assembly and annotation versions? I'm fine with either, but the chado thing of species_strain is kinda kludgey.

adf-ncgr commented 3 years ago

agreed that what we did in chado is a kludge (to chado's credit, there is an updated version of schema that contains an actual field for "infraspecies" or some such moniker).

whatever we do (including with genus and species) we should think about how it will be encoded in the source files (and make this "friendly" to potential adopters of GCV who aren't necessarily into our metadata/naming conventions). The gff3 spec has some "pragmas" about some of this, but I'm not sure that's necessarily the best approach.

sammyjava commented 3 years ago

@alancleary OH we need gene families, of course. That's a third file. Just making a note of it here. The "GFF" loader loads three files: chromosomes (GFF), genes (GFF), and gene family assignments (GFA).

sammyjava commented 3 years ago

@alancleary hit a snag already. I've installed all the requirements via pip3; the only one already on my system is six, 1.14.0 rather than 1.15.0. Oh, and it's python 3.8.2.

shokin@morangie:~/gcv-microservices/database$ ./chado_to_redisearch.py 
  File "./chado_to_redisearch.py", line 39
    parser.add_argument('--pdb', action=EnvArg, envvar=pdb_envvar, type=str, default='chado', help=f'The PostgreSQL database (can also be specified using the {pdb_envvar} environment variable).')
                                                                                                                                                                                                 ^
SyntaxError: invalid syntax
adf-ncgr commented 3 years ago

are you sure the python in your PATH is 3.8.2? the script seems to be set up to use whatever you have:

#!/usr/bin/env python

probably expecting it to be in a virtualenv. Try invoking it explicitly with the python3 interpreter and see what happens.

alancleary commented 3 years ago

Sorry for the delay.

I'm not exactly crazy about changing the schema at the moment so for the time being why don't we encode that sort of info in the name as is being done by LIS. Ideally that'll allow LIS to use the GFF loader rather than the Chado loader and GCV will still be able to inter-operate with other tools in LIS using names. However, I think the proper formatting of the names should be left as a task for the user since not everyone in the world is going to have/want the same naming scheme as LIS.

Regarding dependencies, you should use Python virtual environments when developing in Python so you don't have to deal with system packages conflicting with the dependencies of whatever you're working on.

sammyjava commented 3 years ago

Got it. A bit confused by the choice of pipenv vs. virtualenv -- which do you use? Is there a way to automatically install the requirements, or is requirements.txt just a list for me to install manually? (I had to manually pip install a few, but not all.) I've added gffutils==0.10.1 and biopython==1.78 to database/requirements.txt.

sammyjava commented 3 years ago

FYI.

(gcv-microservices) shokin@morangie:~/gcv-microservices$ pip list
Package     Version
----------- ---------
argcomplete 1.12.1
argh        0.26.2
biopython   1.78
certifi     2020.6.20
chardet     3.0.4
gffutils    0.10.1
hiredis     1.1.0
idna        2.10
numpy       1.19.3
pip         20.2.4
psycopg2    2.8.6
pyfaidx     0.5.9.1
redis       3.5.3
redisearch  2.0.0
requests    2.24.0
rmtest      0.7.0
setuptools  50.3.2
simplejson  3.17.2
six         1.15.0
urllib3     1.25.11
wheel       0.35.1
sammyjava commented 3 years ago

> I'm not exactly crazy about changing the schema at the moment so for the time being why don't we encode that sort of info in the name as is being done by LIS. Ideally that'll allow LIS to use the GFF loader rather than the Chado loader and GCV will still be able to inter-operate with other tools in LIS using names. However, I think the proper formatting of the names should be left as a task for the user since not everyone in the world is going to have/want the same naming scheme as LIS.

So by that I presume you mean add them as command-line parameters: --genus=Phaseolus --species=vulgaris --strain=G19833. Works for me. genus and species required, strain optional.
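
A sketch of what those flags could look like with argparse (hypothetical flag names matching the proposal above, not the actual loader interface):

```python
import argparse

def build_parser():
    """CLI sketch: genus and species required, strain optional."""
    parser = argparse.ArgumentParser(description='Load GFF data into RediSearch.')
    parser.add_argument('--genus', required=True, help='e.g. Phaseolus')
    parser.add_argument('--species', required=True, help='e.g. vulgaris')
    parser.add_argument('--strain', default=None, help='optional, e.g. G19833')
    return parser
```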

alancleary commented 3 years ago

Whoops. That's not the article I meant to link. Anyway, I use virtualenv:

$ virtualenv venv
$ . ./venv/bin/activate
(venv) $ pip install -r requirements.txt

If you want to update what's in the requirements file:

(venv) $ pip freeze > requirements.txt

I must've misunderstood what you were saying earlier about genus/species/strain. If they're not provided in the GFF then I would add them as command-line parameters/arguments as you suggested, though strain currently isn't encoded in the schema and I'm not worried about it for the time being.

sammyjava commented 3 years ago

Thanks for the virtualenv tips.

> I must've misunderstood what you were saying earlier about genus/species/strain. If they're not provided in the GFF then I would add them as command-line parameters/arguments as you suggested, though strain currently isn't encoded in the schema and I'm not worried about it for the time being.

OK, I'll leave strain out for now, then.

However, the LIS datastore has many strains per species. We need to incorporate strain somehow in the schema. (And perhaps even genome and assembly versions, since we have multiple of those per strain in a few cases.)

Here is what's currently in LegumeMine (just to hammer the point home):

   genus   |    species    |    strain    
-----------+---------------+--------------
 Arachis   | duranensis    | V14167
 Arachis   | hypogaea      | Tifrunner
 Arachis   | ipaensis      | K30076
 Cajanus   | cajan         | ICPL87119
 Cicer     | arietinum     | CDCFrontier
 Cicer     | arietinum     | ICC4958
 Glycine   | max           | Lee
 Glycine   | max           | Wm82
 Glycine   | max           | Zh13
 Glycine   | soja          | PI483463
 Glycine   | soja          | W05
 Lotus     | japonicus     | MG20
 Lupinus   | albus         | Amiga
 Lupinus   | angustifolius | Tanjil
 Medicago  | sativa        | XinJiangDaYe
 Medicago  | truncatula    | A17_HM341
 Medicago  | truncatula    | jemalong_A17
 Phaseolus | lunatus       | G27455
 Phaseolus | vulgaris      | G19833
 Phaseolus | vulgaris      | UI111
 Trifolium | pratense      | MilvusB
 Vigna     | angularis     | Gyeongwon
 Vigna     | angularis     | Shumari
 Vigna     | radiata       | VC1973A
 Vigna     | unguiculata   | CB5-2
 Vigna     | unguiculata   | IT97K-499-35
 Vigna     | unguiculata   | Sanzi
 Vigna     | unguiculata   | Suvita2
 Vigna     | unguiculata   | TZ30
 Vigna     | unguiculata   | UCR779
 Vigna     | unguiculata   | Xiabao_II
 Vigna     | unguiculata   | ZN016
sammyjava commented 3 years ago

@alancleary could you give me the commands (FT.CREATE, etc.) you use to initialize the Redis indexes with the appropriate parameters? I need to create the chromosomeIdx, etc. and I don't see any scripts for doing that in the repo. Thanks!

Nevermind, I see that I had to run it twice. Or I may have had the relevant line commented out. It's there now in any case.
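
For reference, the redisearch-py client issues an FT.CREATE when building an index; a small helper that just constructs the raw argument list (no server required) shows the shape. This assumes the older v1-style syntax; newer RediSearch versions add an ON HASH clause:

```python
def ft_create_args(index_name, fields):
    """Build the raw argument list for a RediSearch v1-style FT.CREATE,
    where fields is a list of (name, type) pairs such as
    [('name', 'TEXT'), ('length', 'NUMERIC')]."""
    args = ['FT.CREATE', index_name, 'SCHEMA']
    for field_name, field_type in fields:
        args += [field_name, field_type]
    return args
```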

sammyjava commented 3 years ago

Actually that brings up a point. Your script loads the entire chado database in one swell foop. Which makes sense since the chado db is a single thing. But the GFF/GFA loader will be dealing with three files for each species. So I think we should make this loader append to the indexes rather than recreate them. With a flag to clear them if desired. I'll keep working on loading a single set of files, but that will just place a single species in the Redis. Then I'll switch to append functionality, with the optional clear.
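
The append-vs-recreate decision is small enough to sketch as a pure function (hypothetical flag semantics, not the actual loader):

```python
def plan_index_action(index_exists, append=False, clear=False):
    """Decide what to do with an index before loading.

    Returns 'create' (index absent), 'recreate' (drop and reload),
    or 'append' (keep the index and add to it).
    """
    if append and clear:
        raise ValueError('--append and --clear are mutually exclusive')
    if not index_exists:
        return 'create'
    if append:
        return 'append'
    # default (and --clear): nuke and start over, as the Chado loader does today
    return 'recreate'
```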

adf-ncgr commented 3 years ago

append option makes good sense here. WRT strain, how about we add the cmdline option for now and just implement it as the same hack we did with chado until the schema supports it as a thing.

alancleary commented 3 years ago

Appending is on my to-do list. In fact, there's already a couple comments about it in the Chado loader:

# TODO: there should be an extend argparse flag that prevents deletion

Currently if the loader sees the keys it needs are already in the DB then it nukes them and starts over, unless you gave the --no-reload flag, then it just fails.

It probably makes sense to write the GFF loader to only load one set of files and then write a bash script to load an entire directory by repeatedly calling the loader.
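
A sketch of that wrapper idea in Python rather than bash (the --dir/--append flags are hypothetical): it only builds the per-directory invocations, leaving execution to the caller:

```python
import os

def loader_commands(data_dir, loader='gff_to_redisearch.py'):
    """Build one loader invocation per species subdirectory.

    Dry run: commands are returned as argument lists, not executed.
    """
    commands = []
    for entry in sorted(os.listdir(data_dir)):
        subdir = os.path.join(data_dir, entry)
        if os.path.isdir(subdir):
            commands.append([loader, '--dir', subdir, '--append'])
    return commands
```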

sammyjava commented 3 years ago

> It probably makes sense to write the GFF loader to only load one set of files and then write a bash script to load an entire directory by repeatedly calling the loader.

Exactly, that was my plan.

As for strain, I've dropped it for now, easy to implement if we need it. chado only has those unusual cases using the underscore, so I don't think it's really a Thing That Matters yet. When we add strain to the schema I'll add it as a command line option.

Not surprisingly, the thing I've run up against now is gffutils disliking a line in the LIS annotation GFF I'm loading. Shocking, I know. There are lots of options for the loader to deal with stuff it doesn't like, so I'll get going on that Monday, and then code index append mode and then it should be good to go. I'll test it against all the GFFs I load into the mines, since I expect more complaints to be forthcoming. LIS datastore files are a wonderfully diverse community.

sammyjava commented 3 years ago

FYI, one can use ':memory:' as the filename to store the gffutils SQLite database in memory.

alancleary commented 3 years ago

> FYI, one can use ':memory:' as the filename to store the gffutils SQLite database in memory.

For now I think it would be safer to just use SQLite files on disk. In the future, though, it would be interesting to benchmark disk-based files vs in-memory when bulk loading a collection of GFFs. If there's a significant speed-up with in-memory then we could add a command-line flag to let users with sufficient memory use the feature.