Closed: alancleary closed this issue 4 years ago
Ohhh I have to write it in Python. I'm a bad choice, my Python skills are next to zero. I thought I got to choose the language.
If it were a microservice you were going to take ownership of then you would get to choose the language. However, the language of this project for all things not user interface or microservice related is Python.
Are you sure you don't want to take a stab at it? I'm pretty sure you could use a Biopython library for the GFF parsing and the Chado loader should show you how the data needs to be handled.
Sure, I'll take a stab at it in Python.
I feel like I'm bearing witness to the blossoming of a benevolent tyranny. I suspect the tyrant is having his benevolent revenge on certain perl scripts he was once forced to write in support of such tasks back in the day. PEP8 or PERISH!
Nah, no underlying tyrannical agenda, I just haven't written a ton of Python, and originally thought I was going to write it in Java. All grievances with Perl were put to rest by the Revolution (and subsequent Civil War).
@sammyjava @adf-ncgr The Redis schema is good to go. We should discuss what structure both the Chado and GFF loaders can put data into for the Redis loading functions and then I'll encapsulate/update that code so it can be used by both loaders.
Since the lack of a loader is blocking #11, if it's much more expedient for @sammyjava to bang out an initial loader in Java, it should be straightforward to containerize and add as a service (say, "load_gff") in the Docker Compose file that @alancleary added... and we can translate later if need be (IMHO)
I'm fine with Python, Java, whatever, if there's a reason to go with Python. If it absolutely does not matter, then my preference would be Java, but it sounds like Python makes more sense.
As far as the master branch of the main GCV repo is concerned, the Docker compose files and microservices still work. So unless @adf-ncgr or some other authority within the project is itching to get microservices epic out the door, I think we should just do it in Python. That way if I decide to change the Redis schema in the near future only the Redis loader that the GFF and Chado loaders share will need to be modified, i.e. Sam doesn't have to learn the underlying Redis schema or take ownership of the GFF loader.
I am not itching yet, and the benevolence of your dictatorship still leads us on...
Doing it the "right" way (in Python) the first time is fine by me. I just wasn't sure if we were doing the equivalent of asking James Gosling to go learn Python before contributing loader scripts.
I'm getting on this. Just to confirm, @alancleary: write gff_to_redisearch.py, which mimics chado_to_redisearch.py by loading chromosomes (transferChromosomes) and genes (transferGenes) from a GFF, both run in transferData. You don't need sequences, so I don't need the genomic FASTA for the chromosomes.
We do need to know how long the chromosomes (or supercontigs) are, though. Most of our gffs don't encode that info, methinks. Could be gotten from the fai files.
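Pulling them from a .fai would be easy enough; the first two tab-separated columns of a samtools faidx index are sequence name and length. A rough sketch (the function name and sample lines are made up):

```python
# Sketch: recover sequence lengths from samtools faidx ".fai" index lines.
# The first two tab-separated columns are sequence name and length;
# the remaining columns (offset, linebases, linewidth) are ignored.

def lengths_from_fai(lines):
    """Map sequence name -> length from .fai index lines."""
    lengths = {}
    for line in lines:
        if not line.strip():
            continue
        name, length = line.split('\t')[:2]
        lengths[name] = int(length)
    return lengths

# Illustrative sample lines (offsets are placeholders).
fai = [
    "phavu.G19833.gnm2.Chr01\t51433939\t112\t60\t61",
    "phavu.G19833.gnm2.Chr02\t49670989\t52291423\t60\t61",
]
print(lengths_from_fai(fai))
```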
@sammyjava That's correct. I still intend to separate the code that puts data into Redis from the code that parses/loads the data from wherever. Ideally the former will be shared by the two loading scripts so maintenance is easy. For now let's just get GFF loading running and we'll worry about convergence later.
We do need to know how long the chromosomes (or supercontigs) are, though. Most of our gffs don't encode that info, methinks. Could be gotten from the fai files.
Well, they should. I say we add the chromosomes to the GFFs. That's pretty commonly done, both through the headers and regular lines. I can handle that. To wit:
##gff-version 3
##sequence-region 1 1 307041717
##sequence-region 10 1 150982314
##sequence-region 2 1 244442276
##sequence-region 3 1 235667834
##sequence-region 4 1 246994605
##sequence-region 5 1 223902240
##sequence-region 6 1 174033170
##sequence-region 7 1 182381542
##sequence-region 8 1 181122637
##sequence-region 9 1 159769782
##sequence-region B73V4_ctg1 1 50531
...
1 wareLab chromosome 1 307041717 . . . ID=chromosome:1
10 wareLab chromosome 1 150982314 . . . ID=chromosome:10
2 wareLab chromosome 1 244442276 . . . ID=chromosome:2
3 wareLab chromosome 1 235667834 . . . ID=chromosome:3
4 wareLab chromosome 1 246994605 . . . ID=chromosome:4
5 wareLab chromosome 1 223902240 . . . ID=chromosome:5
6 wareLab chromosome 1 174033170 . . . ID=chromosome:6
7 wareLab chromosome 1 182381542 . . . ID=chromosome:7
8 wareLab chromosome 1 181122637 . . . ID=chromosome:8
9 wareLab chromosome 1 159769782 . . . ID=chromosome:9
Mt wareLab chromosome 1 569630 . . . ID=chromosome:Mt;Alias=AY506529.1;Is_circular=true
Pt wareLab chromosome 1 140384 . . . ID=chromosome:Pt;Alias=X86563.2;Is_circular=true
B73V4_ctg1 wareLab contig 1 50531 . . . ID=contig:B73V4_ctg1
I guess we should pick between the terms "contig" and "supercontig" in the GFF records, since we need to distinguish them from chromosomes. I'd lean toward "contig" since they can be any size, and that's what Doreen Ware does, and that's good enough for me.
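Incidentally, either representation is trivial to consume; e.g., a sketch that takes lengths straight from the ##sequence-region pragmas (illustrative only, not the actual loader code):

```python
# Sketch: read sequence lengths from GFF3 "##sequence-region" pragmas
# (format: ##sequence-region <seqid> <start> <end>).

def lengths_from_pragmas(lines):
    """Map seqid -> length from GFF3 header pragma lines."""
    lengths = {}
    for line in lines:
        if line.startswith('##sequence-region'):
            _, seqid, start, end = line.split()
            lengths[seqid] = int(end) - int(start) + 1
    return lengths

header = [
    "##gff-version 3",
    "##sequence-region 1 1 307041717",
    "##sequence-region B73V4_ctg1 1 50531",
]
print(lengths_from_pragmas(header))
```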
Agree: it is often useful to have the chromosome (and scaffold) sizes. We've been adding those in VCF files; might as well in the GFFs too. Regarding "contig" vs. "scaffold" or "supercontig": for most assemblies we deal with, "scaffold" is the most common descriptor (scaffold usually being composed of contigs). We occasionally see "contig" in small leftover stuff that scaffolders weren't able to pull into a larger unit. Unless it causes problems, I would try to stick with the terms we see in the assemblies - i.e. chromosome, scaffold, (occasionally) contig.
I think we were previously leaning towards putting per-sequence metadata like type and original names into a separate file à la NCBI's assembly reports. I don't think we ought to make new requirements/changes to our gffs without an RFO. For example, a browser that was feeding directly off said gffs might end up getting a big bar spanning each reference sequence (unless we did this as "headers" rather than features); likely no lives would be lost, but still... and in any case, our censors should be notified of the new requirements if they exist.
VCF spec requires the refseqs in the header, so it's a somewhat different kettle of beans, methinks.
It's a spec update either way. I'm fine with creating new GFFs, simpler, which should be in the assembly, not annotation, directories alongside the FASTAs. Shall I create said spec in the usual place? Here's an example: Phaseolus_vulgaris/G19833.gnm2.fC0g/phavu.G19833.gnm2.fC0g.genome_main.gff3
##gff-version 3
phavu.G19833.gnm2.Chr01 LIS chromosome 1 51433939 . . . ID=phavu.G19833.gnm2.Chr01
phavu.G19833.gnm2.Chr02 LIS chromosome 1 49670989 . . . ID=phavu.G19833.gnm2.Chr02
phavu.G19833.gnm2.Chr03 LIS chromosome 1 53438756 . . . ID=phavu.G19833.gnm2.Chr03
phavu.G19833.gnm2.Chr04 LIS chromosome 1 48048378 . . . ID=phavu.G19833.gnm2.Chr04
phavu.G19833.gnm2.Chr05 LIS chromosome 1 40923498 . . . ID=phavu.G19833.gnm2.Chr05
phavu.G19833.gnm2.Chr06 LIS chromosome 1 31236378 . . . ID=phavu.G19833.gnm2.Chr06
phavu.G19833.gnm2.Chr07 LIS chromosome 1 40041001 . . . ID=phavu.G19833.gnm2.Chr07
phavu.G19833.gnm2.Chr08 LIS chromosome 1 63048260 . . . ID=phavu.G19833.gnm2.Chr08
phavu.G19833.gnm2.Chr09 LIS chromosome 1 38250102 . . . ID=phavu.G19833.gnm2.Chr09
phavu.G19833.gnm2.Chr10 LIS chromosome 1 44302882 . . . ID=phavu.G19833.gnm2.Chr10
phavu.G19833.gnm2.Chr11 LIS chromosome 1 53580169 . . . ID=phavu.G19833.gnm2.Chr11
phavu.G19833.gnm2.scaffold_12 LIS scaffold 1 1160169 . . . ID=phavu.G19833.gnm2.scaffold_12
phavu.G19833.gnm2.scaffold_13 LIS scaffold 1 1050217 . . . ID=phavu.G19833.gnm2.scaffold_13
phavu.G19833.gnm2.scaffold_14 LIS scaffold 1 837347 . . . ID=phavu.G19833.gnm2.scaffold_14
...
works for me; coincidentally, that's exactly how I go about loading minimalist chadoi for GCV purposes.
@adf-ncgr - hmm. Good points & reminder. I suppose yes: a significant new requirement for the GFFs shouldn't be undertaken lightly. For example, the extra stuff couldn't just be tacked on to the front, if we require them to be sorted. Looking through our current assembly files: I find no distinct GFFs for the assembly sizes - even though several applications use those (at least GBrowse and GCViT). Filename proposal: gensp.accsn.gnm#.KEY4.chr_scaff_sizes.gff3
I have put my little molecule-size-calculating script at /usr/local/www/data/about_the_data_store/scripts/chrlen_to_gff.sh
Usage:
chrlen_to_gff.sh *genome_main.fna glyma.Lee.gnm1 > gensp.accsn.gnm#.KEY4.chr_scaff_sizes.gff3
I looked at the SO Wiki re. landmark features, and no specific guidance is given, other than to skip the ##sequence-region pragma as being "poor practice": http://www.sequenceontology.org/so_wiki/index.php/Landmark_features
However, I will note that in LIS chado, we store them exclusively as "supercontig", which is also what they're called in the mines. So I recommend we call the non-chromosome feature "supercontig" for LIS consistency. We're using these GFFs for our own purposes, they don't need to reflect what the originators called them, which can be a variety of things, which gets confusing.
lis-chado=> select count(*) from feature where type_id =43174; *supercontig*
count
--------
263674
(1 row)
lis-chado=> select count(*) from feature where type_id =42961; *contig*
count
-------
0
(1 row)
lis-chado=> select count(*) from feature where type_id =48866; *scaffold*
count
-------
0
(1 row)
@cann0010 why not simply gensp.accsn.gnm.KEY4.genome_main.gff3? They are GFFs derived from the corresponding FASTA, no added information. chr_scaff_sizes just adds an extra nugget that doesn't seem to add any information to me. It shouldn't be confused with the annotation GFF because (1) it's in a different directory and (2) it doesn't contain ann.
@sammyjava OK :-) gensp.accsn.gnm.KEY4.genome_main.gff3 ... and supercontig
I agree with @sammyjava's proposal for naming; as far as SO terms for types are concerned, I should note that I have recently started using mitochondrial_chromosome and chloroplast_chromosome, because I aspire to biological semantic purity (but don't think scaffold vs. contig vs. whatever qualifies as biological...)
Hrm, @adf-ncgr usually loaders treat non-nuclear as "chromosome", and GFFs typically label them as such, to wit:
Mt wareLab chromosome 1 569630 . . . ID=chromosome:Mt;Alias=AY506529.1;Is_circular=true
Pt wareLab chromosome 1 140384 . . . ID=chromosome:Pt;Alias=X86563.2;Is_circular=true
or
Mt TAIR10 chromosome 1 366924 . . . ID=chromosome:Mt;Alias=Y08501.2,NC_001284.2;Is_circular=true
Pt TAIR10 chromosome 1 154478 . . . ID=chromosome:Pt;Alias=AP000423.1,NC_000932.1;Is_circular=true
which I guess may be an Ensembl standard since these are the same. Which is a good standard, IMO. By using the child terms you may risk loaders dropping them, or having to code in two extra feature types to pull from GFFs. Although I applaud your ontological exactness.
chromosome (SO:0000340)
-- chloroplast_chromosome (SO:0000820)
-- mitochondrial_chromosome (SO:0000819)
@sammyjava I might argue that they are sufficiently different from nuclear chromosomes that some applications would want to ignore them. But, I don't feel super-strongly about this issue, so let's hear what @cann0010 thinks before coming to a final decision
I'll spec it anyway, which requires a papal blessing.
I didn't expect the Soybase Inquisition! ;)
I don't think we have any cases of the mitochondrial or chloroplast genomes being included (intentionally) in genome_main. I think Phytozome usually splits them out into separate files. So, maybe moot - but I suppose could be spec'd "in case."
pretty sure we have a couple (though maybe not from Phytozome); Lupinus albus and Medicago truncatula come to mind. NCBI adds them into the RefSeq representation even when excluded from the base assembly chosen to be the RefSeq representative version.
Yeah, I've spec'ed it to use "chromosome" for those until otherwise overridden. Here's an example that @adf-ncgr mentions:
>lupal.Amiga.gnm1.Lalb_Chr01 len=23521228
>lupal.Amiga.gnm1.Lalb_Chr02 len=17964110
...
>lupal.Amiga.gnm1.Lalb_Chr25 len=16405666
>lupal.Amiga.gnm1.Lalb_Chr00c01 len=1566442
...
>lupal.Amiga.gnm1.Lalb_Chr00c64 len=17842
>lupal.Amiga.gnm1.Lalb_chloroplast_genome
>lupal.Amiga.gnm1.Lalb_mitochondrion
@alancleary I see that your Redis schema includes only name, length, genus, and species:
fields = [
redisearch.TextField('name'),
redisearch.NumericField('length'),
redisearch.TextField('genus'),
redisearch.TextField('species'),
]
while in LIS we have accessions/strains, assembly versions (gnm), and annotation versions (ann). I could see storing only a single assembly+annotation version per strain, but we certainly have many strains per species.
Do you want me to concatenate species_strain (vulgaris_G19833) as done in chado, or shall we begin storing strains as an additional field? And skip recording the assembly and annotation versions? I'm fine with either, but the chado thing of species_strain is kinda kludgey.
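For reference, those four fields boil down to a raw FT.CREATE command roughly like the one assembled below (redisearch-py's create_index issues the equivalent; this sketch only builds the command rather than talking to Redis, and the index name chromosomeIdx is the one mentioned later in this thread):

```python
# Sketch: the raw RediSearch command corresponding to the schema above.
# Built as data only; no Redis connection is made here.

fields = [
    ('name', 'TEXT'),
    ('length', 'NUMERIC'),
    ('genus', 'TEXT'),
    ('species', 'TEXT'),
]

cmd = ['FT.CREATE', 'chromosomeIdx', 'SCHEMA']
for fname, ftype in fields:
    cmd += [fname, ftype]
print(' '.join(cmd))
# FT.CREATE chromosomeIdx SCHEMA name TEXT length NUMERIC genus TEXT species TEXT
```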
agreed that what we did in chado is a kludge (to chado's credit, there is an updated version of schema that contains an actual field for "infraspecies" or some such moniker).
whatever we do (including with genus and species) we should think about how it will be encoded in the source files (and make this "friendly" to potential adopters of GCV who aren't necessarily into our metadata/naming conventions). The gff3 spec has some "pragmas" about some of this, but I'm not sure that's necessarily the best approach.
@alancleary OH we need gene families, of course. That's a third file. Just making a note of it here. The "GFF" loader loads three files: chromosomes (GFF), genes (GFF), and gene family assignments (GFA).
@alancleary hit a snag already. I've installed all the requirements via pip3; the only one already on my system is six, 1.14.0 rather than 1.15.0. Oh, and it's python 3.8.2.
shokin@morangie:~/gcv-microservices/database$ ./chado_to_redisearch.py
File "./chado_to_redisearch.py", line 39
parser.add_argument('--pdb', action=EnvArg, envvar=pdb_envvar, type=str, default='chado', help=f'The PostgreSQL database (can also be specified using the {pdb_envvar} environment variable).')
^
SyntaxError: invalid syntax
are you sure the python in your PATH is 3.8.2? the script seems to be set up to use whatever you have:
probably expecting it to be in a virtualenv; try invoking it explicitly with the python3 interpreter and see what happens
Sorry for the delay.
I'm not exactly crazy about changing the schema at the moment so for the time being why don't we encode that sort of info in the name as is being done by LIS. Ideally that'll allow LIS to use the GFF loader rather than the Chado loader and GCV will still be able to inter-operate with other tools in LIS using names. However, I think the proper formatting of the names should be left as a task for the user since not everyone in the world is going to have/want the same naming scheme as LIS.
Regarding dependencies, you should use Python virtual environments when developing in Python so you don't have to deal with system packages conflicting with the dependencies of whatever you're working on.
Got it. A bit confused by the choice of pipenv vs. virtualenv -- which do you use? Is there a way to automatically install the requirements, or is requirements.txt just a list for me to install manually? (I had to manually pip install a few, but not all.) I've added gffutils==0.10.1 and biopython==1.78 to database/requirements.txt.
FYI.
(gcv-microservices) shokin@morangie:~/gcv-microservices$ pip list
Package Version
----------- ---------
argcomplete 1.12.1
argh 0.26.2
biopython 1.78
certifi 2020.6.20
chardet 3.0.4
gffutils 0.10.1
hiredis 1.1.0
idna 2.10
numpy 1.19.3
pip 20.2.4
psycopg2 2.8.6
pyfaidx 0.5.9.1
redis 3.5.3
redisearch 2.0.0
requests 2.24.0
rmtest 0.7.0
setuptools 50.3.2
simplejson 3.17.2
six 1.15.0
urllib3 1.25.11
wheel 0.35.1
I'm not exactly crazy about changing the schema at the moment so for the time being why don't we encode that sort of info in the name as is being done by LIS. Ideally that'll allow LIS to use the GFF loader rather than the Chado loader and GCV will still be able to inter-operate with other tools in LIS using names. However, I think the proper formatting of the names should be left as a task for the user since not everyone in the world is going to have/want the same naming scheme as LIS.
So by that I presume you mean add them as command-line parameters: --genus=Phaseolus --species=vulgaris --strain=G19833. Works for me: genus and species required, strain optional.
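Something like this argparse sketch, I presume (the loaders also support environment-variable overrides via a custom EnvArg action, which I've omitted here for brevity):

```python
# Sketch: required --genus/--species plus optional --strain, as proposed.
import argparse

parser = argparse.ArgumentParser(description='Load a GFF into RediSearch.')
parser.add_argument('--genus', type=str, required=True,
                    help='Genus, e.g. Phaseolus.')
parser.add_argument('--species', type=str, required=True,
                    help='Species, e.g. vulgaris.')
parser.add_argument('--strain', type=str, default=None,
                    help='Optional strain, e.g. G19833.')

# Parse an example command line instead of sys.argv for illustration.
args = parser.parse_args(['--genus=Phaseolus', '--species=vulgaris',
                          '--strain=G19833'])
print(args.genus, args.species, args.strain)
```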
Whoops. That's not the article I meant to link. Anyway, I use virtualenv:
$ virtualenv venv
$ . ./venv/bin/activate
(venv) $ pip install -r requirements.txt
If you want to update what's in the requirements file:
(venv) $ pip freeze > requirements.txt
I must've misunderstood what you were saying earlier about genus/species/strain. If they're not provided in the GFF then I would add them as command-line parameters/arguments as you suggested, though strain currently isn't encoded in the schema and I'm not worried about it for the time being.
Thanks for the virtualenv tips.
I must've misunderstood what you were saying earlier about genus/species/strain. If they're not provided in the GFF then I would add them as command-line parameters/arguments as you suggested, though strain currently isn't encoded in the schema and I'm not worried about it for the time being.
OK, I'll leave strain out for now, then.
However, the LIS datastore has many strains per species. We need to incorporate strain somehow in the schema. (And perhaps even genome and assembly versions, since we have multiple of those per strain in a few cases.)
Here is what's currently in LegumeMine (just to hammer the point home):
genus | species | strain
-----------+---------------+--------------
Arachis | duranensis | V14167
Arachis | hypogaea | Tifrunner
Arachis | ipaensis | K30076
Cajanus | cajan | ICPL87119
Cicer | arietinum | CDCFrontier
Cicer | arietinum | ICC4958
Glycine | max | Lee
Glycine | max | Wm82
Glycine | max | Zh13
Glycine | soja | PI483463
Glycine | soja | W05
Lotus | japonicus | MG20
Lupinus | albus | Amiga
Lupinus | angustifolius | Tanjil
Medicago | sativa | XinJiangDaYe
Medicago | truncatula | A17_HM341
Medicago | truncatula | jemalong_A17
Phaseolus | lunatus | G27455
Phaseolus | vulgaris | G19833
Phaseolus | vulgaris | UI111
Trifolium | pratense | MilvusB
Vigna | angularis | Gyeongwon
Vigna | angularis | Shumari
Vigna | radiata | VC1973A
Vigna | unguiculata | CB5-2
Vigna | unguiculata | IT97K-499-35
Vigna | unguiculata | Sanzi
Vigna | unguiculata | Suvita2
Vigna | unguiculata | TZ30
Vigna | unguiculata | UCR779
Vigna | unguiculata | Xiabao_II
Vigna | unguiculata | ZN016
@alancleary could you give me the commands (FT.CREATE, etc.) you use to initialize the Redis indexes with the appropriate parameters? I need to create the chromosomeIdx, etc. and I don't see any scripts for doing that in the repo. Thanks!
Nevermind, I see that I had to run it twice. Or I may have had the relevant line commented out. It's there now in any case.
Actually that brings up a point. Your script loads the entire chado database in one swell foop. Which makes sense since the chado db is a single thing. But the GFF/GFA loader will be dealing with three files for each species. So I think we should make this loader append to the indexes rather than recreate them. With a flag to clear them if desired. I'll keep working on loading a single set of files, but that will just place a single species in the Redis. Then I'll switch to append functionality, with the optional clear.
append option makes good sense here. WRT strain, how about we add the cmdline option for now and just implement it as the same hack we did with chado until the schema supports it as a thing.
Appending is on my to-do list. In fact, there's already a couple comments about it in the Chado loader:
# TODO: there should be an extend argparse flag that prevents deletion
Currently if the loader sees the keys it needs are already in the DB then it nukes them and starts over, unless you gave the --no-reload flag, in which case it just fails.
It probably makes sense to write the GFF loader to only load one set of files and then write a bash script to load an entire directory by repeatedly calling the loader.
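Something along these lines is what I have in mind (the --append flag, the filename pattern, and the directory layout are hypothetical placeholders; the real invocations will follow whatever interface we settle on):

```python
# Sketch: a driver that calls the single-collection loader once per GFF.
# The "--append" flag and directory layout are hypothetical placeholders.
import tempfile
from pathlib import Path

def loader_commands(root):
    """Build one gff_to_redisearch.py invocation per GFF under root."""
    cmds = []
    for gff in sorted(Path(root).glob('*/*.gff3')):
        cmds.append(['./gff_to_redisearch.py', '--append', str(gff)])
    return cmds

# Demo against a throwaway directory tree.
root = tempfile.mkdtemp()
for name in ('phavu.G19833.gnm2.ann1', 'vigun.IT97K.gnm1.ann1'):
    d = Path(root) / name
    d.mkdir()
    (d / (name + '.gff3')).touch()
print(len(loader_commands(root)))  # 2
```

In practice each command list would be handed to subprocess.run, or the same loop could be a few lines of bash.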
It probably makes sense to write the GFF loader to only load one set of files and then write a bash script to load an entire directory by repeatedly calling the loader.
Exactly, that was my plan.
As for strain, I've dropped it for now, easy to implement if we need it. chado only has those unusual cases using the underscore, so I don't think it's really a Thing That Matters yet. When we add strain to the schema I'll add it as a command line option.
Not surprisingly, the thing I've run up against now is gffutils disliking a line in the LIS annotation GFF I'm loading. Shocking, I know. There are lots of options for the loader to deal with stuff it doesn't like, so I'll get going on that Monday, and then code index append mode and then it should be good to go. I'll test it against all the GFFs I load into the mines, since I expect more complaints to be forthcoming. LIS datastore files are a wonderfully diverse community.
FYI, one can use ':memory:' as the filename to store the gffutils SQLite database in memory.
FYI, one can use ':memory:' as the filename to store the gffutils SQLite database in memory.
For now I think it would be safer to just use SQLite files on disk. In the future, though, it would be interesting to benchmark disk-based files vs in-memory when bulk loading a collection of GFFs. If there's a significant speed-up with in-memory then we could add a command-line flag to let users with sufficient memory use the feature.
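For the record, gffutils keeps its feature database in SQLite, so dbfn=':memory:' vs. a file path is exactly SQLite's own in-memory vs. on-disk modes. A stdlib-only illustration of the switch (no gffutils required; the table here is a stand-in, not gffutils's actual schema):

```python
# Sketch: SQLite's ':memory:' vs. on-disk modes, the mechanism behind
# gffutils's dbfn parameter. Table layout is illustrative only.
import os
import sqlite3
import tempfile

def make_db(dbfn):
    """Create a toy feature table in the given SQLite database."""
    con = sqlite3.connect(dbfn)
    con.execute('CREATE TABLE features (id TEXT, start INT, end INT)')
    con.execute("INSERT INTO features VALUES ('gene1', 1, 1000)")
    return con

mem = make_db(':memory:')  # lives only in RAM, gone when the process exits
path = os.path.join(tempfile.mkdtemp(), 'features.db')
disk = make_db(path)       # persisted on disk, reusable across runs
print(mem.execute('SELECT COUNT(*) FROM features').fetchone()[0])  # 1
```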
As part of the move away from Chado via microservices we need to implement a script that loads data into the GCV Redis database directly from GFF files. There is already a script that loads data from Chado. This will be maintained for migration and adoptability purposes. As such, I'll isolate the part of that script that actually puts data into Redis into its own module so both scripts can use the code. This will make updating the loading scripts when the Redis schema changes a matter of changing one file.
The GFF loader should be written in Python, use spaces instead of tabs, and conform to the PEP8 code style guide. Also, use of the loader should be documented in the README.md file of the scripts directory.
This issue is blocked by issue #9 and shouldn't be started until #9 is closed.