legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Enforce mRNA-only mrna.fna files (with transcript.fna optional) #21

Closed sammyjava closed 2 years ago

sammyjava commented 2 years ago

Some collections have an mrna.fna; others have a transcript.fna. Which shall it be? I'm happy to rename them.

./Bauhinia/variegata/transcriptomes/HK4.tcp1.NMSH/bauva.HK4.tcp1.NMSH.transcript.fna.gz
./Bauhinia/purpurea/transcriptomes/HK2.tcp1.KBWP/baupu.HK2.tcp1.KBWP.transcript.fna.gz
./Bauhinia/tomentosa/transcriptomes/HK3.tcp1.MSDG/bauto.HK3.tcp1.MSDG.transcript.fna.gz
./Bauhinia/blakeana/transcriptomes/HK1.tcp1.CS40/baubl.HK1.tcp1.CS40.transcript.fna.gz
./Lupinus/angustifolius/annotations/Tanjil.gnm1.ann1.nnV9/lupan.Tanjil.gnm1.ann1.nnV9.transcript.fna.gz
./Trifolium/pratense/annotations/MilvusB.gnm2.ann1.DFgp/tripr.MilvusB.gnm2.ann1.DFgp.transcript.fna.gz
./Vigna/unguiculata/annotations/IT97K-499-35.gnm1.ann1.zb5D/vigun.IT97K-499-35.gnm1.ann1.zb5D.transcript.fna.gz
./Vigna/unguiculata/annotations/IT97K-499-35.gnm1.ann2.FD7K/vigun.IT97K-499-35.gnm1.ann2.FD7K.transcript.fna.gz
./Vigna/unguiculata/annotations/CB5-2.gnm1.ann1.0GKC/vigun.CB5-2.gnm1.ann1.0GKC.transcript.fna.gz
./Vigna/unguiculata/annotations/Sanzi.gnm1.ann1.HFH8/vigun.Sanzi.gnm1.ann1.HFH8.transcript.fna.gz
./Vigna/unguiculata/annotations/TZ30.gnm1.ann2.59NL/vigun.TZ30.gnm1.ann2.59NL.transcript.fna.gz
./Vigna/unguiculata/annotations/ZN016.gnm1.ann2.C7YV/vigun.ZN016.gnm1.ann2.C7YV.transcript.fna.gz
./Vigna/unguiculata/annotations/Suvita2.gnm1.ann1.1PF6/vigun.Suvita2.gnm1.ann1.1PF6.transcript.fna.gz
./Vigna/unguiculata/annotations/UCR779.gnm1.ann1.VF6G/vigun.UCR779.gnm1.ann1.VF6G.transcript.fna.gz
./Vigna/angularis/annotations/Gyeongwon.gnm3.ann1.3Nz5/vigan.Gyeongwon.gnm3.ann1.3Nz5.transcript.fna.gz
./Vigna/angularis/annotations/Shumari.gnm1.ann1.8BRS/vigan.Shumari.gnm1.ann1.8BRS.transcript.fna.gz
./Vigna/radiata/annotations/VC1973A.gnm6.ann1.M1Qs/vigra.VC1973A.gnm6.ann1.M1Qs.transcript.fna.gz
./Apios/americana/transcriptomes/LA2155.tcp.BTsx/apiam.LA2155.tcp1.BTsx.transcript.fna.gz
./Cicer/arietinum/annotations/ICC4958.gnm2.ann1.LCVX/cicar.ICC4958.gnm2.ann1.LCVX.transcript.fna.gz
./Cicer/arietinum/annotations/CDCFrontier.gnm1.ann1.nRhs/cicar.CDCFrontier.gnm1.ann1.nRhs.transcript.fna.gz
./Pisum/sativum/annotations/Cameor.gnm1.ann1.7SZR/pissa.Cameor.gnm1.ann1.7SZR.transcript.fna.gz
./Cercis/gigantea/transcriptomes/Sh1.tcp1.9YLH/cergi.Sh1.tcp1.9YLH.transcript.fna.gz
./Phaseolus/lunatus/annotations/G27455.gnm1.ann1.JD7C/phalu.G27455.gnm1.ann1.JD7C.transcript.fna.gz
./Phaseolus/vulgaris/annotations/5-593.gnm1.ann1.3FBJ/phavu.5-593.gnm1.ann1.3FBJ.transcript.fna.gz
./Phaseolus/vulgaris/annotations/UI111.gnm1.ann1.8L4N/phavu.UI111.gnm1.ann1.8L4N.transcript.fna.gz
./Phaseolus/vulgaris/annotations/LaborOvalle.gnm1.ann1.L1DY/phavu.LaborOvalle.gnm1.ann1.L1DY.transcript.fna.gz
./Phaseolus/vulgaris/annotations/G19833.gnm1.ann1.pScz/phavu.G19833.gnm1.ann1.pScz.transcript.fna.gz
./Phaseolus/vulgaris/annotations/G19833.gnm2.ann1.PB8d/phavu.G19833.gnm2.ann1.PB8d.transcript.fna.gz
./Phaseolus/acutifolius/annotations/Frijol_Bayo.gnm1.ann1.ML22/phaac.Frijol_Bayo.gnm1.ann1.ML22.transcript.fna.gz
./Glycine/dolichocarpa/annotations/G1134.gnm1.ann1.4BJM/glydo.G1134.gnm1.ann1.4BJM.transcript.fna.gz
./Glycine/D3-tomentella/annotations/G1403.gnm1.ann1.XNZQ/glyd3.G1403.gnm1.ann1.XNZQ.transcript.fna.gz
./Glycine/syndetika/annotations/G1300.gnm1.ann1.RRK6/glysy.G1300.gnm1.ann1.RRK6.transcript.fna.gz
./Glycine/stenophita/annotations/G1974.gnm1.ann1.F257/glyst.G1974.gnm1.ann1.F257.transcript.fna.gz
./Glycine/falcata/annotations/G1718.gnm1.ann1.2KSV/glyfa.G1718.gnm1.ann1.2KSV.transcript.fna.gz
./Glycine/soja/annotations/W05.gnm1.ann1.T47J/glyso.W05.gnm1.ann1.T47J.transcript.fna.gz
./Glycine/soja/annotations/PI483463.gnm1.ann1.3Q3Q/glyso.PI483463.gnm1.ann1.3Q3Q.transcript.fna.gz
./Glycine/max/annotations/Zh13.gnm1.ann1.8VV3/glyma.Zh13.gnm1.ann1.8VV3.transcript.fna.gz
./Glycine/max/annotations/FiskebyIII.gnm1.ann1.SS25/glyma.FiskebyIII.gnm1.ann1.SS25.transcript.fna.gz
./Glycine/max/annotations/Zh13.gnm2.ann1.FJ3G/glyma.Zh13.gnm2.ann1.FJ3G.transcript.fna.gz
./Glycine/max/annotations/Wm82.gnm2.ann1.RVB6/glyma.Wm82.gnm2.ann1.RVB6.transcript.fna.gz
./Glycine/max/annotations/Wm82.gnm4.ann1.T8TQ/glyma.Wm82.gnm4.ann1.T8TQ.transcript.fna.gz
./Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.transcript.fna.gz
./Glycine/cyrtoloba/annotations/G1267.gnm1.ann1.HRFD/glycy.G1267.gnm1.ann1.HRFD.transcript.fna.gz
./Lotus/japonicus/annotations/MG20.gnm3.ann1.WF9B/lotja.MG20.gnm3.ann1.WF9B.transcript.fna.gz
./Cajanus/cajan/annotations/ICPL87119.gnm1.ann1.Y27M/cajca.ICPL87119.gnm1.ann1.Y27M.transcript.fna.gz
./Medicago/truncatula/annotations/A17.gnm5.ann1_6.L2RX/medtr.A17.gnm5.ann1_6.L2RX.transcript.fna.gz
./Medicago/sativa/annotations/XinJiangDaYe.gnm1.ann1.RKB9/medsa.XinJiangDaYe.gnm1.ann1.RKB9.transcript.fna.gz
./Aeschynomene/evenia/annotations/CIAT22838.gnm1.ann1.ZM3R/aesev.CIAT22838.gnm1.ann1.ZM3R.transcript.fna.gz
./Arachis/ipaensis/annotations/K30076.gnm1.ann1.J37m/araip.K30076.gnm1.ann1.J37m.transcript.fna.gz
./Arachis/duranensis/annotations/V14167.gnm1.ann1.cxSM/aradu.V14167.gnm1.ann1.cxSM.transcript.fna.gz
./Arachis/hypogaea/annotations/Tifrunner.gnm1.ann1.CCJH/arahy.Tifrunner.gnm1.ann1.CCJH.transcript.fna.gz
./Arachis/hypogaea/annotations/Tifrunner.gnm2.ann1.4K0L/arahy.Tifrunner.gnm2.ann1.4K0L.transcript.fna.gz
[shokin@lis ~/v2]$ find . -name '*.mrna.fna.gz'
./Glycine/soja/annotations/F_IGA1003.gnm1.ann1.G61B/glyso.F_IGA1003.gnm1.ann1.G61B.mrna.fna.gz
./Glycine/max/annotations/Hefeng25_IGA1002.gnm1.ann1.320V/glyma.Hefeng25_IGA1002.gnm1.ann1.320V.mrna.fna.gz
./Glycine/max/annotations/Huaxia3_IGA1007.gnm1.ann1.LKC7/glyma.Huaxia3_IGA1007.gnm1.ann1.LKC7.mrna.fna.gz
./Glycine/max/annotations/Jinyuan_IGA1006.gnm1.ann1.2NNX/glyma.Jinyuan_IGA1006.gnm1.ann1.2NNX.mrna.fna.gz
./Glycine/max/annotations/Zh35_IGA1004.gnm1.ann1.RGN6/glyma.Zh35_IGA1004.gnm1.ann1.RGN6.mrna.fna.gz
./Glycine/max/annotations/Zh13_IGA1005.gnm1.ann1.87Z5/glyma.Zh13_IGA1005.gnm1.ann1.87Z5.mrna.fna.gz
./Glycine/max/annotations/Wenfeng7_IGA1001.gnm1.ann1.ZK5W/glyma.Wenfeng7_IGA1001.gnm1.ann1.ZK5W.mrna.fna.gz
./Glycine/max/annotations/Wm82_IGA1008.gnm1.ann1.FGN6/glyma.Wm82_IGA1008.gnm1.ann1.FGN6.mrna.fna.gz
./Medicago/truncatula/annotations/R108_HM340.gnm1.ann1.85YW/medtr.R108_HM340.gnm1.ann1.85YW.mrna.fna.gz
./Medicago/truncatula/annotations/HM058.gnm1.ann1.LXPZ/medtr.HM058.gnm1.ann1.LXPZ.mrna.fna.gz
./Medicago/truncatula/annotations/HM004.gnm1.ann1.2XTB/medtr.HM004.gnm1.ann1.2XTB.mrna.fna.gz
./Medicago/truncatula/annotations/HM324.gnm1.ann1.SQH2/medtr.HM324.gnm1.ann1.SQH2.mrna.fna.gz
./Medicago/truncatula/annotations/HM129.gnm1.ann1.7FTD/medtr.HM129.gnm1.ann1.7FTD.mrna.fna.gz
./Medicago/truncatula/annotations/HM095.gnm1.ann1.55W4/medtr.HM095.gnm1.ann1.55W4.mrna.fna.gz
./Medicago/truncatula/annotations/HM023.gnm1.ann1.WZN8/medtr.HM023.gnm1.ann1.WZN8.mrna.fna.gz
./Medicago/truncatula/annotations/HM056.gnm1.ann1.CHP6/medtr.HM056.gnm1.ann1.CHP6.mrna.fna.gz
./Medicago/truncatula/annotations/HM034.gnm1.ann1.YR6S/medtr.HM034.gnm1.ann1.YR6S.mrna.fna.gz
./Medicago/truncatula/annotations/HM125.gnm1.ann1.KY5W/medtr.HM125.gnm1.ann1.KY5W.mrna.fna.gz
./Medicago/truncatula/annotations/HM050.gnm1.ann1.GWRX/medtr.HM050.gnm1.ann1.GWRX.mrna.fna.gz
./Medicago/truncatula/annotations/HM060.gnm1.ann1.H41P/medtr.HM060.gnm1.ann1.H41P.mrna.fna.gz
./Medicago/truncatula/annotations/HM010.gnm1.ann1.WV9J/medtr.HM010.gnm1.ann1.WV9J.mrna.fna.gz
./Medicago/truncatula/annotations/A17_HM341.gnm4.ann2.G3ZY/medtr.A17_HM341.gnm4.ann2.G3ZY.mrna.fna.gz
./Medicago/truncatula/annotations/HM185.gnm1.ann1.GB3D/medtr.HM185.gnm1.ann1.GB3D.mrna.fna.gz
./Medicago/truncatula/annotations/HM022.gnm1.ann1.6C8N/medtr.HM022.gnm1.ann1.6C8N.mrna.fna.gz

sammyjava commented 2 years ago

FWIW the spec says transcript.fna, but that's easily changed.

adf-ncgr commented 2 years ago

I think we're supposed to follow phytozome naming conventions and use transcript. I've contributed to heterogeneity in this regard (e.g. all those medicago files), but I think transcript is probably better since some of our files likely have noncoding transcripts in them.

sammyjava commented 2 years ago

Works for me, I'll add a task to rename the mrna.fna files (which there are fewer of, so good choice in that regard).

sammyjava commented 2 years ago

Although, I do load them into the mines as the MRNA class, not the Transcript superclass. So those ncRNAs should probably just be in the GFF and not mixed with mRNAs in a FASTA. But no one cares.

adf-ncgr commented 2 years ago

> But no one cares.

I care (but not much). Probably the only reason I mention it is that I think gffread does produce transcript sequences for other transcript classes (though I can't remember if that's just because it grabs exons or because it actually waxes ontological). And I often resort to using gffread when dealing with new annotation sets.

StevenCannon-USDA commented 2 years ago

I'm OK with s/mrna.fna/transcript.fna/

Here are the variations we have in annotation fna files:

ls */*/annotations/*/*fna.gz | cut -f5 -d'/' | perl -pe 's/\./\t/g' | cut -f6 | sort | uniq -c 
  76 cds
   2 cds_low_confidence
  37 cds_primaryTranscript
   1 cds_primaryTranscript_low_confidence
   1 fna
   1 gene_filterTE
   1 gene_main
   1 genes
  24 mrna
   2 mRNA
   1 primaryTranscript
  46 transcript
   1 transcript_low_confidence
   2 transcript_lowqual_or_TE
  31 transcript_primaryTranscript
   1 transcript_primaryTranscript_low_confidence
   1 transcripts
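
For anyone who would rather not chain cut and perl, the same tally can be sketched in Python. This is only an illustration: it assumes, as the pipeline above does, that the type label is the sixth dot-separated field of the file name. The function names `file_type` and `tally_types` are made up for this example, and the three sample paths are taken from the listings earlier in this thread.

```python
from collections import Counter
from pathlib import PurePosixPath


def file_type(path):
    """Return the sixth dot-separated field of the file name (the type label).

    Returns an empty string when the name has fewer than six fields.
    """
    fields = PurePosixPath(path).name.split(".")
    return fields[5] if len(fields) > 5 else ""


def tally_types(paths):
    """Count how many files carry each type label."""
    return Counter(file_type(p) for p in paths)


# Sample paths copied from the listings earlier in the thread.
paths = [
    "Glycine/max/annotations/Wm82.gnm2.ann1.RVB6/glyma.Wm82.gnm2.ann1.RVB6.transcript.fna.gz",
    "Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.transcript.fna.gz",
    "Medicago/truncatula/annotations/HM058.gnm1.ann1.LXPZ/medtr.HM058.gnm1.ann1.LXPZ.mrna.fna.gz",
]
counts = tally_types(paths)
for label, n in sorted(counts.items()):
    print(f"{n:4d} {label}")
```

Like the shell version, this treats "mrna" and "mRNA" as distinct labels, which is what exposes the case inconsistency in the table.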

Relatedly: I wish I had specified just "_primary" rather than "_primaryTranscript", because "Transcript" is gratuitous and potentially confusing. The name followed the Phytozome pattern cds_primaryTranscriptOnly, with "Only" dropped. I've not thought that the change would be worth the cost, but I'll ask your opinions, @adf-ncgr and @sammyjava.

sammyjava commented 2 years ago

Speaking of gffread, I just (finally!) wrote a standalone GFF mine loader, because I've had it with the stock IM loader, which is incompatible with using READMEs for the metadata. I use the Biojava GFFReader, which is pretty straightforward: it just returns a FeatureI with what's on the GFF line. My GFF loader is 200 lines, plus the underlying DatastoreFileConverter that it extends, which handles the README. The stock IM GFF loader, by contrast, is an impossibly complex hairball of 10 classes that still requires you to write two custom handlers.

sammyjava commented 2 years ago

As for FASTA content, be it known @cann0010 that the mine FASTA loader only loads one type of object per file. What is in the file is specified externally: be it MRNA (which is what I use for all transcript.fna files), CDS, Protein. So a FASTA with mixed types is unsupported - they will all be loaded into a single mine class. I could load Transcript rather than MRNA into the mines (MRNA is a subclass of Transcript), but then we'd have mRNAs from the GFFs and Transcripts from the FASTAs with similar identifiers, which seems like a bad idea. (FYI I store the length from the FASTA so it's the actual transcript length, not the length of the full span on the chromosome.)

StevenCannon-USDA commented 2 years ago

@sammyjava - "only loads one type of object per file." What are you recommending then (regarding transcripts vs. mRNAs)? If we want to hew to the SO terminology (GFF column 3), then we would go with mRNA. That would give a less ambiguous correspondence to the GFF. I think the term "transcript" is pretty soft (ill-defined). I take it as basically "splice-variant form of the gene", probably including the UTRs - but as a modifier for "protein" or "cds," it just means "splice-variant." It is also probably the least frequently used data type from the annotation files. It might be used if someone wanted to design primers that spanned a UTR, but not for routine evolutionary analysis.

sammyjava commented 2 years ago

Well, I've always loaded strictly MRNA objects from the file, whatever it is called (just as all sequences from cds.fna are loaded into CDS objects and all sequences from protein.faa are loaded into Protein objects).

So it seems clearer if we have an explicit mrna.fna file in each annotation collection that contains only mRNA sequences; I'll then ignore a separate transcript.fna that may contain other stuff. I'm glad we're having this conversation as I think it's clarifying what is in those files in various collections.

PROPOSAL: every annotation collection must contain an mrna.fna.gz file that contains only mRNA sequences. An additional transcript.fna.gz file is optional and will be ignored in mine loading.
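
A check for this rule could look something like the sketch below. It is not part of any existing validator; `has_required_mrna` and `ignored_for_mine_loading` are hypothetical names, and the check assumes the file names in a collection have already been listed.

```python
def has_required_mrna(filenames):
    """True if the collection contains a file ending in .mrna.fna.gz."""
    return any(name.endswith(".mrna.fna.gz") for name in filenames)


def ignored_for_mine_loading(filenames):
    """Return the optional transcript.fna.gz files, which the mine loader would skip."""
    return [name for name in filenames if name.endswith(".transcript.fna.gz")]


# File names taken from the listings earlier in the thread.
collection = [
    "medtr.HM058.gnm1.ann1.LXPZ.mrna.fna.gz",
]
legacy_collection = [
    "glyma.Lee.gnm1.ann1.6NZV.transcript.fna.gz",
]

print(has_required_mrna(collection))               # collection satisfies the rule
print(has_required_mrna(legacy_collection))        # collection still needs an mrna.fna.gz
print(ignored_for_mine_loading(legacy_collection)) # files the loader would ignore
```

Matching on the full ".mrna.fna.gz" suffix means a primary-only file does not count as satisfying the requirement.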

StevenCannon-USDA commented 2 years ago

I've voted "yea", but now there are some devils in the details:

  1. If we can determine that the current transcript.fna files correspond with mRNA features (I suspect they do - at least for files from JGI), then do we rename these to mrna.fna ?
  2. Do we call these "mrna" or "mRNA"? (unix-wise, I prefer lowercase; biology-wise and SO-wise, I prefer mRNA).
  3. And: if we are going to mess with all of these annotation collections now, then I would also like to propose global s/_primaryTranscript/_primary/ . This would apply to protein, cds, and mrna. (The _primary files are important for calculating pan-gene sets, gene families, etc.) If we go with this simplification, I'll volunteer to make that change. (Not sure that I volunteer for regenerating the mRNA files, as I have some big administrative tasks coming up in the next month).

sammyjava commented 2 years ago

  1. YES
  2. "mrna" for lower-case consistency with everything else (other than strain names and KEY4).
  3. Go for it. And FYI, I load the full FASTAs, not the primary ones; we figured it's nice to have the variants in the mines. (That can be revisited, of course.)

adf-ncgr commented 2 years ago

OK by me. One point of clarification: @sammyjava, correct me if I'm wrong, but I believe that if the mine loader encounters a sequence in an mrna file that doesn't correspond to an mRNA in the GFF file, it will create a new record rather than fail, which is not the end of the world (failure to load would be more doomsday-like). I know we have some transcript files that contain non-mRNA things (e.g. tRNA); we can probably identify such files by comparing the record count in the cds FASTA with that in the corresponding mrna file. But in the likely event we don't get everything right, right away, I guess nothing is going to come to a grinding halt.
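
The count comparison could be sketched roughly as follows. It assumes gzipped FASTA files with one ">" header line per record and one CDS per mRNA; `count_fasta_records` and `surplus_records` are names invented for this example. A surplus of zero does not prove the file is mRNA-only; it is just a cheap first screen.

```python
import gzip


def count_fasta_records(path):
    """Count header lines (those starting with '>') in a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def surplus_records(cds_path, mrna_path):
    """Number of records in the mrna file beyond those in the cds file.

    A positive value suggests the mrna file holds non-coding sequences
    (e.g. tRNA) that have no CDS counterpart.
    """
    return count_fasta_records(mrna_path) - count_fasta_records(cds_path)
```

Collections with a positive surplus would be the ones to inspect by hand before renaming.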

sammyjava commented 2 years ago

Correct, you'd just create an orphan mRNA object that has no chromosome location (and therefore won't show on the JBrowse). The FASTA/GFF objects are merged on (1) being the same class and (2) full-yuck identifier. So in the case of legit objects in both FASTA and GFF, the identifiers match up, and we get a bit from the GFF (location, parent) while the sequence and length come from the FASTA. (The transfer-sequences post-process skips them because they already have sequences from the FASTA.)
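
To make the merge rule concrete, here is a toy model in Python. It is not the InterMine loader, which is Java; the record layout, the `merge_features` name, and the identifiers are all invented for illustration. The only behaviour it mirrors is what is described above: records are keyed on class plus full identifier, GFF supplies location and parent, FASTA supplies sequence and length, and a FASTA-only record becomes an orphan with no location.

```python
def merge_features(gff_records, fasta_records):
    """Merge GFF and FASTA records keyed on (class, full identifier)."""
    merged = {}
    for rec in gff_records:
        key = (rec["cls"], rec["id"])
        merged[key] = {
            "location": rec["location"],
            "parent": rec["parent"],
            "sequence": None,
            "length": None,
        }
    for rec in fasta_records:
        key = (rec["cls"], rec["id"])
        # A FASTA record with no GFF counterpart becomes an orphan (no location).
        entry = merged.setdefault(
            key, {"location": None, "parent": None, "sequence": None, "length": None}
        )
        entry["sequence"] = rec["sequence"]
        entry["length"] = len(rec["sequence"])  # actual transcript length
    return merged


# Hypothetical identifiers, for illustration only.
gff = [{"cls": "MRNA", "id": "sp.acc.gnm1.ann1.gene1.1", "location": "chr1:100-900", "parent": "sp.acc.gnm1.ann1.gene1"}]
fasta = [
    {"cls": "MRNA", "id": "sp.acc.gnm1.ann1.gene1.1", "sequence": "ATGGCC"},
    {"cls": "MRNA", "id": "sp.acc.gnm1.ann1.trna7.1", "sequence": "GGCA"},
]
result = merge_features(gff, fasta)
orphans = [key for key, entry in result.items() if entry["location"] is None]
```

In this toy run the second FASTA record ends up in `orphans`, which is the case where a non-mRNA sequence sits in an mrna file.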

sammyjava commented 2 years ago

Steven, shall I task you with the s/transcript/mrna/ and s/_primaryTranscript/_primary/ tasks since there may be a bit of content-verification involved? Otherwise I'm happy to just do the renaming, that's easy and changes nothing other than the checksum.
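
For reference, the two substitutions could be applied to file names along these lines. This is a sketch, not the script that was actually run: `renamed` is a hypothetical helper, and it assumes the type label is the sixth dot-separated field, so that strain names and KEY4 values are never touched. Only the label "transcript" (alone or followed by an underscore) is mapped to "mrna"; odd labels such as "transcripts" are left for manual review.

```python
import re


def renamed(filename):
    """Apply s/_primaryTranscript/_primary/ and s/transcript/mrna/ to the type label only."""
    fields = filename.split(".")
    if len(fields) < 6:
        return filename  # no type label in the expected position; leave unchanged
    label = fields[5].replace("_primaryTranscript", "_primary")
    # Rename a leading "transcript" only when it is the whole label
    # or is followed by an underscore.
    label = re.sub(r"^transcript(?=_|$)", "mrna", label)
    fields[5] = label
    return ".".join(fields)


# The first name is from the listing above; the other two reuse its prefix
# with labels from the tally and are illustrative only.
for name in [
    "glyma.Wm82.gnm2.ann1.RVB6.transcript.fna.gz",
    "glyma.Wm82.gnm2.ann1.RVB6.transcript_primaryTranscript.fna.gz",
    "glyma.Wm82.gnm2.ann1.RVB6.cds_primaryTranscript.fna.gz",
]:
    print(name, "->", renamed(name))
```

Because only the file name changes, the contents stay identical; the entries in the CHECKSUM and MANIFEST files are what need updating afterwards.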

sammyjava commented 2 years ago

Also, just so it's known to all: I do load all RNA types from the GFFs (tRNA, ncRNA, miRNA, you name it). But the GFF loader loads each record into an InterMine object of the class associated with that type. And then, since non-mRNA objects aren't loaded from FASTAs, the sequences are derived from the chromosomal span (and strand) in the transfer-sequences post-processor. So if we were keen to load, say, ncRNA sequences from FASTAs, we certainly can do that by putting them in their own file. It's just that the FASTA loader loads one class per file.

StevenCannon-USDA commented 2 years ago

Yes, I can do that renaming. An associated task is to update the two MANIFEST files. These don't currently have great utility, because they haven't been produced with great rigor; on the other hand, they do record prior filenames, which I think is important for provenance. For this minor renaming, I think I'll just change the "current filename" in the first field - as opposed to adding another "previous filename" field. One more thing: I think we ought to notify the group (via lis-developers) of the changes. I'll do that now.

nathanweeks commented 2 years ago

> Yes, I can do that renaming. An associated task is to update the two MANIFEST files.

The CHECKSUM files need to be updated as well.

StevenCannon-USDA commented 2 years ago

CHECKSUM updates - yes. Will do that later today, after some more changes and checks.
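
A minimal way to regenerate the checksum lines after the renaming is sketched below. The hash algorithm and line layout are assumptions: this thread does not say what the CHECKSUM files contain, so the sketch uses MD5 and the common "digest, two spaces, file name" layout of md5sum output. `md5_of_file` and `checksum_lines` are hypothetical names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(paths):
    """Build md5sum-style lines ("<digest>  <path>") for the given files, sorted by path."""
    return [f"{md5_of_file(p)}  {p}" for p in sorted(paths)]
```

Since a rename leaves the file bytes untouched, the digests themselves should not change; only the file names on each line do.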

sammyjava commented 2 years ago

This was done, closing.