Closed sammyjava closed 2 years ago
FWIW the spec says transcript.fna, but that's easily changed.
I think we're supposed to follow phytozome naming conventions and use transcript. I've contributed to heterogeneity in this regard (e.g. all those medicago files), but I think transcript is probably better since some of our files likely have noncoding transcripts in them.
Works for me, I'll add a task to rename the mrna.fna files (which there are fewer of, so good choice in that regard).
Although, I do load them into the mines as the MRNA class, not the Transcript superclass. So those ncRNAs should probably just be in the GFF and not mixed with mRNAs in a FASTA. But no one cares.
I care (but not much). Probably the only reason I mention it is that I think gffread does produce transcript sequences for other transcript classes (though I can't remember if that's just because it grabs exons or because it actually waxes ontological). And I often resort to using gffread when dealing with new annotation sets.
I'm OK with s/mrna.fna/transcript.fna/
Here are the variations we have in annotation fna files:
ls */*/annotations/*/*fna.gz | cut -f5 -d'/' | perl -pe 's/\./\t/g' | cut -f6 | sort | uniq -c
76 cds
2 cds_low_confidence
37 cds_primaryTranscript
1 cds_primaryTranscript_low_confidence
1 fna
1 gene_filterTE
1 gene_main
1 genes
24 mrna
2 mRNA
1 primaryTranscript
46 transcript
1 transcript_low_confidence
2 transcript_lowqual_or_TE
31 transcript_primaryTranscript
1 transcript_primaryTranscript_low_confidence
1 transcripts
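For anyone who would rather not chain cut and perl, the same tally can be sketched in Python. This is only a sketch: like the pipeline above, it assumes the file type sits in the sixth dot-separated field of the file name, and the function name `tally_types` is made up for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_types(paths):
    """Count the sixth dot-separated field of each file name."""
    counts = Counter()
    for path in paths:
        fields = PurePosixPath(path).name.split(".")
        # Skip names that are too short to have a sixth field.
        if len(fields) >= 6:
            counts[fields[5]] += 1
    return counts
```

Feeding it the output of `glob.glob("*/*/annotations/*/*fna.gz")` should give counts comparable to the listing above.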
Relatedly: I wish I had specified just "_primary" rather than "_primaryTranscript", because "Transcript" is gratuitous and potentially confusing. This followed the Phytozome pattern (cds_primaryTranscriptOnly), dropping "Only". I've not thought that the change would be worth the cost, but I'll ask your opinions, @adf-ncgr and @sammyjava.
Speaking of gffread, I just (finally!) wrote a standalone GFF mine loader because I've finally had it with the stock IM loader which is incompatible with using READMEs for the metadata. I use the Biojava GFFReader which is pretty straightforward, just returns a FeatureI with what's on the GFF line. My GFF loader is 200 lines, plus the underlying DatastoreFileConverter that it extends that handles the README, while the stock IM GFF loader is an impossibly complex hairball of 10 classes which still requires that you write two custom handlers.
As for FASTA content, be it known @cann0010 that the mine FASTA loader only loads one type of object per file. What is in the file is specified externally: be it MRNA (which is what I use for all transcript.fna files), CDS, Protein. So a FASTA with mixed types is unsupported - they will all be loaded into a single mine class. I could load Transcript rather than MRNA into the mines (MRNA is a subclass of Transcript), but then we'd have mRNAs from the GFFs and Transcripts from the FASTAs with similar identifiers, which seems like a bad idea. (FYI I store the length from the FASTA so it's the actual transcript length, not the length of the full span on the chromosome.)
@sammyjava - "only loads one type of object per file." What are you recommending then (regarding transcripts vs. mRNAs)? If we want to hew to the SO terminology (GFF column 3), then we would go with mRNA. That would give a less ambiguous correspondence to the GFF. I think the term "transcript" is pretty soft (ill-defined). I take it as basically "splice-variant form of the gene", probably including the UTRs - but as a modifier for "protein" or "cds," it just means "splice-variant." It is also probably the least frequently used data type from the annotation files. It might be used if someone wanted to design primers that spanned a UTR, but not for routine evolutionary analysis.
Well, I've always loaded strictly MRNA objects from the file, whatever it is called (just as all sequences from cds.fna are loaded into CDS objects and all sequences from protein.faa are loaded into Protein objects).
So it seems clearer if we have an explicit mrna.fna file in each annotation collection that contains only mRNA sequences; I'll then ignore a separate transcript.fna that may contain other stuff. I'm glad we're having this conversation as I think it's clarifying what is in those files in various collections.
PROPOSAL: every annotation collection must contain an mrna.fna.gz file that contains only mRNA sequences. An additional transcript.fna.gz file is optional and will be ignored in mine loading.
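A quick way to see which collections would not yet satisfy the proposal might look like the sketch below. The input shape (a dict of collection name to file names) and the function name are assumptions for illustration, not anything the datastore tooling provides.

```python
def collections_missing_mrna(collections):
    """Return the names of collections that have no *.mrna.fna.gz file.

    collections: dict mapping a collection name to a list of file names.
    """
    return sorted(
        name
        for name, files in collections.items()
        if not any(f.endswith(".mrna.fna.gz") for f in files)
    )
```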
I've voted "yea", but now there are some devils in the details:
OK by me. One point of clarification, @sammyjava correct me if I'm wrong but I believe that if the mine loader encounters a sequence in an mrna file that doesn't correspond to an mrna in the gff file, it will create a new record rather than fail, which is not the end of the world (failure to load would be more doomsday-like). I know we have some transcript files that contain non-mrna things (e.g. tRNA); we can probably id such things by comparing counts of records in cds fasta with the corresponding mrna files. But in the likely event we don't get everything right away, I guess nothing is going to come to a grinding halt.
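The count comparison could be sketched roughly as follows. It assumes gzipped FASTA files and simply counts header lines; the function names are hypothetical, and a nonzero difference is only a hint that the transcript/mrna file holds non-mRNA records, not proof.

```python
import gzip


def count_fasta_records(path):
    """Count header lines (starting with '>') in a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def extra_records(cds_path, mrna_path):
    """How many more records the mRNA file has than the CDS file."""
    return count_fasta_records(mrna_path) - count_fasta_records(cds_path)
```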
Correct, you'd just create an orphan mRNA object that has no chromosome location (and therefore won't show on the JBrowse). The FASTA/GFF objects are merged on (1) being the same class and (2) full-yuck identifier. So in the case of legit objects in both FASTA and GFF, we get a bit from the GFF (location,parent) while the sequence and length come from the FASTA because the identifiers match up. (The transfer-sequences post-process skips them because they already have sequences from the FASTA.)
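To make the merge rule concrete, here is a toy illustration in Python. It is not the InterMine code: the dict layout, the field names, and `merge_records` are all invented for the example. Records are keyed by (class, identifier); the GFF side supplies location and parent, the FASTA side supplies sequence and length.

```python
def merge_records(gff_records, fasta_records):
    """Merge two dicts keyed by (class_name, identifier).

    gff_records values: dicts with optional 'location' and 'parent'.
    fasta_records values: sequence strings.
    """
    empty = {"location": None, "parent": None, "sequence": None, "length": None}
    merged = {}
    for key, gff in gff_records.items():
        merged[key] = dict(empty, location=gff.get("location"), parent=gff.get("parent"))
    for key, seq in fasta_records.items():
        # A FASTA-only key becomes an orphan: sequence but no location.
        record = merged.setdefault(key, dict(empty))
        record["sequence"] = seq
        record["length"] = len(seq)
    return merged
```

In this toy version a FASTA record with no GFF counterpart ends up with `location` set to `None`, which is the orphan case described above.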
Steven, shall I task you with the s/transcript/mrna/ and s/_primaryTranscript/_primary/ tasks since there may be a bit of content-verification involved? Otherwise I'm happy to just do the renaming, that's easy and changes nothing other than the checksum.
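The two substitutions could be applied to a file name roughly like this. It is a sketch under the assumption that the type is its own dot-delimited field; `proposed_name` is a made-up helper, and it deliberately leaves oddballs such as "transcripts" alone so they can be checked by hand. It does nothing about verifying file content.

```python
import re


def proposed_name(filename):
    """Apply s/_primaryTranscript/_primary/ and s/transcript/mrna/ to a file name."""
    name = filename.replace("_primaryTranscript", "_primary")
    # Only rename a type field that is exactly "transcript" or starts with "transcript_".
    return re.sub(r"\.transcript(?=[._])", ".mrna", name)
```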
Also, just so it's known to all: I do load all RNA types from the GFFs (tRNA, ncRNA, miRNA, you name it). But the GFF loader loads each record into an InterMine object of the class associated with that type. And then, since non-mRNA objects aren't loaded from FASTAs, the sequences are derived from the chromosomal span (and strand) in the transfer-sequences post-processor. So if we were keen to load, say, ncRNA sequences from FASTAs, we certainly can do that by putting them in their own file. It's just that the FASTA loader loads one class per file.
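For readers unfamiliar with what deriving a sequence from the chromosomal span and strand means, a minimal sketch follows. It is not the transfer-sequences post-processor itself; `span_sequence` is hypothetical and assumes 1-based inclusive GFF coordinates and a plain nucleotide string.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def span_sequence(chromosome_seq, start, end, strand):
    """Extract the sequence for a 1-based inclusive span on the given strand."""
    seq = chromosome_seq[start - 1:end]
    if strand == "-":
        # Minus-strand features are the reverse complement of the span.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

Note that a sequence taken this way covers the whole span, introns included, which is why a length taken from a FASTA record can differ from the span length.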
Yes, I can do that renaming. An associated task is to update the two MANIFEST files. These don't currently have great utility, because they haven't been produced with great rigor; on the other hand, they do record prior filenames, which I think is important for provenance. For this minor renaming, I think I'll just change the "current filename" in the first field - as opposed to adding another "previous filename" field. One more thing: I think we ought to notify the group (via lis-developers) of the changes. I'll do that now.
The CHECKSUM files need to be updated as well.
CHECKSUM updates - yes. Will do that later today, after some more changes and checks.
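If it helps, recomputing a digest per file is short in Python. The thread does not say which algorithm or line format the CHECKSUM files use, so MD5 as the default here is purely an assumption, and `file_checksum` is an illustrative name.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```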
This was done, closing.
Some collections have an mrna.fna; others have a transcript.fna. Which shall it be? I'm happy to rename them.