Open StevenCannon-USDA opened 5 months ago
@StevenCannon-USDA are you sure about calling these gnm1.ann2 annotations? I guess @jd-campbell had started a gnm1.ann1 but since it was never made public (that I know of) it seems a little weird to start counting at 2. But then again it wouldn't be the first time versioning got weird, so just let me know if you think this best and I'll go with it.
@adf-ncgr - You are right. Thanks for catching this. I'll fix this upstream (probably not until this evening though).
Looking into this more closely: @jd-campbell did an annotation on this assembly in 2021-2022. I believe we left it at the point of having BRAKER and Direct Information Hi- and Low confidence prediction, but with lack of clarity about how to pick a trusted set. So, this annotation was effectively shelved.
At that time (~2021), we had reserved to keys to use for stenosperma: PFL2 for gnm1 and CZRZ for gnm1.ann1. However, because ann1 was never finalized, we never really used key CZRZ.
At this point, we could ... (1) rename the GenBank annotation as gnm1.ann1 (2) use the GenBank annotation as currently prepared, as gnm1.ann2 ... And further, could either mint a new annotation key for gnm1.ann2, or repurpose CZRZ.
Of those options, I am leaning toward the option of least resistance - which is to use gnm1.ann2.CZRZ for the GenBank annotation; and just live with the fact that there is likely never to be an ann1. This would be similar to the situation for Glycine max Wm82, where we have gnm1, gnm2, gnm4, gnm6.
I don't feel strongly about this. Recalculating everything by replaying the notes would only take 15 minutes.
I guess I'll offer the proposal above (the first annotation on gnm1 will be called ann2) in the spirit of a Request for Objections. Seeking opinions/votes particularly from @sdash-github, @jd-campbell, @adf-ncgr
I don't object to 2), but my mild preference would be for 1). No preference on reusing the key or not. It's too bad that soybean didn't hop from gnm4 to gnm8, that would have been a nice numerical sequence (and the number of Wm82* genomes seems to grow exponentially no matter what!) thanks for following up
I'd say the mild preference (for 1) wins. I'll recalculate the collections now.
OK: option 1 has been implemented. The annotations collection is now Arachis/stenosperma/V10309.gnm1.ann1
(again in the annex).
For the record: replaying the data-preparation notes took half an hour - so the curation job is getting faster (though of course preparing the config file and modifying the notes for a new data set will still be a significant job -- maybe half a day or so, barring data curveballs).
Thanks Steven- will get it into the AHRD queue with the others
OK, the AHRD/BUSCO/GFA stuff is done on this. Should we go ahead and move from annex to public? Some of the following steps will probably prefer if it's located at the final resting place (e.g. the browser config updates which will want to embed stable URLs in the config file). I don't think the primary datastore contents will be updated much if any following those steps, so seems like a logical time to get it over there, but let me know your thoughts. Same will apply to the other couple of genomes which I've got at more or less the same level of analysis now. Happy to do the moving honors if you agree that is a sensible point at which to do so, @StevenCannon-USDA
Also, one thing possibly worth discussing further on this particular dataset (and other future NCBI-sourced annotations), it looks like you've decided to go ahead with the idea of coining transcript/protein ids by tacking on .digit suffixes to the gene ids as is common practice outside of NCBI. I don't see any major issue with that, but I do think that the featid_map file ought to record the mapping from what they used (ie accession numbers) to what we're using, ie not:
LOC130969383.1 arast.V10309.gnm1.ann1.LOC130969383.1
as currently in the file but instead:
rna-XM_057895071.1 arast.V10309.gnm1.ann1.LOC130969383.1
or maybe just
XM_057895071.1 arast.V10309.gnm1.ann1.LOC130969383.1
since if we were linking to NCBI it would probably be the accession number without the rna- prefix that would be used. Thoughts?
Thanks for the quick work. Moving them to public: sure. Feel free to do so - or do delegate back to me if you prefer.
About linking the LOC IDs back to the transcript accession IDs: I agree. I just wasn't up to it at the time. Dealing with the GenBank GFFs left me feeling beat up. I was slow to realize that not only do the mRNA IDs have no string-based correspondence with their parent gene IDs, but that different splice variants within a gene also have no string-based correspondence with one another:
cat genomic.gff | awk '$3~/gene|mRNA/ && $9~/LOC130944731/ {print $3,$9}' | cut -f1-2 -d';'
gene ID=gene-LOC130944731;Dbxref=GeneID:130944731
mRNA ID=rna-XM_057873233.1;Parent=gene-LOC130944731
mRNA ID=rna-XM_057873225.1;Parent=gene-LOC130944731
The (new) script that "fixes" this is rename_gff_mRNA_IDs.pl
To do next: I guess to have it also report the mapping ...
cat genomic.gff | awk '$3~/mRNA/ && $9~/LOC130944731/ {print $9}' |
perl -pe 's/ID=([^;]+);Parent=([^;]+);.+/$1\t$2/'
rna-XM_057873233.1 gene-LOC130944731
rna-XM_057873225.1 gene-LOC130944731
... and then teach ds_souschef to use that.
Hi again- quick note about Arachis/stenosperma/genomes/V10309.gnm1.PFL2 for @StevenCannon-USDA; looks like there are some minor differences between what's already in the v2 area of the datastore and what's in the annex. For example the one in v2 has chromosome names chr01-chr10 whereas the one in the annex has Chr1-Chr10 (all with full yuck of course). Since the gff file matches the new version in the annex, I assume we want to replace the older one with the new one but maybe you could take a quick look and ensure that the differences are all intentional before moving ahead? thanks!
Good to check. I do think we should use the new collection. These sequences come from GenBank. They should be the same as the older ones (the chromosome and scaffold sizes are indeed the same), but I think the principle is right. And the GFF feature correspondence matters of course.
Axctually, I just remembered that changing the chromosome names will have at least one implication, namely that some genome alignments we did previously using that genome against duranensis will need to be redone. Not a big deal in and of itself, and I can't think of other things outside of BLAST that would use the un-annotated genomes. But it looks like the NCBI sequence report just refers to them as 1-10 so presumably adding Chr or chr%02d is a subjective choice on our end. I kind of like using zero-padding on genomes with >=10 chromosomes, and it looks like every other Arachis genome has settled on the chr%02d form, so if you don't object or have other reasons maybe I will sed them back to that (including the gff). Let me know if you disagree!
No objections. Better to do the patch than to have to re-run the alignments.
Great, thanks for tolerating my obsession with trivial matters of consistency- I've made the changes for this and will proceed with the remaining tasks downstream.
@adf-ncgr and @StevenCannon-USDA I agree with labeling the NCBI annotation as ann1. The arast annotation I created was a rough draft.
@StevenCannon-USDA, getting an odd error (though claiming to be a non-error!) in the JBrowse2 in some regions:
NonError: "231: multi-line feature \"arast.V10309.gnm1.ann1.TRNAK-CUU\" has inconsistent types: \"tRNA\", \"gene\""
This seems to be due to the gene and tRNA feature that is its child having the same ID value. I think the handling of ncRNA genes can be a bit fiddly, and there may be reasons to segregate them out into different files (not that simply doing so would fully resolve this issue, or is a necessary condition of solving it). Let me know if I should try to formulate an RFO about this or file something as an issue against ds_souschef.
I think I would handle this as part of the pre-processing work - near line 90 in this file: https://github.com/legumeinfo/datastore-specifications/blob/main/PROTOCOLS/ds_souschef_prep_examples/NCBI_GenBank/notes_arast.V10309.gnm1.ann1.sh
OK, but what exactly are you proposing to do- just re-ID the tRNA features so that there is no collision with the gene IDs? One concern I have with leaving ncRNA genes in the gene_models_main file is that unless they are tagged in some way that allows them to be filtered out easily (e.g. the biotype attributes that I think are used in the original genbank gff but have gotten stripped out in these versions), then they will end up as "orphans" in the GCV, since by definition they will not match a gene family (at least in the current scheme of things where we use proteins for gene families).
For the reasons you give, I think I would remove them (probably into another file).
I can tackle this if you would like -- but I probably won't get to it until next week.
I'm happy to do it as post-processing on this one if you want to tackle making it part of the pre-processing in future.
I won't object. (Always happy to make you happy :-) )
OK, I've done the basic partitioning, although the separated non-coding file still has what JBrowse2 considers a "NonError". So if we offer that as a separate browser track, it will probably still need to get fixed. But I'm going to quit while I'm ahead...
Great. Thank you, and a good weekend to you.
Main steps for adding new genome and annotation collections
Genus/species/collection names:
What are the collection types and names? Example:
Arachis/stenosperma/annotations/V10309.gnm1.ann2.CZRZ
Arachis/stenosperma/genomes/V10309.gnm1.PFL2
[X] Add collection(s) to the Data Store (annex)
[X] Validate the README(s)
[x] Update about_this_collection.yml
[x] Calculate AHRD functional annotations
[x] Calculate gene family assignments (.gfa)
[ ] Add to pan-gene set
[ ] Load relevant mine
[ ] Add BLAST targets
[x] Incorporate into GCV
[x] Update the jekyll collections listing
[x] Update browser configs
[x] run BUSCO