Open StevenCannon-USDA opened 7 months ago
Little FYI on this one, @StevenCannon-USDA: it looks like most of the children of the gene features are given type "transcript", but some of the AHRD-related processing behaves badly if protein-coding transcripts are not given as mRNA. Let me know if you have any concerns about me making that change wholesale. I checked, and the number of genes is only off by one from the number of primary proteins, so I think it's fair to assume they are mRNA. The discrepancy seems to be caused by a gene with ID=acacr.Acra3RX.gnm1.ann1.nbis-gene-1, which also has the attributes gene_id=g12430;transcript_id=g12430.t2; that seems to imply it really ought to just be another isoform of g12430. I suppose I could manually fix that little oddity as well.
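For anyone wanting to reproduce the observation about child feature types, here is a minimal sketch that tallies the types of records whose Parent is a gene. It assumes a tab-separated GFF3 with ID and Parent attributes; the function names are hypothetical and this is not the script actually used on this file.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def tally_gene_children(gff_lines):
    """Count the feature types of records whose Parent is a gene ID."""
    rows = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines rather than guess
        rows.append((cols[2], parse_attributes(cols[8])))

    gene_ids = {a["ID"] for ftype, a in rows if ftype == "gene" and "ID" in a}

    counts = Counter()
    for ftype, attrs in rows:
        parents = attrs.get("Parent", "").split(",")
        if any(p in gene_ids for p in parents):
            counts[ftype] += 1
    return counts
```

Calling `tally_gene_children(open(path))` on the annotation file should show how many gene children are typed "transcript" versus "mRNA".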
Thank you - and I don't have concerns about s/transcript/mRNA/. (Why are there a million ways to munge a GFF?)
> (Why are there a million ways to munge a GFF?)
maybe you should write a song about it! ;)
Oh, actually there's probably more to do on this file, but I'd like a second opinion. In addition to that one weird little gene, it looks like there are a bunch (~2000) of existing mRNA features that appear to essentially just be duplicates of what was originally represented as "transcript", but with an odd ID that nothing else references. For example:
acacr.Acra3RX.gnm1.scaffold_1 GeneMark.hmm3 mRNA 127849 128478 . + . ID=acacr.Acra3RX.gnm1.ann1.nbis-mrna-1;Parent=acacr.Acra3RX.gnm1.ann1.g3;gene_id=g3;transcript_id=g3.t1
acacr.Acra3RX.gnm1.scaffold_1 GeneMark.hmm3 transcript 127849 128478 . + . ID=acacr.Acra3RX.gnm1.ann1.g3.t1;Parent=acacr.Acra3RX.gnm1.ann1.g3
Note that the transcript_id=g3.t1 attribute on the first record suggests it really is a duplicate of the record below it. I'm proposing to just delete these "extras": they don't appear to provide any value, and they are arguably detrimental because, having no exon children, they show up in the transcripts fasta unspliced (they don't appear in the cds or protein files since they have no CDS children either). Let me know if you see something I'm overlooking that would argue for preserving them.
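One way to list such "extras" is to look for transcript-level records that nothing else names as a Parent. The sketch below assumes tab-separated GFF3 input; `childless_transcripts` is a hypothetical helper, not part of any existing tool.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def childless_transcripts(gff_lines, types=("mRNA", "transcript")):
    """Return IDs of transcript-level features that no record lists as Parent."""
    candidates = []      # IDs of mRNA/transcript records, in file order
    referenced = set()   # every ID that appears in some Parent attribute
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                referenced.add(parent)
        if cols[2] in types and "ID" in attrs:
            candidates.append(attrs["ID"])
    return [fid for fid in candidates if fid not in referenced]
```

On the two records shown above (plus the exon and CDS children of g3.t1), this would report only the nbis-mrna-1 record, since no feature references it.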
I agree with you: OK to delete those transcript records that seem to be duplicates of the mRNA features.
History of this gene file: I received it in gtf format (broken actually, with 29 lines being space- rather than tab-separated). I used AGAT to transform the file to gff3. Labels nbis indicate new identifiers added by AGAT (NBIS=National Bioinformatics Infrastructure Sweden).
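As an illustration of the kind of repair the space-separated lines need, here is a minimal sketch that re-tabs a GTF line. It assumes the first eight fields contain no internal spaces, so only the attribute column keeps its spaces; `retab_gtf_line` is a hypothetical name and this is not necessarily how the file was actually fixed.

```python
def retab_gtf_line(line):
    """Rewrite a whitespace-separated GTF line with tabs between its 9 columns."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through untouched
    newline = "\n" if line.endswith("\n") else ""
    body = line.rstrip("\n")
    if body.count("\t") >= 8:
        return line  # already tab-separated
    # Split off the first eight fields on any whitespace; the remainder is
    # the attribute column, whose internal spaces are preserved.
    fields = body.split(None, 8)
    if len(fields) < 9:
        return line  # too few columns to repair safely
    return "\t".join(fields) + newline
```

Lines that are already tab-separated are returned unchanged, so the function can be applied to every line of the file.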
The original structure of the noncompliant records is:
scaffold_1 GeneMark.hmm3 gene 127849 128478 . + . g3
scaffold_1 GeneMark.hmm3 transcript 127849 128478 . + . g3.t1
scaffold_1 GeneMark.hmm3 start_codon 127849 127851 . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1 GeneMark.hmm3 mRNA 127849 128478 . + . transcript_id "g3.t1"; gene_id "g3";
scaffold_1 GeneMark.hmm3 CDS 127849 128478 . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1 GeneMark.hmm3 exon 127849 128478 . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1 GeneMark.hmm3 stop_codon 128476 128478 . + 0 transcript_id "g3.t1"; gene_id "g3";
In the gff3, this becomes
scaffold_1 GeneMark.hmm3 gene 127849 128478 . + . ID=g3
scaffold_1 GeneMark.hmm3 mRNA 127849 128478 . + . ID=nbis-mrna-1;Parent=g3;gene_id=g3;transcript_id=g3.t1
scaffold_1 GeneMark.hmm3 transcript 127849 128478 . + . ID=g3.t1;Parent=g3
scaffold_1 GeneMark.hmm3 exon 127849 128478 . + 0 ID=exon-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1 GeneMark.hmm3 CDS 127849 128478 . + 0 ID=cds-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1 GeneMark.hmm3 start_codon 127849 127851 . + 0 ID=start_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1 GeneMark.hmm3 stop_codon 128476 128478 . + 0 ID=stop_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
My reading of this is that the transcript record is a duplicate of mRNA and could be deleted.
Thanks, this helps clarify. I think I've seen "nbis" appearing in other files occasionally too. In any case, I'll delete the "nbis" records and retain the other to keep the naming consistent (and switch "transcript" -> "mRNA").
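The plan just described could be sketched roughly as follows. This is an assumption-laden illustration, not the script actually run: it assumes tab-separated GFF3, that the records to drop are mRNA features whose ID contains "nbis-mrna-", and that those records have no children. The odd nbis-gene-1 record mentioned earlier is not handled here. `clean_gff` and `drop_marker` are hypothetical names.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def clean_gff(gff_lines, drop_marker="nbis-mrna-"):
    """Drop AGAT-added mRNA duplicates and retype 'transcript' as 'mRNA'."""
    out = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            out.append(line)  # leave malformed lines for manual review
            continue
        feature_id = parse_attributes(cols[8]).get("ID", "")
        if cols[2] == "mRNA" and drop_marker in feature_id:
            continue  # assumed duplicate with no children
        if cols[2] == "transcript":
            cols[2] = "mRNA"
        out.append("\t".join(cols) + "\n")
    return out
```

Checking beforehand that the dropped IDs are never used as a Parent would guard against orphaning any exon or CDS records.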
Main steps for adding new genome and annotation collections
Genus/species/collection names:
Acacia/crassicarpa/genomes/Acra3RX.gnm1.YX4L
Acacia/crassicarpa/annotations/Acra3RX.gnm1.ann1.6C0V
[X] Add collection(s) to the Data Store, including commits to datastore-metadata (at annex as of 2024-02-13)
[X] Validate the README(s)
[ ] Update about_this_collection.yml
[ ] Calculate AHRD functional annotations
[ ] Calculate gene family assignments (.gfa)
[ ] Add to pan-gene set
[ ] Load relevant mine
[ ] Add BLAST targets
[ ] Incorporate into GCV
[ ] Update the jekyll collections listing
[ ] Update browser configs
[ ] Run BUSCO
[ ] Update DSCensor
[ ] Add LINKOUTS to datastore, refresh linkout service