legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Add genome and annotations for Acacia crassicarpa #195

Open StevenCannon-USDA opened 7 months ago

StevenCannon-USDA commented 7 months ago

Main steps for adding new genome and annotation collections

Genus/species/collection names:

adf-ncgr commented 7 months ago

Little FYI on this one, @StevenCannon-USDA: it looks like most of the children of the gene features are given type "transcript", but some of the AHRD-related processing behaves badly if protein-coding transcripts are not given as mRNA. Let me know if you have any concerns about me making that change wholesale. I checked, and the number of genes is only off by one from the number of primary proteins, so I think it's fair to assume they are mRNA. The discrepancy seems to be caused by a gene with ID=acacr.Acra3RX.gnm1.ann1.nbis-gene-1, which also has the attributes gene_id=g12430;transcript_id=g12430.t2; that seems to imply it really ought to just be another isoform of g12430. I suppose I could manually fix that little oddity as well.
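For anyone wanting to check the "children of the gene features" observation on their own copy, here is a minimal sketch that tallies the feature types of records whose Parent is a gene. The function names `parse_attributes` and `gene_child_types` are made up for illustration, and the sketch assumes a tab-separated GFF3 small enough to hold in memory.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_child_types(gff_lines):
    """Count feature types of records whose Parent is a gene ID.

    Hypothetical helper: assumes 9 tab-separated columns per record.
    """
    records = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 record
        records.append((cols[2], parse_attributes(cols[8])))

    gene_ids = {attrs["ID"] for ftype, attrs in records
                if ftype == "gene" and "ID" in attrs}

    counts = Counter()
    for ftype, attrs in records:
        # Parent may hold a comma-separated list of IDs
        for parent in attrs.get("Parent", "").split(","):
            if parent in gene_ids:
                counts[ftype] += 1
    return counts
```

Usage would be something like `gene_child_types(open(path))`, where `path` points at the uncompressed GFF3; a result dominated by "transcript" would match what is described above.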

StevenCannon-USDA commented 7 months ago

Thank you - and I don't have concerns about s/transcript/mRNA/. (Why are there a million ways to munge a GFF?)

adf-ncgr commented 7 months ago

> (Why are there a million ways to munge a GFF?)

maybe you should write a song about it! ;)

adf-ncgr commented 7 months ago

Oh, actually there's probably more to do on this file, but I'd like a second opinion. In addition to that one weird little gene, it looks like there are a bunch (~2000) of existing mRNA features that appear to essentially just be duplicates of what was originally represented as "transcript", but with an odd ID that nothing else references. For example:

acacr.Acra3RX.gnm1.scaffold_1   GeneMark.hmm3   mRNA    127849  128478  .       +       .       ID=acacr.Acra3RX.gnm1.ann1.nbis-mrna-1;Parent=acacr.Acra3RX.gnm1.ann1.g3;gene_id=g3;transcript_id=g3.t1
acacr.Acra3RX.gnm1.scaffold_1   GeneMark.hmm3   transcript      127849  128478  .       +       .       ID=acacr.Acra3RX.gnm1.ann1.g3.t1;Parent=acacr.Acra3RX.gnm1.ann1.g3

Note that the transcript_id=g3.t1 part of the first one seems to suggest it really is a duplicate of the one below. I'm proposing to just delete these "extras", since they don't appear to provide any value and are arguably detrimental: they have no exon children, so they appear in the transcripts fasta without any splicing. (They don't appear in cds or protein, since they don't have CDS children.) Let me know if you see something about these that I'm overlooking that would argue for their being preserved.
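A rough way to list the "extras" is to collect every mRNA ID and then keep only those that no other record names as its Parent. This is only a sketch under the assumption of a tab-separated GFF3 with one ID per record; `childless_mrna_ids` is a hypothetical name, not part of any existing tool.

```python
def childless_mrna_ids(gff_lines):
    """Return IDs of mRNA features that no record lists as Parent."""
    mrna_ids = []
    parents_in_use = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 record
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] == "mRNA" and "ID" in attrs:
            mrna_ids.append(attrs["ID"])
        # Parent may be a comma-separated list; ignore empty entries
        parents_in_use.update(
            p for p in attrs.get("Parent", "").split(",") if p
        )
    return [mrna_id for mrna_id in mrna_ids if mrna_id not in parents_in_use]
```

On the two example records above (plus the exon and CDS children of g3.t1), this should report only the nbis-mrna-1 ID, since nothing references it.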

StevenCannon-USDA commented 7 months ago

I agree with you: OK to delete those transcript records that seem to be duplicates of the mRNA features.

History of this gene file: I received it in GTF format (broken, actually, with 29 lines being space- rather than tab-separated). I used AGAT to transform the file to GFF3. The "nbis" labels indicate new identifiers added by AGAT (NBIS = National Bioinformatics Infrastructure Sweden).
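For the space-separated lines, one possible repair (not necessarily what was done here) is to re-split any line that lacks nine tab-separated columns on whitespace, keeping everything after the eighth field as the attribute column. The sketch assumes the first eight columns contain no internal spaces, and `retab_gtf_line` is a made-up name.

```python
def retab_gtf_line(line):
    """Return the line with its nine GTF columns joined by tabs.

    Lines that already have nine tab-separated columns, comments and
    blank lines are returned unchanged (minus the trailing newline).
    """
    text = line.rstrip("\n")
    if not text.strip() or text.startswith("#"):
        return text
    if len(text.split("\t")) == 9:
        return text  # already well-formed
    # Split on runs of whitespace at most 8 times, so the attribute
    # column (which legitimately contains spaces) stays in one piece.
    fields = text.strip().split(None, 8)
    return "\t".join(fields)
```

Lines that still have fewer than nine fields after this would need a manual look.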

The original structure of the noncompliant records is:

scaffold_1  GeneMark.hmm3 gene  127849  128478  . + . g3
scaffold_1  GeneMark.hmm3 transcript  127849  128478  . + . g3.t1
scaffold_1  GeneMark.hmm3 start_codon 127849  127851  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 mRNA  127849  128478  . + . transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 CDS 127849  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 exon  127849  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 stop_codon  128476  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";

In the GFF3, this becomes:

scaffold_1  GeneMark.hmm3 gene  127849  128478  . + . ID=g3
scaffold_1  GeneMark.hmm3 mRNA  127849  128478  . + . ID=nbis-mrna-1;Parent=g3;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 transcript  127849  128478  . + . ID=g3.t1;Parent=g3
scaffold_1  GeneMark.hmm3 exon  127849  128478  . + 0 ID=exon-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 CDS 127849  128478  . + 0 ID=cds-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 start_codon 127849  127851  . + 0 ID=start_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 stop_codon  128476  128478  . + 0 ID=stop_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1

My reading of this is that the transcript record is a duplicate of mRNA and could be deleted.

adf-ncgr commented 7 months ago

Thanks, this helps clarify. I think I've seen "nbis" appearing in other files occasionally too. In any case, I'll delete the "nbis" records and retain the other to keep the naming consistent (and switch the "transcript" -> "mRNA").
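A minimal sketch of that cleanup, for reference: drop mRNA records whose ID carries the AGAT-style "nbis-mrna-" label, and retype the remaining "transcript" records as mRNA. Matching on the "nbis-mrna-" substring is an assumption based on the example IDs in this thread, `clean_gff3` is a hypothetical name, and the odd nbis-gene-1 record is deliberately left for manual handling.

```python
def clean_gff3(gff_lines):
    """Drop AGAT-added duplicate mRNA records; retype transcript as mRNA."""
    cleaned = []
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if text.startswith("#") or len(cols) != 9:
            cleaned.append(text)  # pass comments and odd lines through
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # Assumption: the duplicates are exactly the mRNA records whose
        # ID contains "nbis-mrna-", as in the examples in this thread.
        if cols[2] == "mRNA" and "nbis-mrna-" in attrs.get("ID", ""):
            continue
        if cols[2] == "transcript":
            cols[2] = "mRNA"
        cleaned.append("\t".join(cols))
    return cleaned
```

A more cautious version would first confirm that nothing names a dropped ID as its Parent, so that no exon or CDS records are orphaned.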