load maker peanut gene models

adf-ncgr commented 10 years ago

[LEGUME-138] created by adf_ncgr

adf-ncgr commented 10 years ago

question for Steven (I think)-
we have gffs separated into two parts:
Aradu.V14167.a1.M1.genemodels.gff
Aradu.V14167.a1.M1.genemodels.gff.lowqual_or_TE

were both sets included in the gene trees? If so, we will load them both. If not, should we load them both anyway??

by adf_ncgr

adf-ncgr commented 10 years ago

Hi Andrew -
Sorry to be slow on this (missed it in the recent deluge). The lowqual_or_TE are not in the trees. I think: no reason to load them (though probably no harm if they are already).

by scannon

adf-ncgr commented 10 years ago

thanks-
I was actually just writing you about another couple of issues relating to peanut gene models.
I think these both more or less stem from a revision of the gffs that was done on your side at some point
in the recent past (although the revision was overall helpful in taking care of some other issues).
a) minor: the polypeptides for the peanut gene models don't have the same naming convention as
the transcripts- I think someone decided to add ".1" suffixes to the mRNA IDs/Names in the gff, but
the peptide sequences and transcript sequences in the fastas don't have these (hence the peptides in
the trees don't have them). I think we probably ought to go ahead and just add the same suffixes to the
peptides (I can do this, as it will help things keep moving on my end)

b) less minor, but possibly not clear what's best:
I had been hoping that the MAKER attributes dealing with "evidence" would be a good addition for building
more refined gene page views (since we're somewhat limited at this point on how much we can say about these
genes). But in the revised gffs, these seem to have been stripped out. For example, instead of:
ID=Aradu.B2QWP;Parent=Aradu.B2QWP;Name=Aradu.B2QWP;_AED=0.25;_eAED=0.25;_QI=79|0|0.14|0.28|1|1|7|387|424
we have now just:
ID=Aradu.B2QWP.1;Parent=Aradu.B2QWP;Name=Aradu.B2QWP.1

this doesn't impact the gene families, but might have implications for Hrishi's work on gene pages
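
For reference, a minimal Python sketch of how the MAKER evidence attributes in the older GFF could be pulled out of column 9. The function name `parse_gff_attributes` is hypothetical; the attribute string is the one quoted above, and nothing here is taken from an existing loader.

```python
def parse_gff_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key] = value
    return attributes


# Attribute string quoted in the comment above (pre-revision GFF).
original = ("ID=Aradu.B2QWP;Parent=Aradu.B2QWP;Name=Aradu.B2QWP;"
            "_AED=0.25;_eAED=0.25;_QI=79|0|0.14|0.28|1|1|7|387|424")

attrs = parse_gff_attributes(original)
aed = float(attrs["_AED"])            # annotation edit distance
qi_fields = attrs["_QI"].split("|")   # pipe-delimited quality index fields
print(attrs["ID"], aed, len(qi_fields))
```

Run against the revised line instead, the same function would simply return no `_AED`, `_eAED`, or `_QI` keys, which is one quick way to confirm which GFF version is in hand.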

by adf_ncgr

adf-ncgr commented 10 years ago

Regarding a): I think the .1 suffix was added to the mRNAs so that the mRNA wouldn't be its own parent; instead:
Aradu.A01 maker mRNA 17735860 17739171 . + . ID=Aradu.B2QWP.1;Parent=Aradu.B2QWP
So I think adding the ".1" to the peptide and transcript sequences may be the right thing to do. (I have a feeling this is getting circular somewhere ...)

Regarding b): I don't recall exactly when the AED scores got stripped out. Probably along the way toward making human-readable descriptions for the browser. The GFF line for the gene above now looks like
Aradu.A01 maker gene 17735860 17739171 . + . ID=Aradu.B2QWP;Name=Aradu.B2QWP;Note=uncharacterized protein LOC100797259 isoform X4 [Glycine max]%3B IPR004332 (Transposase%2C MuDR%2C plant)
(available from the browser; I'll shortly update the download files, which are moving on -stage to e.g. /files/genomes/Arachis_duranensis/annotation/maker/ )

About the AED scores: I am a little conflicted about those. I think only MAKER initiates will appreciate those. I wonder if Hrishi could hash that information in for the purpose of loading the database and the gene pages?

by scannon

adf-ncgr commented 10 years ago

More about this (possibly reversing part of my previous comment): "I think we probably ought to go ahead and just add the same suffixes to the peptides": right now (Sunday afternoon), the tarballs at e.g. /files/genomes/Arachis_duranensis/annotation/maker/ on lis-stage don't have ".1" suffixes for the peptide.fa or transcript.fa files. The genes in the GFF(s) don't have the suffix, and there is only one splice variant for each gene. In the GFF files, however, the mRNAs need the .1 suffix to distinguish them from their parent gene features. So I guess my question is: do you think we can get away with not adding .1 to the fasta files? In the context of the gene trees (and really most other contexts), ".1" ends up being basically two extra (useless) characters. May be best to talk this one through by phone.
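
One way to see the size of the mismatch before deciding is to compare the mRNA IDs in the GFF with the record IDs in the fasta. This is only a sketch: the helper names `mrna_ids` and `fasta_ids` are hypothetical, the GFF is assumed to be tab-delimited with nine columns, and the sequence in the example is a placeholder.

```python
def mrna_ids(gff_lines):
    """Collect the ID attribute of every mRNA feature in a GFF3 file."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue
        for field in columns[8].split(";"):
            if field.startswith("ID="):
                ids.add(field[3:])
    return ids


def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    return {line[1:].split()[0]
            for line in fasta_lines
            if line.startswith(">") and line[1:].strip()}


gff = ["Aradu.A01\tmaker\tmRNA\t17735860\t17739171\t.\t+\t.\t"
       "ID=Aradu.B2QWP.1;Parent=Aradu.B2QWP\n"]
fasta = [">Aradu.B2QWP\n", "MSTK\n"]  # placeholder sequence, not real data

missing = mrna_ids(gff) - fasta_ids(fasta)
print(sorted(missing))
```

With the unsuffixed fasta, every mRNA ID shows up as missing; after suffixing, the set should come back empty.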

by scannon

adf-ncgr commented 10 years ago

I have added a ".1" suffix to each transcript and peptide fasta file. These are available at peanutbase-stage via
http://peanutbase-stage.agron.iastate.edu/genomes
and thence to
http://peanutbase-stage.agron.iastate.edu/files/genomes/Arachis_duranensis/annotation/Aradu.V14167.a1.G1.tar.gz
http://peanutbase-stage.agron.iastate.edu/files/genomes/Arachis_duranensis/annotation/Aradu.V14167.a1.M1.tar.gz
http://peanutbase-stage.agron.iastate.edu/files/genomes/Arachis_duranensis/annotation/Araip.K30076.a1.G1.tar.gz
http://peanutbase-stage.agron.iastate.edu/files/genomes/Arachis_duranensis/annotation/Araip.K30076.a1.M1.tar.gz
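
For anyone needing to repeat the suffixing on another fasta, a rough Python sketch follows. It is not the script that produced the files above; `add_suffix_to_fasta_ids` is a hypothetical name, and it assumes the ID is the first space-delimited token of the header and that no unsuffixed ID already ends in ".1".

```python
def add_suffix_to_fasta_ids(lines, suffix=".1"):
    """Yield FASTA lines with `suffix` appended to each record ID.

    Only the first space-delimited token of a header is treated as the ID;
    any description after it is kept. IDs that already end with `suffix`
    are left unchanged, so the function is safe to run twice.
    """
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            seq_id, sep, description = header.partition(" ")
            if not seq_id.endswith(suffix):
                seq_id += suffix
            yield ">" + seq_id + sep + description + "\n"
        else:
            yield line


records = [">Aradu.B2QWP\n", "MSTK\n"]  # placeholder sequence, not real data
renamed = list(add_suffix_to_fasta_ids(records))
print("".join(renamed), end="")
```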

by scannon

adf-ncgr commented 10 years ago

thanks, I am looking at the tarball contents and slightly confused by:
Araip.K30076.a1.M1.genemodels.final.annot.gff
Araip.K30076.a1.M1.genemodels.gff

and
Aradu.V14167.a1.M1.genemodels.final.annot.gff
Aradu.V14167.a1.M1.genemodels.gff

it looks like the difference between "final.annot" and other versions is the inclusion of AHRD descriptors in the attributes, e.g.:
Note=uncharacterized protein LOC100797259 isoform X4 [Glycine max]%3B IPR004332 (Transposase%2C MuDR%2C plant)

I'll move forward under the assumption these are the better ones to use for loading purposes, but let me know if this is incorrect...
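
Since the Note value is percent-encoded per GFF3 (%3B for ";" and %2C for ","), it needs decoding before display. A small sketch using the standard library's `urllib.parse.unquote`; splitting the decoded text on "; " into a description and an InterPro part is my assumption about how the descriptor is laid out, based only on this one example.

```python
from urllib.parse import unquote

# Note value quoted in the comment above.
note = ("uncharacterized protein LOC100797259 isoform X4 [Glycine max]%3B "
        "IPR004332 (Transposase%2C MuDR%2C plant)")

decoded = unquote(note)
parts = decoded.split("; ")  # assumed layout: description, then InterPro term
print(decoded)
```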

by adf_ncgr

adf-ncgr commented 10 years ago

one other thing. in the context of legumeinfo, I'll be loading these with the Name modifications previously discussed
(ie prefixing the aradu. araip. namespaces).

It seems likely that this is superfluous in the peanutbase context, especially given the fact that these already have
the species identifiers built into their names. But let me know if you think consistency between legumeinfo and
peanutbase in this regard would be more important. I can't claim to have thought it through much. In any case,
I'll be focusing on legumeinfo first for the sake of the genetrees and because I still need to get access to the
peanutbase servers...

by adf_ncgr

adf-ncgr commented 10 years ago

You are correct: use .final.annot.gff (I suppose I should get rid of the ones without annotations; will do that some time today).

by scannon

adf-ncgr commented 10 years ago

OK: I have cleaned up the directories (Aradu.V14167.a1.M1.tar.gz and Araip.K30076.a1.M1) - tidying the READMEs, adding the usage agreement, removing the penultimate GFF, and removing "final" from the .genemodels.annot.gff files. I also added a hash file with AED scores to each directory (may be useful in case Hrishi wants to add that information to the gene pages etc.)
Let me know if you see any other rough edges.
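
In case it helps whoever loads the AED scores, here is a sketch of deriving an ID-to-AED table from a GFF that still carries the `_AED` attribute. The actual layout of the hash file in the tarballs is not described in this thread, so the two-column tab-delimited output is an assumption, `aed_rows` is a hypothetical name, and the example line is a composite of the coordinates and attributes quoted earlier.

```python
import csv
import io


def aed_rows(gff_lines):
    """Yield (feature ID, AED) pairs for features carrying an _AED attribute."""
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in columns[8].split(";") if "=" in f)
        if "ID" in attrs and "_AED" in attrs:
            yield attrs["ID"], attrs["_AED"]


# Composite example line: coordinates and attributes quoted earlier in the thread.
gff = ["Aradu.A01\tmaker\tmRNA\t17735860\t17739171\t.\t+\t.\t"
       "ID=Aradu.B2QWP;Parent=Aradu.B2QWP;Name=Aradu.B2QWP;_AED=0.25;_eAED=0.25\n"]

buffer = io.StringIO()  # stand-in for an output file
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(aed_rows(gff))
print(buffer.getvalue(), end="")
```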

by scannon

adf-ncgr commented 10 years ago

done on lis-stage. in progress on peanutbase-stage

by adf_ncgr

legumeinfo / jira-issues

load maker peanut gene models #106