sestaton / HMMER2GO

Annotate DNA sequences for Gene Ontology terms
MIT License

number of final terms #13

Closed kbrevs closed 7 years ago

kbrevs commented 7 years ago

This may be more of a clarification than an issue. I'm starting off with 21175 predicted genes from MAKER, but only 6993 of them end up with GO terms mapped to them in the GOterm_mapping.tsv file. Hopefully I'm just missing something obvious. This is what I am running; let me know if info from intermediate steps would be helpful:

hmmer2go getorf -i ~/Documents/Arthropod_Genomes/genome_files/agla_transcripts.fa -o AGLA_genes_orfs.faa

hmmer2go run --cpu 20 -i AGLA_genes_orfs.faa -d Pfam-A.hmm -o AGLA_genes_orfs_Pfam-A.tblout

hmmer2go mapterms -i AGLA_genes_orfs_Pfam-A.tblout -o AGLA_genes_orfs_Pfam-A_GO.tsv --map

Thanks!

sestaton commented 7 years ago

Hi,

If you map GO terms from Pfam domain matches alone, you will always end up with fewer annotated sequences than input transcripts. The first thing I would check is the length distribution of the transcripts. There is a default minimum length of 80 bp for selecting ORFs, which I would not decrease because that will just bring in low-quality matches. You may find that many of your transcripts are being filtered by length, in which case the mapping of terms is better than it appears.
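As a rough way to look at the length distribution, the sketch below counts how many FASTA records reach a given minimum length. `count_min_len` is a hypothetical helper, not part of HMMER2GO, and it counts raw sequence characters per record, so whether its threshold matches the units the ORF filter uses is an assumption to check against the getorf documentation.

```shell
# Hypothetical helper (not part of HMMER2GO): count FASTA records whose
# sequence has at least MIN_LEN characters.
# Usage: count_min_len FASTA MIN_LEN
count_min_len() {
    awk -v min="$2" '
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen && len >= min) n++; seen = 1; len = 0; next }
        # Sequence line: drop whitespace, then add its length to the record.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        # Close the final record and print the count (0 if none qualified).
        END { if (seen && len >= min) n++; print n + 0 }
    ' "$1"
}
```

Running `count_min_len agla_transcripts.fa 80` next to a plain `grep -c ">"` on the same file would show how many input records fall below that length.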

You can get a quick look at how many transcripts were filtered by comparing the sequence counts before and after ORF selection:

grep -c ">" agla_transcripts.fa
grep -c ">" AGLA_genes_orfs.faa

If those counts are similar, I can offer some further advice, but in my experience the length issue is the most important factor for draft genomes.
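For reference, 6993 mapped out of 21175 input genes is about 33%. To express the rate against the ORFs that actually went into the search instead, something like the sketch below could be used. `mapped_fraction` is a hypothetical helper, and it assumes the mapping file is tab-separated with the sequence ID in the first column and no header row, which is not confirmed in this thread.

```shell
# Hypothetical helper (not part of HMMER2GO): percentage of FASTA records
# that have at least one row in a tab-separated mapping file.
# Assumes the sequence ID is in column 1 and there is no header row.
# Usage: mapped_fraction MAPPING_TSV FASTA
mapped_fraction() {
    # Number of distinct, non-empty IDs in column 1 of the mapping file.
    mapped=$(awk -F'\t' '$1 != "" && !seen[$1]++ { n++ } END { print n + 0 }' "$1")
    # Number of FASTA header lines.
    total=$(awk '/^>/ { n++ } END { print n + 0 }' "$2")
    awk -v m="$mapped" -v t="$total" \
        'BEGIN { if (t > 0) printf "%.1f\n", 100 * m / t; else print "NA" }'
}
```

For example, `mapped_fraction AGLA_genes_orfs_Pfam-A_GO.tsv AGLA_genes_orfs.faa` would print the percentage of ORFs with at least one mapped row, using the file names from the commands above.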

Thanks, Evan

sestaton commented 7 years ago

Any updates? If there are no further questions we can close this one. Thanks.

kbrevs commented 7 years ago

Great - thanks for the info! That is helpful.