Closed by kbrevs 7 years ago
Hi,
Mapping GO terms from Pfam domain matches alone will always annotate fewer sequences than the number of input transcripts, because not every transcript has a Pfam match and not every Pfam family has a GO mapping. The first thing I would check is the length distribution of the transcripts. ORFs are selected with a default minimum length of 80 bp, which I would not decrease because that will just bring in low-quality matches. You may find that many of your transcripts are being filtered by length, in which case the mapping of terms is better than it appears.
You can get a quick look at this with the following:
grep -c ">" agla_transcripts.fa
grep -c ">" AGLA_genes_orfs.faa
If those two counts are similar then I can offer some further advice, but in my experience the length issue is the most important factor for draft genomes.
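A rough sketch of the length check itself, for looking beyond the raw record counts. The function names `fasta_lengths` and `count_shorter_than` are made up for illustration and are not part of HMMER2GO. The sketch assumes a plain FASTA file. Note that it measures transcript length, not ORF length: a transcript shorter than the cutoff cannot yield an ORF that long, but a longer transcript may still lack one.

```shell
# Print "id<TAB>length" for every record in a FASTA file.
# Sequence lines are summed, so wrapped sequences are handled.
fasta_lengths() {
    awk '
        /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
             { len += length($0) }
        END  { if (id != "") print id "\t" len }
    ' "$1"
}

# Count records shorter than a cutoff (second argument, default 80).
count_shorter_than() {
    fasta_lengths "$1" | awk -F '\t' -v min="${2:-80}" '$2 < min { n++ } END { print n + 0 }'
}

# Example usage (file name taken from the commands above):
#   count_shorter_than agla_transcripts.fa 80
#   fasta_lengths agla_transcripts.fa | cut -f2 | sort -n | head
```

If the first number is a large fraction of the transcripts, length filtering probably explains much of the gap.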
Thanks, Evan
Any updates? If there are no further questions we can close this one. Thanks.
Great - thanks for the info! That is helpful.
This may be more of a clarification than an issue. I'm starting off with 21175 predicted genes from Maker, but only 6993 of them end up with GO terms mapped in the GOterm_mapping.tsv file. Hopefully I'm just missing something obvious. This is what I am running; let me know if info from intermediate steps would be helpful:
hmmer2go getorf -i ~/Documents/Arthropod_Genomes/genome_files/agla_transcripts.fa -o AGLA_genes_orfs.faa
hmmer2go run --cpu 20 -i AGLA_genes_orfs.faa -d Pfam-A.hmm -o AGLA_genes_orfs_Pfam-A.tblout
hmmer2go mapterms -i AGLA_genes_orfs_Pfam-A.tblout -o AGLA_genes_orfs_Pfam-A_GO.tsv --map
Thanks!
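To see at which step sequences drop out of the pipeline above, a sketch like the following may help. It relies on two assumptions. First, that the `.tblout` file is in hmmscan-style format, where column 3 holds the query (sequence) name and comment lines start with `#`. Second, that the mapping file has the sequence ID in its first tab-separated column. The function names are hypothetical, not HMMER2GO commands.

```shell
# Count distinct query sequences with at least one hit in an HMMER --tblout file.
# Assumes hmmscan-style output: column 3 is the query name, '#' lines are comments.
count_queries_with_hits() {
    awk '!/^#/ && NF >= 3 { print $3 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Count distinct IDs in the first column of a tab-separated file.
# Assumes the sequence ID is in column 1 of the mapping output.
count_unique_first_column() {
    cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}

# Example usage (file names taken from the commands above):
#   count_queries_with_hits AGLA_genes_orfs_Pfam-A.tblout
#   count_unique_first_column AGLA_genes_orfs_Pfam-A_GO.tsv
```

Comparing these two numbers with the `grep -c ">"` counts shows whether most of the loss happens at ORF selection, at the Pfam search, or at the GO mapping step. ORF IDs may not match the original gene IDs exactly, so compare counts rather than names.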