How to extract sequences for a specific KEGG ID

jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis

GNU General Public License v3.0

How to extract sequences for a specific KEGG ID #710

Closed mradz19 closed 1 year ago

mradz19 commented 1 year ago

Is it possible to extract the actual DNA sequences for specific KEGG IDs without using subsetFun() from the SQMtools R package?

fpusan commented 1 year ago

Yes, as long as you code a script yourself. Annotations per ORF can be found in project/results/13*.orftable. The actual DNA sequences for the different ORFs can be found in project/results/03.*.fna

mradz19 commented 1 year ago

Thanks @fpusan, sorry to bother but would you have an example script of how I would do that?

fpusan commented 1 year ago

No, sorry, I just use SQMtools for that.
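For readers who want to avoid SQMtools, the approach described above (find the ORFs annotated with the KEGG ID in the ORF table, then pull those ORFs out of the nucleotide FASTA) could be sketched in Python roughly as follows. This is a minimal sketch, not part of SqueezeMeta. It assumes the ORF table is tab-delimited, that any lines before the header start with `#`, that the header has columns named `ORF ID` and `KEGG ID`, that a KEGG cell may list several IDs separated by `;` and may carry a trailing `*`, and that the first word of each FASTA header is the ORF ID. The function names are hypothetical; check the assumptions against your own files before relying on the output.

```python
import glob
import os


def orfs_with_kegg(orftable_path, kegg_id):
    """Return the set of ORF IDs whose KEGG annotation includes kegg_id."""
    wanted = set()
    orf_col = kegg_col = None
    with open(orftable_path) as handle:
        for line in handle:
            # Assumption: comment lines start with "#"; blank lines are skipped.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if orf_col is None:
                # First non-comment line is taken to be the header.
                # Assumption: columns are named "ORF ID" and "KEGG ID".
                orf_col = fields.index("ORF ID")
                kegg_col = fields.index("KEGG ID")
                continue
            if len(fields) <= max(orf_col, kegg_col):
                continue  # short or malformed row
            # Assumption: several IDs may be joined with ";" and flagged with "*".
            ids = {k.strip().rstrip("*") for k in fields[kegg_col].split(";")}
            if kegg_id in ids:
                wanted.add(fields[orf_col])
    return wanted


def extract_fasta(fasta_path, wanted, out_path):
    """Copy FASTA records whose ID (first word of the header) is in wanted."""
    kept = 0
    keep = False
    with open(fasta_path) as src, open(out_path, "w") as out:
        for line in src:
            if line.startswith(">"):
                words = line[1:].split()
                keep = bool(words) and words[0] in wanted
                if keep:
                    kept += 1
            if keep:
                out.write(line)
    return kept


def first_match(pattern):
    """Return the first file matching a glob pattern, or raise if none exists."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(pattern)
    return matches[0]


def extract_kegg_sequences(project_dir, kegg_id, out_path):
    """Write the DNA sequences of all ORFs annotated with kegg_id to out_path."""
    results = os.path.join(project_dir, "results")
    orftable = first_match(os.path.join(results, "13*.orftable"))
    fasta = first_match(os.path.join(results, "03.*.fna"))
    wanted = orfs_with_kegg(orftable, kegg_id)
    return extract_fasta(fasta, wanted, out_path)
```

Calling `extract_kegg_sequences("project", "K01197", "K01197.fna")` would then write the matching ORFs to `K01197.fna` and return how many records were copied. Streaming the FASTA line by line keeps memory use low for large metagenomes, at the cost of not validating the sequences themselves.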

mradz19 commented 1 year ago

Thanks, I managed to work it out. I used blastx on the sequences but I am getting strange results. I am looking at K01197, and when I run blastx on the sequences against the non-redundant protein sequence database, the results differ from what SqueezeMeta classified them as. On NCBI most of the sequences map to a beta-N-acetylglucosaminidase domain-containing protein, whereas according to SqueezeMeta these sequences are associated with hya (K01197). Is this just due to differences in the databases?

mradz19 commented 1 year ago

I should also mention I tried it with RefSeq and a few other databases, most of which said the sequence was from a beta-N-acetylglucosaminidase domain-containing protein.

jtamames commented 1 year ago

Most likely. Keep in mind that the KEGG version included in SqueezeMeta is very old: it is the last one that was publicly available. You can also add new databases, for instance newer versions of KEGG if you or your institution have access to them.

Best, J

mradz19 commented 1 year ago

I see. Just out of curiosity, what was the motivation for using KEGG, COG and PFAM as the databases over something like RefSeq? Also, what version of each database is being used? I can see COG was updated in 2020; does SqueezeMeta use that or the 2014 version? And what version of PFAM is being used?

fpusan commented 1 year ago

We do use RefSeq for taxonomic annotation. But KEGG/COG/PFAM have consistent hierarchies of functional categories down to (hopefully) individual clusters of proteins with similar function.