Closed mradz19 closed 1 year ago
Yes, as long as you code a script yourself.
Annotations per ORF can be found in project/results/13*.orftable.
The actual DNA sequences for the different ORFs can be found in project/results/03.*.fna
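For anyone landing here later, a minimal Python sketch of that approach: collect the ORF IDs annotated with a given KEGG ID from the ORF table, then pull the matching records out of the FASTA file. Several things here are assumptions, not confirmed in this thread: that the ORF table is tab-separated with leading `#` comment lines, that its columns are named `ORF ID` and `KEGG ID`, that multiple KOs in one cell are separated by `;` and may carry a trailing `*`, and that the first token of each FASTA header is the ORF ID. The function names are made up for illustration; check the column names against your own files.

```python
import csv


def orfs_with_kegg(orftable_lines, kegg_id, id_col="ORF ID", kegg_col="KEGG ID"):
    """Return the set of ORF IDs whose KEGG column lists kegg_id."""
    # Assumption: comment lines start with '#' and precede the header row.
    rows = (line for line in orftable_lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = set()
    for row in reader:
        cell = row.get(kegg_col) or ""
        # Assumption: several KOs are ';'-separated and may end with '*'.
        kos = {ko.strip().rstrip("*") for ko in cell.split(";")}
        if kegg_id in kos:
            hits.add(row[id_col])
    return hits


def _record_id(header):
    """First whitespace-delimited token of a FASTA header ('' if empty)."""
    parts = header.split()
    return parts[0] if parts else ""


def filter_fasta(fasta_lines, wanted_ids):
    """Yield (header, sequence) for FASTA records whose ID is in wanted_ids."""
    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it was wanted.
            if header is not None and _record_id(header) in wanted_ids:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the last record in the file, if wanted.
    if header is not None and _record_id(header) in wanted_ids:
        yield header, "".join(chunks)


def extract(orftable_path, fna_path, kegg_id, out_path):
    """Write the sequences annotated with kegg_id to out_path; return the count."""
    with open(orftable_path) as fh:
        wanted = orfs_with_kegg(fh, kegg_id)
    count = 0
    with open(fna_path) as fin, open(out_path, "w") as fout:
        for header, seq in filter_fasta(fin, wanted):
            fout.write(f">{header}\n{seq}\n")
            count += 1
    return count
```

To use it, call `extract()` with the file matching project/results/13*.orftable, the file matching project/results/03.*.fna, the KEGG ID of interest (for example `"K01197"`), and an output path of your choice.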
Thanks @fpusan, sorry to bother but would you have an example script of how I would do that?
No, sorry, I just use SQMtools for that.
Thanks, I managed to work it out. I ran blastx on the sequences but I am getting strange results. I am looking at K01197, and when I blastx the sequences against the non-redundant protein sequence database the results differ from what SqueezeMeta classified them as. On NCBI most of the sequences map to a beta-N-acetylglucosaminidase domain-containing protein, whereas according to SqueezeMeta these sequences are associated with hya (K01197). Is this just due to differences in the databases?
I should also mention that I tried it with RefSeq and a few other databases, most of which said the sequence was from a beta-N-acetylglucosaminidase domain-containing protein.
Most likely. Keep in mind that the KEGG version included in SqueezeMeta is very old: it is the last one that was public. You can also add new databases, for instance newer versions of KEGG if you or your institution have access to them.
Best, J
I see. Just out of curiosity, what was the motivation for using KEGG, COG and PFAM as the databases over something like RefSeq? Also, what version of each database is being used? I can see COG was updated in 2020; does SqueezeMeta use that or the 2014 version? And what version of PFAM is being used?
We do use RefSeq for taxonomic annotation. But KEGG/COG/PFAM have consistent hierarchies of functional categories, down to (hopefully) individual clusters of proteins with similar function.
Is it possible to extract the actual DNA sequences for specific KEGG IDs without using subsetFun() from the SQMtools R package?