dbcan with unassembled metagenomic sequence

ShriramHPatel commented 4 years ago

Hi dbcan dev,

I really appreciate your effort making this tool open source for the community.

Actually, we have 2x150 bp paired-end shotgun sequencing data, and we think assembling 500+ samples would take an enormous amount of our time.

In that case, would it be reasonable to use unassembled metagenomic sequences (short reads) to profile CAZymes with this pipeline?

Appreciate your advice on this.

Shriram

yinlabniu commented 4 years ago

Hi Shriram,

Thanks for the question. I do not recommend running run_dbcan on unassembled reads, because it simply won't work: the first step, gene prediction, will fail to produce meaningful protein sequences. Another user had a similar request. Raw read-based CAZyme annotation is possible, but it needs a substantial redesign of our pipeline, which will most likely lead to an entirely new software package. I will have a student take on this project in the summer, and hopefully we will have a new tool available in a few months.

Yanbin

ShriramHPatel commented 4 years ago

Thank you very much for your prompt feedback.

Also, would you recommend performing a translated search (diamond blastx, taking into account all six putative frames) against "CAZyDB.07312019.fa.nr"?

Really looking forward to the tool.

Shriram

yinlabniu commented 4 years ago

Yes, diamond blastx should work, but you might want to use very stringent e-value, identity, and coverage thresholds when parsing the results.

Yanbin
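A minimal sketch of the kind of stringent parsing suggested above, assuming the blastx results are in the standard 12-column BLAST-style tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `filter_hits` and the threshold values are hypothetical placeholders, not values recommended in this thread; query coverage is approximated here as the aligned span on the read divided by an assumed read length of 150.

```python
import csv
import io


def filter_hits(tsv_text, read_length=150, max_evalue=1e-10,
                min_identity=80.0, min_query_cov=0.8):
    """Keep hits passing e-value, identity, and query-coverage cutoffs.

    All cutoffs are illustrative assumptions; tune them for your data.
    Returns a list of (read_id, subject_id, bitscore) tuples.
    """
    kept = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        read_id, subject_id = row[0], row[1]
        pident = float(row[2])
        qstart, qend = int(row[6]), int(row[7])
        evalue = float(row[10])
        bitscore = float(row[11])
        # For translated searches, qstart/qend are nucleotide positions on
        # the read; qstart > qend on reverse frames, hence abs().
        query_cov = (abs(qend - qstart) + 1) / read_length
        if (evalue <= max_evalue and pident >= min_identity
                and query_cov >= min_query_cov):
            kept.append((read_id, subject_id, bitscore))
    return kept


# Two made-up alignment rows: the first passes, the second fails every cutoff.
example = (
    "read1\tADM36368.1|CBM26|GH13_28|\t95.0\t48\t2\t0\t1\t144\t10\t57\t1e-25\t100.0\n"
    "read2\tX1|GH5|\t60.0\t30\t12\t0\t1\t90\t5\t34\t1e-3\t40.0\n"
)
kept = filter_hits(example)
print(kept)
```

The subject ID `X1|GH5|` and both read names are invented for the example; only `ADM36368.1|CBM26|GH13_28|` comes from this thread.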

ShriramHPatel commented 4 years ago

Perfect, thanks for your feedback.

ShriramHPatel commented 4 years ago

Hi Yanbin,

I have a (silly) follow-up question in line with the discussion of performing a diamond blastx search with short-read metagenomic sequences. I hope your insights will help other dbcan users as well.

I am planning to use the non-redundant version of the CAZy reference database (CAZyDB.07312019.fa.nr) in the blastx search, followed by stringent parsing of the alignment results as per your suggestion. On closer inspection of the database, however, I found that some of the sequences are annotated with multiple CAZy families.

For example, one of the reference sequences is annotated with both CBM26 and GH13_28 (>ADM36368.1|CBM26|GH13_28|). So, when parsing the alignment results, should we add one count each to CBM26 and GH13_28? Or is there another recommended approach?

Thank you.

Shriram

yinlabniu commented 4 years ago

Sorry for the late response. If you do a blastx search, take the best CAZy hit for each read. Then, when counting how many reads match a CAZy hit, consider all the domains of that hit. For the example you give, you would add one count each to CBM26 and GH13_28.

Yanbin
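The counting rule described above can be sketched as follows. This is only an illustration under stated assumptions: "best hit" is taken to mean the highest bitscore per read, family labels are read from the `|`-separated fields after the accession in headers such as `ADM36368.1|CBM26|GH13_28|`, and the helper names `families_from_header` and `count_families` are hypothetical.

```python
from collections import Counter


def families_from_header(subject_id):
    """Return the CAZy family labels that follow the accession in a header."""
    # "ADM36368.1|CBM26|GH13_28|" -> ["CBM26", "GH13_28"]
    return [field for field in subject_id.split("|")[1:] if field]


def count_families(hits):
    """Count reads per family using only the best hit of each read.

    `hits` is an iterable of (read_id, subject_id, bitscore) tuples.
    Assumption: the best hit is the one with the highest bitscore.
    """
    best = {}
    for read_id, subject_id, bitscore in hits:
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (subject_id, bitscore)

    counts = Counter()
    for subject_id, _ in best.values():
        # A multi-domain reference adds one count to each of its families.
        counts.update(families_from_header(subject_id))
    return counts


# Made-up hits: read1 has two candidate hits, read2 has one.
hits = [
    ("read1", "ADM36368.1|CBM26|GH13_28|", 100.0),
    ("read1", "X1|GH5|", 50.0),
    ("read2", "Y1|GH13_28|", 80.0),
]
counts = count_families(hits)
print(counts)
```

Here `X1` and `Y1` are invented accessions. With these inputs, read1 contributes one count each to CBM26 and GH13_28, and its weaker GH5 hit is ignored.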

linnabrown / run_dbcan

dbcan with unassembled metagenomic sequence #51