N-genes counts from contigs normalized by total reads is not good indicator of gene abundance

jianshu93 commented 4 years ago

Should short reads from metagenomic data be mapped to those genes (or protein) (>90% identity for blastx or diamond blastx for example) that were found in the database to calculate relative abundance of each gene? but not only how many hits of that gene you have against the database because some shot reads may not be assembled successfully but they are actually part of those genes. I do not see any discussion about this in the SBB paper.

qichao1984 commented 4 years ago

Hi, the script does exactly what you were asking. No assembly is needed to profile N cycling using NCycDB. All reads were mapped to the database to get N cycling profiles.

Get Outlook for Androidhttps://aka.ms/ghei36

From: Jianshu_Zhao notifications@github.com Sent: Saturday, May 23, 2020 1:54:12 PM To: qichao1984/NCyc NCyc@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [qichao1984/NCyc] N-genes counts from contigs normalized by total reads is not good indicator of gene abundance (#13)

Should short reads from metagenomic data be mapped to those genes (or protein) (>90% identity for blastx or diamond blastx for example) that were found in the database to calculate relative abundance of each gene? but not only how many hits of that gene you have against the database because some shot reads may not be assembled successfully but they are actually part of those genes. I do not see any discussion about this in the SBB paper.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/qichao1984/NCyc/issues/13, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABNORGHLZB2VADQLNVBIZ6LRS5QIJANCNFSM4NIJ2K6Q.

jianshu93 commented 4 years ago

Hi, in that case, I think global alignment using usearch -usearch_global is not a good choice for mapping short reads to long gene (Or what is your purpose using global alignment?). It is reasonable to use local alignment. BTW, mapping reads to database take a lot of time because metagenomics dataset is often very big, any possibility to parallelize it? Or you can just call multiple processes that run a subset of the original reads and then put the output together? Bowtie2 or Minimap2 or urmap can be good choices for mapping reads to databases with high identity. Which reads should I use for mapping, forward, reverse or merged if the input file is unassembled?

Thanks,

Jianshu

qichao1984 commented 4 years ago

Direct read mapping is not recommended as we are not dealing with large comprehensive nucleotide database. Instead, NCycDB is a protein database for functional assignments. Database searching using blast like tools will return almost all possible N cycling genes. By default, diamond is recommended for the speed and sensitivity.

Get Outlook for Androidhttps://aka.ms/ghei36

From: Jianshu_Zhao notifications@github.com Sent: Monday, May 25, 2020 8:50:24 PM To: qichao1984/NCyc NCyc@noreply.github.com Cc: Qichao Tu philloid@gmail.com; Comment comment@noreply.github.com Subject: Re: [qichao1984/NCyc] N-genes counts from contigs normalized by total reads is not good indicator of gene abundance (#13)

Hi, in that case, I think global alignment using usearch -usearch_global is not a good choice for mapping short reads to long gene (Or what is your purpose using global alignment?). It is reasonable to use local alignment. BTW, mapping reads to database take a lot of time because metagenomics dataset is often very big, any possibility to parallelize it? Or you can just call multiple processes that run a subset of the original reads and then put the output together? Bowtie2 or Minimap2 or urmap can be good choices for mapping reads to databases with high identity. Which reads should I use for mapping, forward, reverse or merged if the input file is unassembled?

Thanks,

Jianshu

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/qichao1984/NCyc/issues/13#issuecomment-633556661, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABNORGELEJLUIHCQCYDCW2DRTJSRBANCNFSM4NIJ2K6Q.

qichao1984 commented 4 years ago

Searching metagenomes against small functional gene databases is not time consuming. Parallelization can be carried out at the tool level, e.g. diamond.

Get Outlook for Androidhttps://aka.ms/ghei36

From: Qichao Tu philloid@gmail.com Sent: Monday, May 25, 2020 9:00:39 PM To: qichao1984/NCyc NCyc@noreply.github.com; qichao1984/NCyc reply@reply.github.com Cc: Comment comment@noreply.github.com Subject: Re: [qichao1984/NCyc] N-genes counts from contigs normalized by total reads is not good indicator of gene abundance (#13)

Direct read mapping is not recommended as we are not dealing with large comprehensive nucleotide database. Instead, NCycDB is a protein database for functional assignments. Database searching using blast like tools will return almost all possible N cycling genes. By default, diamond is recommended for the speed and sensitivity.

Get Outlook for Androidhttps://aka.ms/ghei36

From: Jianshu_Zhao notifications@github.com Sent: Monday, May 25, 2020 8:50:24 PM To: qichao1984/NCyc NCyc@noreply.github.com Cc: Qichao Tu philloid@gmail.com; Comment comment@noreply.github.com Subject: Re: [qichao1984/NCyc] N-genes counts from contigs normalized by total reads is not good indicator of gene abundance (#13)

Hi, in that case, I think global alignment using usearch -usearch_global is not a good choice for mapping short reads to long gene (Or what is your purpose using global alignment?). It is reasonable to use local alignment. BTW, mapping reads to database take a lot of time because metagenomics dataset is often very big, any possibility to parallelize it? Or you can just call multiple processes that run a subset of the original reads and then put the output together? Bowtie2 or Minimap2 or urmap can be good choices for mapping reads to databases with high identity. Which reads should I use for mapping, forward, reverse or merged if the input file is unassembled?

Thanks,

Jianshu

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/qichao1984/NCyc/issues/13#issuecomment-633556661, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABNORGELEJLUIHCQCYDCW2DRTJSRBANCNFSM4NIJ2K6Q.

jianshu93 commented 4 years ago

Again, which reads should I use for NCyc? merged reads? According to my knowledge, annotate short reads to protein databases should be cautious because partial gene will be searched (amoa gene for archea and for bacteria have 80% protein identity, a short read that was annotated as amoa can either be archaea or bacteria because it is short, and may only cover a small region of amoa, which you cannot tell it is archaea or bacteria). My suggestion is to use the assembled contigs and predict genes using prodigal or FragGeneScan, then annotate predicted genes against your database. For the abundance of each gene, mapping short reads to each gene then normalized the number mapped reads for each gene by total reads as the relative abundance of that gene. Do you think it is a good idea?

Thanks for the response,

Jianshu

jianshu93 commented 4 years ago

Then usearch -ublast/-usearch_local will be a good option for functional annotation (local alignment, blast and diamond are all local alignment tools), but not usearch -usearch_global (this is only for conserved genes searching with similar length such as 16S, or N gene but must be roughly the same length). You cannot use usearch -usearch_global to search short reads against long reads (your database), global alignment is not a good option here. You will to have reasonable searches even if you have some good hits.

Thanks,

Best,

Jianshu

qichao1984 commented 4 years ago

I would recommend to use merged reads to get N cycling gene profiles, especially for those complex environmental samples. Gene prediction of contigs and then mapping to predicted genes should also work, but the number of mapped reads highly depends on the assembly rate, which could be very low for environmental samples.
Ideally, usearch_local and usearch_global should return similar results, as only best hits are used for functional annotation. However, 64bit version usearch is usually needed for large metagenome datasets. Therefore, diamond was used by default in the script.

qichao1984 / NCyc

N-genes counts from contigs normalized by total reads is not good indicator of gene abundance #13