High variation in seq length after merging and high proportion of eukaryotes in tax assignment

Starnite commented 7 years ago

Hi @cdiener,

When you were running dada2 on the data, did you notice a high variation in sequence lengths after sequence inference and merging of the paired reads? My sequences vary in length from 230 to 399 bp.

In addition, taxonomic assignment classifies more than 50% of the unique sequence variants as eukaryotic.

My chimera slaying results seem to be in line with yours.

Thanks, Vivian

cdiener commented 7 years ago

I get the same length distribution:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  230.0   272.0   272.0   287.1   284.0   399.0

which I think is OK if you are explicitly merging forward and reverse reads. The min and max are pretty much what you would expect: for forward reads we trim the first 10 bp and truncate to a length of 240, so each forward read is 240 - 10 = 230 bp long. Merging requires a minimum overlap of 20 bp, and the reverse reads have a maximum length of 190 bp, which gives a maximum merged length of 230 + 190 - 20 = 400 bp. However, most merged sequences fall into the IQR of 272-284 bp, which indicates there is a lot of overlap, which is good :)
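
As a quick sanity check of those bounds, here is a minimal sketch in base R. The variable names are illustrative only and do not come from the analysis code, and the values are the ones quoted above. It also assumes that the shortest possible merged read is one where the reverse read lies entirely inside the forward read.

```r
# Values quoted in this thread (variable names are hypothetical)
trim_left   <- 10   # bases trimmed from the start of each forward read
trunc_fwd   <- 240  # truncation length for forward reads
rev_max     <- 190  # maximum reverse read length
min_overlap <- 20   # minimum overlap required for merging

# Forward read length after trimming and truncation
fwd_len <- trunc_fwd - trim_left

# Longest merged read: forward and reverse share only the minimum overlap
max_merged <- fwd_len + rev_max - min_overlap

# Shortest merged read, ASSUMING the reverse read is fully contained
# in the forward read (an assumption, not stated in the thread)
min_merged <- max(fwd_len, rev_max)

# Observed range reported in the summary above
observed <- c(min = 230, max = 399)
within_bounds <- observed[["min"]] >= min_merged &&
  observed[["max"]] <= max_merged

print(c(min_merged = min_merged, max_merged = max_merged))
print(within_bounds)
```

Under these assumptions the observed range of 230 to 399 sits inside the expected bounds of 230 to 400.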

For the kingdom I get the following:

  Archaea  Bacteria Eukaryota 
       48     10058      6658

So there are a lot of Eukaryota, but I would think that number is a bit inflated, since 16S is pretty bad at discerning eukaryotes and the DADA2 error model might not work too well here. In terms of reads, those contribute very little:

❯ sum(seqtab[, taxa[,1] == "Eukaryota"], na.rm = T)
[1] 161710

❯ sum(seqtab, na.rm = T)
[1] 54099558

So that is only about 0.3% of the reads, not even 1%... I have not tried aligning the reads to the human genome yet. Might be worthwhile to filter those out...
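
To make the variant-versus-read comparison concrete, here is a small sketch in base R. The first part only redoes the arithmetic on the counts reported above. The second part uses a made-up toy `seqtab` and `taxa` (hypothetical data, not from this project) to show one way of dropping variants by their assigned kingdom. Note that this is a taxonomy-based filter, which is a different and simpler technique than aligning reads to the human genome.

```r
# Part 1: counts reported in this thread
variants <- c(Archaea = 48, Bacteria = 10058, Eukaryota = 6658)
euk_reads_reported   <- 161710
total_reads_reported <- 54099558

# Share of unique variants vs. share of reads assigned to Eukaryota
euk_variant_pct <- 100 * variants[["Eukaryota"]] / sum(variants)
euk_read_pct    <- 100 * euk_reads_reported / total_reads_reported
print(round(c(variants = euk_variant_pct, reads = euk_read_pct), 2))

# Part 2: toy data (HYPOTHETICAL, for illustration only)
seqtab <- matrix(
  c(100, 5, 40, 3,
    200, 0, 60, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), c("ASV1", "ASV2", "ASV3", "ASV4"))
)
taxa <- matrix(
  c("Bacteria", "Eukaryota", "Bacteria", NA),
  ncol = 1,
  dimnames = list(colnames(seqtab), "Kingdom")
)

# %in% yields FALSE for NA, so unassigned variants are not flagged
is_euk <- taxa[, "Kingdom"] %in% "Eukaryota"

# Reads assigned to Eukaryota in the toy table
euk_reads <- sum(seqtab[, is_euk, drop = FALSE])

# Taxonomy-based filter: keep everything not assigned to Eukaryota
seqtab_filtered <- seqtab[, !is_euk, drop = FALSE]
print(euk_reads)
print(dim(seqtab_filtered))
```

Using `%in%` instead of `==` avoids `NA` entries in the column index, so `na.rm` is not needed in the toy example.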

Starnite commented 7 years ago

Hi Christian,

Thanks for the info - glad to have the confirmation.

Cheers, Vivian


cdiener commented 7 years ago

Sure, glad to see we are seeing similar things :)
