Closed: Starnite closed this issue 7 years ago
I get the same length distribution
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  230.0   272.0   272.0   287.1   284.0   399.0
which I think is OK if you are explicitly merging forward/reverse reads. The min and max are pretty much what you would expect: for forward reads we trim the first 10 bp and truncate to 240, so the forward length is 240 - 10 = 230. Merging requires a minimum overlap of 20 and the reverse reads have a max length of 190, which gives a maximum merged length of 230 + 190 - 20 = 400. However, most merged seqs fall into the IQR of 272-284, which indicates there is a lot of overlap, which is good :)
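The arithmetic above can be checked with a few lines of R. This is only a sketch: the variable names (`trim_left`, `trunc_fwd`, `max_rev`, `min_overlap`) and the helper `overlap_at` are made up for illustration, and the values are copied from the comment, not read from the actual pipeline settings.

```r
# Assumed parameters, taken from the comment above (hypothetical names).
trim_left   <- 10   # bases trimmed from the start of each forward read
trunc_fwd   <- 240  # truncation length applied to forward reads
max_rev     <- 190  # longest reverse read
min_overlap <- 20   # minimum overlap required to merge a pair

# Forward read length after truncation and trimming.
fwd_len <- trunc_fwd - trim_left

# Longest possible merged sequence: both reads at full length, minimal overlap.
max_merged <- fwd_len + max_rev - min_overlap

# Overlap implied by a merged length, assuming both reads are at full length.
overlap_at <- function(merged_len) fwd_len + max_rev - merged_len

fwd_len                  # 230
max_merged               # 400
overlap_at(c(272, 284))  # 148 136
```

Under those assumptions, merged lengths in the 272-284 range correspond to overlaps of roughly 136-148 bp, far above the 20 bp minimum.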
For the kingdom I get the following:
  Archaea  Bacteria Eukaryota
       48     10058      6658
So there are a lot of Eukaryota, but I would think that is a bit inflated since 16S is pretty bad at discerning eukaryotes and the DADA2 error model might not work too well here. In terms of reads those contribute very little:
❯ sum(seqtab[, taxa[,1] == "Eukaryota"], na.rm = T)
[1] 161710
❯ sum(seqtab, na.rm = T)
[1] 54099558
So that is only about 0.3% of the reads, not even 1%... I have not tried to align the reads to the human genome yet. Might be worthwhile to filter those out...
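To put the two views side by side (share of unique variants vs. share of reads), here is a small sketch using the counts quoted above. The toy `seqtab` and `taxa` objects at the end are hypothetical stand-ins, not the real data, and dropping the Eukaryota columns is just one possible way to filter, not something that was done in this analysis.

```r
# Counts copied from the output above.
n_variants  <- c(Archaea = 48, Bacteria = 10058, Eukaryota = 6658)
euk_reads   <- 161710
total_reads <- 54099558

# Share of unique sequence variants vs. share of reads assigned to Eukaryota.
pct_variants <- 100 * n_variants[["Eukaryota"]] / sum(n_variants)
pct_reads    <- 100 * euk_reads / total_reads
round(pct_variants, 1)  # 39.7
round(pct_reads, 1)     # 0.3

# Toy stand-ins (hypothetical values) with the same shape as the real objects:
# samples in rows, sequence variants in columns, kingdom in taxa[, 1].
seqtab <- matrix(c(10, 0, 5, 3, 2, 1), nrow = 2,
                 dimnames = list(c("s1", "s2"), c("ASV1", "ASV2", "ASV3")))
taxa <- matrix(c("Bacteria", "Eukaryota", NA), ncol = 1,
               dimnames = list(colnames(seqtab), "Kingdom"))

# Treat unassigned (NA) kingdoms as "not Eukaryota" so indexing never sees NA.
is_euk <- !is.na(taxa[, 1]) & taxa[, 1] == "Eukaryota"
euk_fraction  <- sum(seqtab[, is_euk]) / sum(seqtab)
seqtab_no_euk <- seqtab[, !is_euk, drop = FALSE]
```

Handling `NA` explicitly in `is_euk` avoids relying on `na.rm` to clean up the `NA` columns that a raw logical comparison would select.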
Hi Christian,
Thanks for the info - glad to have the confirmation.
Cheers, Vivian
From: Christian Diener <notifications@github.com>
Reply-To: resendislab/diabetes_analysis <reply@reply.github.com>
Date: Friday, October 27, 2017 at 12:29 PM
To: resendislab/diabetes_analysis <diabetes_analysis@noreply.github.com>
Cc: Vivian Zhong <vivzhong@mit.edu>, Author <author@noreply.github.com>
Subject: Re: [resendislab/diabetes_analysis] High variation in seq length after merging and high proportion of eukaryotes in tax assignment (#2)
Sure, glad we are seeing similar things :)
Hi @cdiener ,
When you were running DADA2 on the data, did you notice a high variation in sequence lengths after sequence inference and merging paired reads? I have sequences that vary in length from 230 to 399.
In addition, taxonomic assignment on these sequence variants classifies more than 50% of the unique variants as eukaryotic sequences.
My chimera slaying results seem to be in line with yours.
Thanks, Vivian