richardsprague / uBiome

useful tools for manipulating uBiome information

Sample IDs in fastq data #1

Open pschloss opened 9 years ago

pschloss commented 9 years ago

I just took a quick look at your fastq files in the fastq-spragueuBiomeJan2015 folder. A couple things jump out at me...

  1. Do you know which body sites these are for? Does each lane's worth of data represent a different body site? Are there different body sites within each lane? What I'm seeing is that each of the four lanes of data (L001, L002, L003, L004) seems to start with the AGGGT barcode index. There also appear to be one-base variants of that barcode. For example, from L001 I see these barcode indices show up with these frequencies...
16927 AGGGT
7263 AGCGT
9149 AGAGT
9399 AGTGT

I would be shocked if these were four different samples, as they differ by only one base in the middle position, and if there were any errors at all a read would easily get assigned to the wrong sample. Then again, I would also be shocked if they were sequencing variants of the AGGGT index, since that would indicate a very high rate of sequencing error.
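
To put rough numbers on that second point, here is a minimal sketch that takes the four L001 counts quoted above and prints each prefix's share of their combined total. It assumes these four prefixes are the only ones that matter (any other prefixes in the file are ignored), and `prefix_shares` is just an illustrative helper name, not part of mothur or any other tool.

```shell
# Print each prefix's share of the summed counts.
# Input on stdin: lines of "<count> <prefix>", the layout that `uniq -c` emits.
prefix_shares() {
  awk '{ count[$2] = $1; total += $1; order[NR] = $2 }
       END {
         for (i = 1; i <= NR; i++)
           printf "%s %.1f%%\n", order[i], 100 * count[order[i]] / total
       }'
}

# The four L001 counts quoted in the comment above.
printf '%s\n' "16927 AGGGT" "7263 AGCGT" "9149 AGAGT" "9399 AGTGT" | prefix_shares
```

AGGGT comes out at roughly 40% of the four, so if the other three really were miscalls of AGGGT, the middle base would have to be wrong in about 60% of reads.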

  2. These data were most likely generated on an Illumina HiSeq or GAII sequencer. The reads are paired 150 nt reads. They sequenced the V4 region of the 16S rRNA gene, which is about 250 nt long. This means that there would be about 50 nt of overlap. Given the high error rate of the system (see the barcodes above), it is not surprising that when I run make.contigs in mothur, most of the contig sequences have an ambiguous base in them. This happens because the quality scores for the base calls in the overlapping region are so low that one read cannot denoise the other. While this may not be a big deal for classification-based analysis, it pretty well kills the ability to do an OTU-based analysis.
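
The overlap figure is simple arithmetic: two reads that together span one amplicon overlap by twice the read length minus the amplicon length. A small sketch, where `expected_overlap` is a hypothetical helper and the 150 nt and 250 nt values are the ones quoted above (this ignores primers, barcodes, and any trimming, which would shift the number):

```shell
# Expected overlap between paired reads that span a single amplicon:
#   overlap = 2 * read_length - amplicon_length
expected_overlap() {
  read_len=$1
  amplicon_len=$2
  echo $(( 2 * read_len - amplicon_len ))
}

# Paired 150 nt reads over a ~250 nt V4 amplicon.
expected_overlap 150 250
```

With only about 50 nt covered twice, the remaining ~200 nt of each contig rest on a single read, so there is little room for one read to correct the other.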

richardsprague commented 9 years ago

Those should be for a single body site (i.e. gut).

Do you get similar results on the other fastq set, https://github.com/richardsprague/uBiome/tree/master/Data/fastq-spragueuBiomeOct2014 (that's also gut-only, from October)?

On Mar 19, 2015, at 12:41 PM, Pat Schloss notifications@github.com wrote:

pschloss commented 9 years ago

Yep, I get pretty much the same thing for each of the four lanes in the October data:

11171 GATGT
12428 GAAGT
13328 GAGGT
7156 GACGT

FWIW, here are the commands I've been running...

mothur "#fastq.info(fastq=ssr_8445__R1__L001.fastq)"
grep -v ">" ssr_8445__R1__L001.fasta | cut -c -5 | sort | uniq -c | sort -d
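
For anyone without mothur installed, the same tally can probably be had straight from the FASTQ file. This is a sketch under the assumption that the file uses the usual unwrapped four-lines-per-record layout, so the sequence is every fourth line starting at line 2; `count_read_prefixes` is an illustrative name, and the final sort is numeric so the most common prefix comes first.

```shell
# Count the first N bases of every read in a four-line-per-record FASTQ file,
# most frequent prefix first. N defaults to 5.
count_read_prefixes() {
  fastq=$1
  n=${2:-5}
  # Line 2 of each 4-line record holds the sequence.
  awk -v n="$n" 'NR % 4 == 2 { print substr($0, 1, n) }' "$fastq" |
    sort | uniq -c | sort -k1,1nr
}

# Usage, with the file name from the commands above:
# count_read_prefixes ssr_8445__R1__L001.fastq 5
```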