readbio / ea-utils

Automatically exported from code.google.com/p/ea-utils

fastq-multx supporting sequence in header #31

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Pasted from the forums.

This is an important feature that fastq-multx should handle, but currently does
not. Also, I noticed that Illumina writes the index as GAGATTCC+GGCTCTGA (the
two indices joined by "+") for dual-indexed files. It's not hard to do in the
code, and it is a feature that I intend to add.

On Saturday, June 14, 2014 6:20:32 PM UTC-4, Christopher Laumer wrote:
Can fastq-multx (or any other tool that people know of) demultiplex PE fastq 
files based on the index sequence given in the sequence *headers*, not in the 
sequence itself?

For instance consider a 100 bp fastq looking like this (with a mate in a 
different file):

@ILLUMINA-D00365:240:H9N3RADXX:2:1101:2110:2045 1:N:0:GAGATTCCGGCTCTGA
AAGCCGGTATTTAAATATCTTATTGAAAAAATAATTTTATGGTTTGTTTTATTCTTTTAAATAAAATCTTTTAAATCAACTCTTTTTTATTCGGCTATTT
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJIJJJJJJJJJJJJJJIJJJHJJJJJJJJJJJJJJJJJJJJJJHHHHHHFFFFFFEEEEEEDDDDEDDDDDDDE

The index (here, two 8 bp dual indices concatenated) is the last
colon-separated field of the sequence header ("1:N:0:GAGATTCCGGCTCTGA").
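
To make the header layout concrete, here is a minimal Python sketch (not part
of ea-utils) that pulls the index out of a header like the one above. The
function names header_index and split_dual_index are hypothetical, and the
sketch assumes the index is the fourth colon-separated field after the first
space, with two 8 bp indices either concatenated or joined by "+" as mentioned
earlier in this thread.

```python
def header_index(header):
    """Return the index field from a FASTQ header line.

    Assumes the layout '@<read id> <read>:<filter>:<control>:<index>',
    as in the example record; other layouts are not handled.
    """
    parts = header.rstrip("\n").split(None, 1)
    if len(parts) < 2:
        raise ValueError("header has no comment field: %r" % header)
    fields = parts[1].split(":")
    if len(fields) < 4:
        raise ValueError("header has no index field: %r" % header)
    return fields[3]


def split_dual_index(index, first_len=8):
    """Split a dual index into its two halves.

    Handles both 'GAGATTCC+GGCTCTGA' and the concatenated form
    'GAGATTCCGGCTCTGA'. first_len (assumed 8 bp here) is only used
    when there is no '+' separator.
    """
    if "+" in index:
        first, second = index.split("+", 1)
        return first, second
    return index[:first_len], index[first_len:]


header = "@ILLUMINA-D00365:240:H9N3RADXX:2:1101:2110:2045 1:N:0:GAGATTCCGGCTCTGA"
index = header_index(header)
print(index, split_dual_index(index))
```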

From all I can gather the normal behavior of fastq-multx is to look for the 
index within the sequence itself - but these are reads that have already been 
"demultiplexed" by CASAVA but using the wrong indices (so they made it into the 
"UndeterminedIndices" file... long story). 

Does anyone have any ideas on how to handle this (or know whether fastq-multx
can)? I really appreciate the input!
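
One possible approach, sketched in Python rather than with fastq-multx itself:
read both mate files in step, take the index from the read 1 header, and bin
each pair under the sample whose barcode is the unique closest match. All
names here (read_fastq, assign_sample, demultiplex, the sample names, and the
one-mismatch default) are assumptions made for illustration; the sketch keeps
everything in memory, so it is only meant to show the logic.

```python
def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples of stripped lines."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # end of file (or a blank line)
            return
        yield tuple(record)


def hamming(a, b):
    """Count mismatches; sequences of unequal length never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index, barcodes, max_mismatch=1):
    """Return the sample with the unique best barcode match, else None."""
    index = index.replace("+", "")
    scored = sorted(
        (hamming(index, bc.replace("+", "")), name)
        for name, bc in barcodes.items()
    )
    if not scored or scored[0][0] > max_mismatch:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: two barcodes are equally close
    return scored[0][1]


def demultiplex(r1, r2, barcodes, max_mismatch=1):
    """Group paired records by the index found in the read 1 header."""
    bins = {}
    for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
        parts = rec1[0].split(None, 1)
        fields = parts[1].split(":") if len(parts) == 2 else []
        index = fields[3] if len(fields) >= 4 else ""
        sample = assign_sample(index, barcodes, max_mismatch) or "unmatched"
        bins.setdefault(sample, []).append((rec1, rec2))
    return bins
```

With open file handles for the two mate files and a dict such as
{"sampleA": "GAGATTCCGGCTCTGA"}, demultiplex returns a dict mapping each sample
name (or "unmatched") to its list of read pairs, which can then be written out
to per-sample files.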

Original issue reported on code.google.com by earone...@gmail.com on 9 Jul 2014 at 2:12

GoogleCodeExporter commented 9 years ago

Original comment by earone...@gmail.com on 9 Jul 2014 at 2:28