marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Demultiplexing with separate index file #491

Open nr0cinu opened 3 years ago

nr0cinu commented 3 years ago

Hi!

Raw Illumina reads sometimes come with the index read as a separate file, so you get three files like this: S0_L001_I1_001.fastq.gz, S0_L001_R1_001.fastq.gz, S0_L001_R2_001.fastq.gz

Here, S0_L001_I1_001.fastq.gz contains the barcodes/indices. Is there a way to use this with cutadapt? My current workaround is to concatenate I1+R1 for each read and then use cutadapt.
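The concatenation workaround can be sketched roughly as follows. This is only an illustration and not part of Cutadapt: the names `read_fastq` and `prepend_index` are made up for this example, and the sketch assumes that the I1 and R1 files contain plain four-line FASTQ records listing the same reads in the same order.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a text handle of 4-line FASTQ records."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if not lines:
            return
        if len(lines) < 4:
            raise ValueError("truncated FASTQ record")
        yield lines[0], lines[1], lines[3]


def prepend_index(index_handle, read_handle, out_handle):
    """Write each read with its index sequence and qualities prepended.

    Returns the number of records written.
    """
    count = 0
    for (i_head, i_seq, i_qual), (r_head, r_seq, r_qual) in zip(
        read_fastq(index_handle), read_fastq(read_handle)
    ):
        # The read ID is the part of the header before the first whitespace;
        # it should be identical in the index file and the read file.
        if i_head.split()[0] != r_head.split()[0]:
            raise ValueError(f"read ID mismatch: {i_head!r} vs {r_head!r}")
        # Sequence and quality strings are extended by the same length,
        # so the record stays valid FASTQ.
        out_handle.write(f"{r_head}\n{i_seq}{r_seq}\n+\n{i_qual}{r_qual}\n")
        count += 1
    return count
```

For gzipped input, the handles can be opened with `gzip.open(path, "rt")` (and `"wt"` for the output). The combined file then carries the index at the 5' end of each read, so it can be treated like a read with an inline barcode.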

Thanks! Best, Bela

marcelm commented 3 years ago

There is currently no way to use these files. Can you say what your use case is? I’m asking because I never look at these index files myself since our sequencing provider does the demultiplexing for us (using some Illumina tool I assume). Lots of code in Cutadapt assumes that the data comes from one or two files, so it may be difficult to add this feature.

I don’t have an I1 file at hand. Can you post an example of what a record in it looks like?

nr0cinu commented 3 years ago

Hi,

> Can you say what your use case is?

I often see this format when the sequencing provider does not do any demultiplexing and just exports the data as FASTQ files. I assume it is produced by the Illumina software.

> Lots of code in Cutadapt assumes that the data comes from one or two files, so it may be difficult to add this feature.

Since there is an easy workaround, I guess it's not worth adding this feature, then.

> I don’t have an I1 file at hand. Can you post an example of what a record in it looks like?

The I1 file looks identical to the R1 file, but the sequence data is just the Illumina indices. Something like this:

@M00000:1:000000000-ABCDE:1:1234:56789:1234 1:N:0:0
GAGTACGTTCAT
+
AAA1BFFFF@FF
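
To see which index sequences actually occur in such a file, the sequence lines can simply be tallied. A minimal sketch, assuming plain four-line FASTQ records; `count_indices` is a hypothetical helper name, not a Cutadapt function.

```python
from collections import Counter
from itertools import islice


def count_indices(handle):
    """Count how often each index sequence occurs in an I1-style FASTQ handle."""
    counts = Counter()
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            # End of file (or an incomplete trailing record): stop counting.
            break
        # The second line of each record holds the index sequence.
        counts[record[1].strip()] += 1
    return counts
```

Calling `count_indices(handle).most_common(10)` on an opened I1 file (for example via `gzip.open(path, "rt")`) lists the ten most frequent indices, which is a quick sanity check before demultiplexing.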

Thanks :)

Best, Bela

marcelm commented 3 years ago

Thanks for showing how the file looks. I’ll come back to this if anyone else asks. (I guess not everyone would find the workaround easy.)

yxian9 commented 2 years ago

Hi! I'm looking for a solution to exactly the same question. Marcel, you are right: in most cases, Illumina bcl2fastq does the demultiplexing. In some cases, though, I use a complicated indexing strategy that can only be handled by Cutadapt. (Thanks for providing this amazing tool.)

In this case, Illumina provides R1 and R2 FASTQ reads plus I1 and I2 (dual index). The I1 and I2 reads have the same read headers as R1 and R2, and the sequence line (the second line of each record) contains the 10 bp index read, exactly as nr0cinu explained.

Currently, I'm using a simple Python script to create a new temporary R2 file in which the index read sequence and quality scores are appended to the end of the original R2 reads. I then follow the Cutadapt documentation to do the demultiplexing. I'm just wondering whether there is an easier workaround available.
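
A script of the kind described here might look roughly like the sketch below. It is not the script actually used in this thread: `fastq_records` and `append_indices` are hypothetical names, and the sketch assumes gzipped files with four-line FASTQ records in which all files list the same reads in the same order.

```python
import gzip
from itertools import islice


def fastq_records(path):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            lines = [line.rstrip("\n") for line in islice(handle, 4)]
            if not lines:
                return
            if len(lines) < 4:
                raise ValueError(f"truncated FASTQ record in {path}")
            yield lines[0], lines[1], lines[3]


def append_indices(read_path, index_paths, out_path):
    """Append the index reads (sequence and qualities) to the 3' end of each read.

    index_paths is a list such as [I1] or [I1, I2]; the indices are appended
    in the order given. Returns the number of records written.
    """
    readers = [fastq_records(read_path)]
    readers += [fastq_records(path) for path in index_paths]
    count = 0
    with gzip.open(out_path, "wt") as out:
        for records in zip(*readers):
            header, seq, qual = records[0]
            read_id = header.split()[0]
            for i_header, i_seq, i_qual in records[1:]:
                # All files must describe the same read at the same position.
                if i_header.split()[0] != read_id:
                    raise ValueError(f"read ID mismatch at {header!r}")
                seq += i_seq
                qual += i_qual
            out.write(f"{header}\n{seq}\n+\n{qual}\n")
            count += 1
    return count
```

A call such as `append_indices("R2.fastq.gz", ["I1.fastq.gz", "I2.fastq.gz"], "R2_with_index.fastq.gz")` (file names invented for illustration) would produce the temporary R2 file, with the indices sitting at the 3' end of each read.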

marcelm commented 2 years ago

I won’t have time to work on this in the near future, but let’s re-open this so it won’t get forgotten.

peebeenojay commented 1 year ago

Hi! I'm also looking for a solution for this. I have R1 and R2 FASTQ files that come with separate I1 and I2 index FASTQ files. I have not found a way to use all these files successfully with cutadapt or AdapterRemoval.

I have run into a few datasets that provide the files like this -- fortunately, the sequencing company also sent the already demultiplexed files, so I am not stuck on this. But regardless, it would be good to have a solution to use these files for demultiplexing.

Best, Cátia