Demultiplexing paired-end reads with dual barcodes and primers

koopkaup commented 6 years ago

Hi, I have Illumina MiSeq reads in the following format: forward_barcode_sequence forward_primer_sequence read reverse_primer_sequence reverse_barcode

Because forward barcodes are used in combination with reverse barcodes, searching for only the first barcode, as cutadapt does right now, does not work. Is it possible to add an option to search for both barcodes simultaneously to decide where a read belongs?

marcelm commented 6 years ago

You should be able to use a linked adapter for this purpose. Can you try this and let me know whether it works? I’ll then make it clearer in the documentation.

koopkaup commented 6 years ago

I will try this, but how can I provide linked adapters as a FASTA file, as shown in the example: -a file:barcodes.fasta?

marcelm commented 6 years ago

Put them in the FASTA file like this:

>adapter1
ACGT...AACCGGTT
>adapter2
TTAAGG...CCAA

marcelm commented 6 years ago

I guess that, since you use file:, you probably have more than a few adapters. In that case, putting them into the FASTA file as I said in the comment above requires you to list all possible combinations. Depending on how many there are, this could be a bit inefficient and you may be better off running cutadapt in a "nested" way: Run it once to demultiplex according to the forward barcode, and then run it on all the output files to demultiplex according to the reverse barcode.
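To make the "list all possible combinations" point concrete, here is a minimal Python sketch that writes one linked-adapter record (FORWARD...REVERSE) per barcode pair. The barcode names and sequences are hypothetical placeholders, and whether the reverse barcode needs to be reverse-complemented depends on your library layout, so treat this as an illustration rather than a ready-made barcode file.

```python
from itertools import product

# Hypothetical barcode names and sequences, invented for illustration only.
forward_barcodes = {"fwd1": "ACGT", "fwd2": "TTAAGG"}
reverse_barcodes = {"rev1": "AACCGGTT", "rev2": "CCAA"}


def linked_adapter_fasta(forward, reverse):
    """Return FASTA text with one linked adapter (FWD...REV) per barcode pair."""
    lines = []
    for (fwd_name, fwd_seq), (rev_name, rev_seq) in product(
        forward.items(), reverse.items()
    ):
        # The record name combines both barcode names so that the pair
        # can be recognised from the adapter name later on.
        lines.append(f">{fwd_name}-{rev_name}")
        lines.append(f"{fwd_seq}...{rev_seq}")
    return "\n".join(lines) + "\n"


fasta_text = linked_adapter_fasta(forward_barcodes, reverse_barcodes)
print(fasta_text, end="")
```

The number of records is the product of the two barcode counts, which is why the nested two-pass approach may be the better choice when there are many barcodes.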

marcelm commented 6 years ago

Sorry, I should probably have read the title more thoroughly. I overlooked the fact that you have paired-end reads, so the above will only work as I suggested if you merge the reads before running cutadapt.

I will consider adding an option to make this easier.

marcelm commented 6 years ago

Can you clarify whether your reads look like you describe above or whether it is the DNA fragment that you were describing? (forward_barcode_sequence forward_primer_sequence read reverse_primer_sequence reverse_barcode)

koopkaup commented 6 years ago

My reads are in that format.

DenisGoryunov commented 6 years ago

Hi, it seems I have the same kind of data. Please check my recent post on Biostars: https://www.biostars.org/p/324429/#324738

marcelm commented 6 years ago

Could one of you send me a small part of your dataset (just a couple of reads)? I would also need to know what the forward_barcode_sequence, forward_primer_sequence, reverse_primer_sequence and reverse_barcode sequences are (I assume these aren’t random barcodes). You can send this to me privately at marcel.martin@scilifelab.se, but mention it here if you have done so.

However, note that I will not have time to work on this until the middle of August at the earliest.

DenisGoryunov commented 6 years ago

Hi, unfortunately my data are already demultiplexed by the sequencing facility. But to my understanding the task is exactly the same (see the detailed description on Biostars). In case you still need my data, please let me know. This may be useful for understanding dual index technology: https://www.drive5.com/usearch/manual/pipe_demux.html

marcelm commented 5 years ago

As a summary for myself: There are two different dual indexing strategies used by Illumina:

Combinatorial indexing uses 8 different i5 indices and 12 different i7 indices. These used together allow for 96 combinations.
Non-redundant indexing (unique dual indexes, UDI) uses 96 unique i5 indices and 96 unique i7 indices. Apparently, these are only ever used in pairs, that is, the first i5 index is always used with the first i7 index and so on.
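The difference between the two strategies can be sketched in a few lines of Python. The index names below are placeholders rather than real Illumina index sequences; the sketch only shows that combinatorial indexing pairs every i5 with every i7, while UDI pairs the n-th i5 with the n-th i7.

```python
from itertools import product

# Placeholder index names; real kits define the actual sequences.
i5_combinatorial = [f"i5_{n}" for n in range(1, 9)]    # 8 i5 indices
i7_combinatorial = [f"i7_{n}" for n in range(1, 13)]   # 12 i7 indices

# Combinatorial indexing: every i5 index may occur with every i7 index.
combinatorial_pairs = list(product(i5_combinatorial, i7_combinatorial))

i5_udi = [f"i5_{n}" for n in range(1, 97)]             # 96 i5 indices
i7_udi = [f"i7_{n}" for n in range(1, 97)]             # 96 i7 indices

# Unique dual indexing: the n-th i5 index is only used with the n-th i7 index.
udi_pairs = list(zip(i5_udi, i7_udi))

print(len(combinatorial_pairs))
print(len(udi_pairs))
```

Both schemes label the same number of samples, but under UDI any pair outside the fixed list (for example the first i5 with the second i7) signals a problem instead of a valid sample.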

For dealing with this type of data in Cutadapt, we need two options.

For combinatorial indexing, an idea could be to allow not only {name} in the demultiplexing file name template, but perhaps something like {name1}{name2}, where {name1} is the name of the adapter (barcode) that was found in R1 and {name2} is the name of the adapter (barcode) that was found in R2.
For UDI, the --pair-adapters option suggested in #347 would be necessary.
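As a rough illustration of these two ideas (not Cutadapt's actual implementation), the Python sketch below expands a file name template from the barcode names found in R1 and R2, and separately accepts a read pair only when both barcodes sit at the same position in their lists. The function names, the "-" separator in the template and the barcode names are all assumptions made for this example.

```python
def combinatorial_output(template, r1_name, r2_name):
    """Fill {name1}/{name2} with the barcode names found in R1 and R2."""
    return template.format(name1=r1_name, name2=r2_name)


def paired_output(template, r1_name, r2_name, r1_names, r2_names):
    """Fill {name} only if both barcodes occupy the same list position.

    Returns None when the two barcodes do not form one of the fixed pairs.
    """
    if r1_names.index(r1_name) == r2_names.index(r2_name):
        return template.format(name=r1_name)
    return None


# Hypothetical barcode names, listed in pairing order for the UDI case.
r1_names = ["sampleA_i5", "sampleB_i5"]
r2_names = ["sampleA_i7", "sampleB_i7"]

# Combinatorial case: any R1 barcode with any R2 barcode yields a file name.
print(combinatorial_output("{name1}-{name2}.fastq", "sampleA_i5", "sampleB_i7"))

# Paired (UDI-like) case: only matching positions yield a file name.
print(paired_output("{name}.fastq", "sampleA_i5", "sampleA_i7", r1_names, r2_names))
print(paired_output("{name}.fastq", "sampleA_i5", "sampleB_i7", r1_names, r2_names))
```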

koopkaup commented 5 years ago

Have you come up with a solution for the first situation where there are multiple combinations of indices? We have done sequencing like this and I would like to try out your method.

marcelm commented 5 years ago

My plan is to implement the idea where you can use something like {name1}{name2} in the file name templates, as mentioned above. I don’t know when I will have time for this; hopefully this month.

For completeness: The second part, which is the --pair-adapters option, is implemented.

ArnavGuptaa commented 5 years ago

Any update regarding combinatorial indexing?

marcelm commented 5 years ago

I’m working on this now, give me a couple more days.

marcelm commented 5 years ago

Hi, this is now implemented ("combinatorial demultiplexing"). Please read the new section in the documentation.

It would be great if someone could test it and let me know whether this is what you need before I make a new release. Just follow the instructions for how to install a development version.

koopkaup commented 5 years ago

Thanks! I can try it next week and let you know how it worked.

marcelm / cutadapt

Demultiplexing paired-end reads with dual barcodes and primers #292