marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 125 forks source link

Demultiplexing favors the first barcode given?! #724

Closed fesel2 closed 10 months ago

fesel2 commented 10 months ago

Hello, i have problems with de-multiplexing some barcodes. I am new to bioinformatics, so i do not know if the problem is me or your software. But i made an interesting observation, which might be an issue for more people.

First of all: I installed cutadapt with anaconda from the bioconda channel. I currently have python 3.11.0 and cutadapt 4.4 installed. I am running the software on a node in a cluster with 20 CPUs and 50GB RAM.

Now the issue: After demultiplexing i ended up with very unequally distributed barcodes for my samples, which i wasn't able to explain. Especially the bc03 had like twice the amount of reads than the rest. For that reason i played around with different settings: I switched the accepted error-rate to 0 and rearranged the barcodes, putting bc07 and bc04 as the first. And voila suddenly distributions changed. To evaluate the amount of reads i ran fastqc and multiqc after each "experiment". I attached the chart corresponding to bc07 specified first.

cutadapt \ -j 0 \ -a bc07=CNNNNNNACCATAAGATC \ -a bc04=CNNNNNNACTCATAGATC \ -a bc03=CNNNNNNACCGGTAGATC \ -a bc05=CNNNNNNACGAACAGATC \ -a bc06=CNNNNNNACAGCAAGATC \ -a bc08=CNNNNNNACATAGAGATC \ -A universal=CCCAGATCGGAA \ -e 0 \ -o "$end_dir/trimmed_{name}_${paired_read[0]}" \ -p "$end_dir/trimmed_{name}_${paired_read[1]}" \ --minimum-length 24 \ "${paired_read[0]}" "${paired_read[1]}" This code results in this weird barcode distribution. image

Now note what happens when i switch the order of the barcodes, specifying barcode 04 first. cutadapt \ -j 0 \ -a bc04=CNNNNNNACTCATAGATC \ -a bc03=CNNNNNNACCGGTAGATC \ -a bc05=CNNNNNNACGAACAGATC \ -a bc06=CNNNNNNACAGCAAGATC \ -a bc07=CNNNNNNACCATAAGATC \ -a bc08=CNNNNNNACATAGAGATC \ -A universal=CCCAGATCGGAA \ -e 0 \ -o "$end_dir/trimmed_{name}_${paired_read[0]}" \ -p "$end_dir/trimmed_{name}_${paired_read[1]}" \ --minimum-length 24 \ "${paired_read[0]}" "${paired_read[1]}"

image

Now barcode 4 is found in a high amount, which is completely different than in the other run. I mean both times i used the same data. How is this even possible? Looking forward to your answers.

marcelm commented 10 months ago

Where is the barcode in the read? The -a option looks for adapters/primers/barcodes anywhere within the read. If the barcode is at the 5' end, it would be better use the -g option for 5' adapters and to anchor the sequence with ^, so it would be -g bc07=^CNNNNNNACCATAAGATC, for example (anchoring means that the match is required to be in the very beginning).

fesel2 commented 10 months ago

the adapter is at 3' presuming the RNA-fragment is short enough. Actually i didn't use the entire sequence of my adapters since i thought it isn't required when the rest is trimmed anyway using -a.

example read (forward): GGAAGNGCTACCACAACTTTAGCCATAATGTCACTTCTGCCGCGGGCATGCGGCCAGCCACTTTAGGACCGGTAGATCGGA (barcode 3)

actual bc03 adaptor: CNNNNNNACCGGTAGATCGGAAGAGCACACGTCTG

After your comment and reading the docs again i'll probably use the Non-internal 3' adapter feature with the entire adapter length. Looks more appropriate. Let's see if this solves the issue.

marcelm commented 10 months ago

If the data hasn’t been pre-processed in some other way already, I would expect that you would encounter three types of reads that look like this:

  1. Full RNA fragment + full adapter + something else (if the RNA fragment is short)
  2. Full RNA fragment + partial adapter (if the RNA fragment is a bit longer)
  3. Partial RNA fragment (if the RNA fragment is longer than the read)

If you use a regular adapter, Cutadapt finds it in case 1 and 2. If you use a non-internal adapter, it will only find it in case 2. (And nothing can be done about case 3 obviously.)

If the above is how your data looks, I think that using regular adapters is ok, but I would set the minimum overlap (using -O) quite a bit higher in order to ensure that partial matches below a certain length are not found.

This also leads to an explanation of what happens in your original example: The minimum overlap is 3 bases by default, and this includes even N wildcards. So I think what happens here is that any read that has a C close to the 3' end matches all of the provided adapters because they all start with CNNNNNN. Since Cutadapt picks the first provided adapter if many of them match equally well, you get a disproportionate amount assigned to the first one.

For example, your bc07 CNNNNNNACCATAAGATC would match the read ACGTACGT like this:

adapter:      CNNNNNNACCATAAGATC
              |||
read:    ACGTACGT

So I’d use something like -O 15 or larger. It’s a tradeoff: If it’s larger, the matches will be more specific, but also you will miss a greater number of partial matches.

fesel2 commented 10 months ago

I had a conceptional misunderstanding. Sure i do find the barcoded adapters at 3' end of the forward read if the read is small enough. But also the reverse complement in 5' anchored in the beginning of the reverse read. I don't know why i tried to use the 3' end for demultiplexing. Thank you a lot for your time and efforts to help me. Your comment helped me a lot in understanding what i am facing here.

marcelm commented 10 months ago

You’re welcome, I’m glad I could help! (It happens often that my answers aren’t actually the solution to the problem, but that they help clarify things anyway; that’s totally fine.)