marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Output of demultiplexing mixed oriented reads #764

Open luigallucci opened 4 months ago

luigallucci commented 4 months ago

Hi there!

I'm following the guidelines for the mixed-oriented reads case, trying to demultiplex a set of 4 files (x_R1.fastq.gz, x_R2.fastq.gz, y_R1.fastq.gz and y_R2.fastq.gz).

After running the pipeline up to round 2, then swapping the files and processing everything again... do I have to merge the resulting files with the cat command? What procedure would you recommend? Any suggestion for the error rate? I'm currently using a pipeline which has the error rate set to 0.15.

luigallucci commented 4 months ago

To add some context and an additional question:

I'm using Cutadapt 4.5 and Python 3.10 with the following script to demultiplex the files that I have:

/bioinf/home/lgallucc/software/miniconda3/envs/tagseq/bin/cutadapt \
    -j 30 \
    -e 0.15 \
    --no-indels \
    --minimum-length 100 \
    -g ^file:${fwdbarcode} \
    -o ${dir}/round1_{name}.R1.fastq.gz \
    -p ${dir}/round1_{name}.R2.fastq.gz \
    ${input_directory_R1} ${input_directory_R2} \
    > cutadapt_comb_index_${run}_r1.log

With this, I get a low number of reads assigned to the different output files and a high number of reads in the unknown file, produced following the instructions in the user guide.

So, doing the second step in this order:

/bioinf/home/lgallucc/software/miniconda3/envs/tagseq/bin/cutadapt \
    -j 30 \
    -e 0.15 \
    --no-indels \
    -g ^file:${fwdbarcode} \
    -o ${dir}/round2_{name}.R2.fastq.gz \
    -p ${dir}/round2_{name}.R1.fastq.gz \
    ${dir}/round1_unknown.R2.fastq.gz ${dir}/round1_unknown.R1.fastq.gz \
    > cutadapt_comb_index_${run}_r2.log

gave me another set of output files with a reasonable number of reads per file, but still 70,000 reads remain on R1 and 92,000 on R2.

I was trying to understand what was going on, since we have faced a lot of problems recently in our group... so, out of curiosity, I tried using -b instead of -g, and only the first step, without the second reverse-forward round.

This gave me a good distribution of reads across the output files, with just 20 and 24 reads in the unknown R1 and R2, respectively.

Now, my question is... could this solution be fine? Is it still producing mixed-oriented FASTQ files in the output? How would that even be possible? Sequencing errors?

marcelm commented 4 months ago

After running the pipeline up to round 2, then swapping the files and processing everything again... do I have to merge the resulting files with the cat command?

Yes, I forgot this in the guide: If you want one file per barcode, you need to concatenate the files, and that is indeed most easily done with cat. (Note that you can concatenate gzip files without decompressing them: cat first.gz second.gz > both.gz.)
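
If you kept the round1_/round2_ naming from your commands, a loop along these lines should do it (an untested sketch; it assumes the headers in your ${fwdbarcode} FASTA are plain barcode names with no descriptions):

# Merge round 1 and round 2 outputs into one pair of files per barcode.
for name in $(sed -n 's/^>//p' ${fwdbarcode}); do
    cat ${dir}/round1_${name}.R1.fastq.gz ${dir}/round2_${name}.R1.fastq.gz > ${dir}/${name}.R1.fastq.gz
    cat ${dir}/round1_${name}.R2.fastq.gz ${dir}/round2_${name}.R2.fastq.gz > ${dir}/${name}.R2.fastq.gz
done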

That said, you may want to instead update to Cutadapt 4.6 and use the --revcomp option: With it, only one round is needed. Cutadapt will swap R1 and R2 for you (equivalent to reverse complementing) and retain the orientation that looks best (that is, for which it finds a barcode in the expected location).
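
Adapted to your round-1 command, that could look roughly like this (a sketch, untested; the log file name is just an example):

cutadapt \
    -j 30 \
    -e 0.15 \
    --no-indels \
    --minimum-length 100 \
    --revcomp \
    -g ^file:${fwdbarcode} \
    -o ${dir}/{name}.R1.fastq.gz \
    -p ${dir}/{name}.R2.fastq.gz \
    ${input_directory_R1} ${input_directory_R2} \
    > cutadapt_revcomp.log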

I just updated the documentation. If you try it, can you please let me know whether it works as described (I haven’t had time to test it myself)?

What procedure would you recommend? Any suggestion for the error rate? I'm currently using a pipeline which has the error rate set to 0.15.

Assuming these are Illumina sequences and your barcode is not much longer than 10 bases, allowing one error is probably good enough. An error rate of 0.15 probably gets you there, but note that recent Cutadapt versions allow you to specify the allowed number of errors directly as in -e 1.
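
To spell out the arithmetic (the allowed number of errors is the error rate multiplied by the match length, rounded down):

-e 0.15, 8 nt barcode:   floor(0.15 * 8)  = 1 allowed error
-e 0.15, 10 nt barcode:  floor(0.15 * 10) = 1 allowed error
-e 1:                    exactly 1 allowed error, independent of barcode length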

but still 70,000 reads remain on R1 and 92,000 on R2.

What percentage of the total is that and how many barcodes do you have?

out of curiosity, I tried using -b instead of -g, and only the first step, without the second reverse-forward round.

This gave me a good distribution of reads across the output files, with just 20 and 24 reads in the unknown R1 and R2, respectively.

This is the expected outcome, but not the result you want because this comes from matches caused by chance: Using -b implies that partial matches are allowed. As an example: If your barcode is AAAAACGT and your read starts with CGT, then this is considered a match. Since you have multiple barcodes, it is very likely that at least one of them matches partially any given read. The effect is exacerbated by using -b because that option allows partial matches both at the 5' and 3' end, so each read gets two chances for a partial match.
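
To put a rough number on "very likely" (assuming random sequence, ignoring errors, and using the default minimum overlap of 3):

P(one given barcode partially matches a read start by chance) = (1/4)^3 ≈ 0.016
P(at least one of 96 barcodes matches)                        = 1 - (1 - (1/4)^3)^96 ≈ 0.78

And -b doubles the chances by also checking the 3' end, so nearly every read finds some spurious match.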

Now, my question is... could this solution be fine? Is it still producing mixed-oriented FASTQ files in the output? How would that even be possible? Sequencing errors?

You could start by trying to verify that the barcode is indeed where it is supposed to be. You can use Cutadapt for that. For example, use -g "file:barcodes.fasta;o=99" (note: no ^). This allows matches to start anywhere within the read (but partial matches are disabled because of o=99, which sets the minimum overlap to the barcode length). In the statistics, look at the length of removed sequences: If it is often longer than the barcode length, then the barcode was found not at the 5' end. Not sure how helpful this analysis is, but maybe it gives you an idea for other things you could check.
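
Concretely, such a check could look like this (a sketch; -e 1, sending the trimmed output to /dev/null, and the log file name are just suggestions):

# Search for the barcodes anywhere in R1, full-length matches only (o=99);
# we only want the report, not the trimmed reads.
cutadapt \
    -j 30 \
    -e 1 \
    -g "file:${fwdbarcode};o=99" \
    -o /dev/null -p /dev/null \
    ${input_directory_R1} ${input_directory_R2} \
    > barcode_position_check.log

The per-adapter length histograms in the report then show how long the removed prefixes were.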

luigallucci commented 4 months ago

That said, you may want to instead update to Cutadapt 4.6 and use the --revcomp option: With it, only one round is needed. Cutadapt will swap R1 and R2 for you (equivalent to reverse complementing) and retain the orientation that looks best (that is, for which it finds a barcode in the expected location).

This feature really makes the update worth it! Thank you for such a development. And yes, I get exactly the same numbers that I obtained going through the separate steps... I haven't done any in-depth analysis, just some read counts afterwards, but for now it seems to work. Do you think it would also be useful for removing primers from paired-end mixed-oriented reads?

but still 70,000 reads remain on R1 and 92,000 on R2. What percentage of the total is that and how many barcodes do you have?

I have 96 barcodes, and the total number of starting reads is 1,096,025 (double-checked with FastQC).

This is the expected outcome, but not the result you want because this comes from matches caused by chance: Using -b implies that partial matches are allowed. As an example: If your barcode is AAAAACGT and your read starts with CGT, then this is considered a match. Since you have multiple barcodes, it is very likely that at least one of them matches partially any given read. The effect is exacerbated by using -b because that option allows partial matches both at the 5' and 3' end, so each read gets two chances for a partial match.

This is completely true: I ran FastQC on the -b output, and the distribution, k-mer content and other stats were completely off.

You could start by trying to verify that the barcode is indeed where it is supposed to be. You can use Cutadapt for that. For example, use -g "file:barcodes.fasta;o=99" (note: no ^). This allows matches to start anywhere within the read (but partial matches are disabled because of o=99, which sets the minimum overlap to the barcode length). In the statistics, look at the length of removed sequences: If it is often longer than the barcode length, then the barcode was found not at the 5' end. Not sure how helpful this analysis is, but maybe it gives you an idea for other things you could check.

In addition, I tried this last setting that you suggested, but the only output that I got is this:

Processing paired-end reads on 30 cores ...
Done           00:00:51     1,096,025 reads  @   47.3 µs/read;   1.27 M reads/minute
Finished in 53.319 s (48.647 µs/read; 1.23 M reads/minute).

=== Summary ===

Total read pairs processed:          1,096,025

== Read fate breakdown ==
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):     1,096,025 (100.0%)

Total basepairs processed:   648,012,492 bp
  Read 1:   321,285,706 bp
  Read 2:   326,726,786 bp
Total written (filtered):    575,116,774 bp (88.8%)
  Read 1:   250,592,133 bp
  Read 2:   324,524,641 bp

From the read distribution, it seems better divided than before... but it is still unclear whether this is the right processing or not... do you have any idea why Cutadapt is not producing any other output? I just added o=99, removing the ^ at the beginning.

marcelm commented 3 months ago

I just added o=99, removing the ^ at the beginning.

Just a simple check before looking into this any further: Did you enclose the text in quotation marks? Otherwise, the shell will interpret the semicolon as a separator:

$ echo hello;world
hello
world: command not found

So it needs to be something like "file:barcodes.fasta;o=99". Cutadapt also prints out the command-line options that were used to run it. Check that line to see whether that matches what you put in.

luigallucci commented 3 months ago

Yes, the problem was exactly that quoting issue with the command, but... now, using the --revcomp option, the command works, yet the output is not detailed, just:

This is cutadapt 4.6 with Python 3.10.12
Command line parameters: -j 30 --revcomp -e 0.1 --no-indels --minimum-length 290 --discard-untrimmed -g ^file:/bioinf/home/lgall$
Building index of 96 adapters ...
Built an index containing 96 strings in 0.0 s.
Processing paired-end reads on 30 cores ...
Finished in 11.068 s (10.099 µs/read; 5.94 M reads/minute).

=== Summary ===

Total read pairs processed:          1,096,025

== Read fate breakdown ==
Pairs that were too short:              51,820 (4.7%)
Pairs discarded as untrimmed:          124,605 (11.4%)
Pairs written (passing filters):       919,600 (83.9%)

Total basepairs processed:   648,012,492 bp
  Read 1:   321,285,706 bp
  Read 2:   326,726,786 bp
Total written (filtered):    546,242,400 bp (84.3%)
  Read 1:   269,442,800 bp
  Read 2:   276,799,600 bp

Well, in that case I saw that some reads were really too short, so to get a more complete statistic, in the absence of the detailed per-barcode one, I added --discard-untrimmed, -e 0.1 and -g "^file:barcode.fasta;o=99".

My first piece of feedback is related to this: with --revcomp, the usual detailed output is not produced.

Meanwhile... I was unaware that not all of the barcodes correspond to samples, but now that I have a clearer view, I know that only the barcodes up to 48 are actual samples. So, checking with FastQC, I found that for the barcodes from 49 on, each FASTQ file had just 1 read. Following this data, the output is quite nice (even if 11.4% are still unassigned), and for the other barcodes (not corresponding to samples) I think 1 read is low enough to say that these options combined are a good setting. Probably relaxing the -e value a little could result in fewer untrimmed reads... do you have any suggestions?