marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

I have specified adapters for both read pair 1 and 2, but it seems the --pair-filter is still set to both #476

Closed spikeliu closed 3 years ago

spikeliu commented 3 years ago

One of the notes in the documentation says:

As an exception, when you specify adapters only for R1 (-a/-g/-b) or only for R2 (-A/-G/-B), then the --pair-filter mode for --discard-untrimmed is forced to be both (and accordingly, also for the --untrimmed-(paired-)output options).

Otherwise, with the default --pair-filter=any setting, all pairs would be considered untrimmed because it would always be the case that one of the reads in the pair does not contain an adapter.

The problem is that I have specified two adapters for both read 1 and read 2, and I created just one read pair as a test case, in which read 1 contains ADP1 but read 2 contains no adapter at all. The expected result is that this read pair shows up in bad.x.fq, right? But in fact, I got nothing in bad.x.fq; the reads appear in good.x.fq.

Could you please check whether something is wrong? Thanks a million.

marcelm commented 3 years ago

Hi! The note in the documentation does not apply in your case because you use both -b and -B options. That means that the default --pair-filter=any setting applies. Thus, if R1 or R2 is untrimmed or both, the entire pair is considered to be untrimmed and ends up in good.?.fq.

To switch to the other behavior, just explicitly add the --pair-filter=both option.
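The difference between the two modes can be sketched roughly as follows. This is only an illustration of the rule described above, not Cutadapt's actual code; the function name `pair_is_untrimmed` and its boolean arguments are hypothetical, and only the `any` and `both` modes discussed here are modeled.

```python
def pair_is_untrimmed(r1_trimmed: bool, r2_trimmed: bool, pair_filter: str = "any") -> bool:
    """Decide whether a read pair counts as untrimmed (hypothetical helper).

    r1_trimmed / r2_trimmed say whether an adapter was found in each read.
    """
    r1_untrimmed = not r1_trimmed
    r2_untrimmed = not r2_trimmed
    if pair_filter == "any":
        # One untrimmed read is enough to treat the whole pair as untrimmed.
        return r1_untrimmed or r2_untrimmed
    if pair_filter == "both":
        # Both reads must be untrimmed for the pair to count as untrimmed.
        return r1_untrimmed and r2_untrimmed
    raise ValueError(f"mode not modeled in this sketch: {pair_filter!r}")


# The test case from this issue: R1 contains the adapter, R2 does not.
print(pair_is_untrimmed(True, False, "any"))   # True
print(pair_is_untrimmed(True, False, "both"))  # False
```

With `any`, the test pair counts as untrimmed and is sent to the untrimmed output; with `both`, it does not and stays in the regular output.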

marcelm commented 3 years ago

And please let me know if you have a suggestion of how the documentation could be improved.

spikeliu commented 3 years ago

Ok, I got it. Well, it is a little tricky, but it makes sense. I (and maybe some other people) only focused on the problematic reads, which are the trimmed ones. I didn't realize when I use --untrimmed-output, the logic changes to "if read 1 or read 2 is untrimmed, they are both kept". Thanks for your explanation.

marcelm commented 3 years ago

Yes, it’s a bit tricky and I have to remind myself how it works every time.

I didn't realize when I use --untrimmed-output, the logic changes to "if read 1 or read 2 is untrimmed, they are both kept".

Hm, that’s interesting because you say "kept", but the idea I’m trying to convey (and how the code is written) is that the reads that get sent to the files specified via --untrimmed-output are actually the ones that are "discarded" (since they don’t end up in the regular output). In the documentation, I use the term "redirected". So I’d say "if any read is untrimmed, the pair is redirected".

Perhaps I should make this even more explicit (the idea that reads flow from input to -o/-p output, but that you can "siphon them off" using options like --untrimmed-output etc.).
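As a rough mental model of that flow (a sketch only, not how Cutadapt is implemented; the names `route` and `any_untrimmed`, and the representation of a pair as two booleans, are made up for illustration):

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: (R1 had an adapter, R2 had an adapter).
Pair = Tuple[bool, bool]
# A filter is a predicate plus a destination; None means "discard".
Filter = Tuple[Callable[[Pair], bool], Optional[str]]


def route(pair: Pair, filters: List[Filter], regular_output: str = "-o/-p") -> Optional[str]:
    """Return where a pair ends up: the first matching filter siphons it off."""
    for matches, destination in filters:
        if matches(pair):
            return destination
    # No filter matched, so the pair flows through to the regular output.
    return regular_output


def any_untrimmed(pair: Pair) -> bool:
    """Pair-filter 'any': the pair matches if at least one read is untrimmed."""
    return not pair[0] or not pair[1]


redirect = [(any_untrimmed, "--untrimmed-output")]  # redirect matching pairs
discard = [(any_untrimmed, None)]                   # discard matching pairs

print(route((True, False), redirect))  # --untrimmed-output
print(route((True, True), redirect))   # -o/-p
print(route((True, False), discard))   # None
```

In this picture, redirecting and discarding are the same decision; the only difference is whether the siphoned-off pairs are written to a file.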

spikeliu commented 3 years ago

First, I must thank you again for creating such great software that solves a bunch of practical problems.

I think the reason I used "kept" is that I just want to use the untrimmed reads for further analysis. But at the same time, some people like me may also want to have a glimpse at what reads with adapters (trimmed or left untouched, written to -o or -p) look like, or keep reads with adapters for other purposes, so we end up using such a command combination. Another reason for me to separate these two kinds of reads is to test whether the settings are appropriate (did any reads that should have been marked as "with adapters" escape?), so I can tweak them according to the output.

When I first read your documentation, I treated -o/-p (the regular output, as you call it) as the final destination of what I want. I think for most people, reads of good quality and without adapters are always what they want, so when using --untrimmed-output, -o/-p serves as the destination for what most people ultimately don't want to keep. That might be a little confusing?

Just off the top of my head: when people only use -o and/or -p, nothing changes. But if people use --untrimmed-output (--without-adapter-output), -o and/or -p become invalid, and people have to use --trimmed-output (or --with-adapter-output) to write out "trimmed" reads no matter what --action they decide to use; otherwise, "trimmed" reads would appear nowhere, not even in the standard output. In this way, you could force users to be clear about what they are trying to do and what kind of result they would get. The same logic may apply to other output options like --too-short-output (I know length-related output options are applied first, so things may get more complicated, so it doesn't have to apply to every output option). In short, if users want to separate the output into different groups (usually two), they have to specify the corresponding output option to have each group written out; anything they don't mention on the command line would be omitted.

marcelm commented 3 years ago

I noticed I never replied to your last comment.

I think for most people, reads of good quality and without adapters are always what they want

There are other equally valid use cases. For example, when you trim PCR primers, you usually want to have only the reads that did contain adapters (which in that case aren’t really adapters, but primers). And also when you sequence microRNAs, which are just around 20-25 nt long, reads that contain the adapter are the ones you’re interested in.

Adding a --trimmed-output option may sound good, and it might have been a good idea to design the command-line interface that way from the beginning, but that ship has sailed now: For backwards compatibility, I cannot change the way it works now too much, so --trimmed-output would essentially need to be an alias for --output. Even if it sounds like a simple change, needing to implement, test and document these things takes time, and because I consider the benefit to be low, I admit I’d rather spend my time on other things.

I just checked, and I think the documentation (both online and the --help output) is quite clear on what happens when you use --untrimmed-output, so I don’t know how to improve that. If you have a particular suggestion, please let me know, but for now, I’ll close this issue.