jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)
Other
120 stars 15 forks source link

Detect command should determine whether adapters are 5' or 3' #26

Open jdidion opened 7 years ago

jdidion commented 7 years ago

The detect command is currently optimized for 3' adapter detection. It should be possible to detect which end the adapters are on by looking for where long runs of As occur.

wckdouglas commented 6 years ago

Hi, I would be happy to contribute to this. Am I correct that "long runs of As" are as defined by past_end_bases under Detector class?

jdidion commented 6 years ago

Thanks @wckdouglas! Yes that's correct - at least for Illumina data, when the sequencer reads through the template and the opposite adapter completely, the following bases tend to be a run of 'A's. Currently I use a threshold of at least 8 bp to detect complete adapter read-through. For the detect command, you could make the pattern that is searched for configurable via a command line option.

wckdouglas commented 6 years ago

@jdidion ah thanks, I see! I am seeing the current implementation uses the option -past-end-bases that accepts regex, so it seems to take care of it. Or should it be detecting using the Aligner().locate() method if the pattern is customized?

jdidion commented 6 years ago

The code currently does the correct thing for 3' end adapters, but it assumes the adapters will always be at the 3' end. The idea is to search for runs of A's and determine whether they tend to occur towards the beginning or the end of the read. This would need to use some statistical test, as runs of A's will be distributed at various positions, but there should be a 'peak' toward the left or the right side for 5' or 3' adapters, respectively. Does that make sense?

wckdouglas commented 6 years ago

Thanks for the explanation, I think I am starting to get it. So am I right that we are assuming (1) sometimes adapter are not on the 3' end of the reads and bases after the adapter maybe biological meaningful. this seems impossible based on Fig 1A from the atropos peerj paper. (2) most of the time long runs of A's occur after adapter, these are the empty cycles so they have to be removed like atropos currently does before adapter detections, and (3) long runs of A's maybe biological sequences and in these cases, they should not be removed.

Anyways, would it make sense if I start implementing a counter to record the positions of runs of A's? Also is there a particular test fastq that would be good for testing this function?

jdidion commented 6 years ago

Sorry for taking so long to get back to you.

There are some library preparations in which adapters are added to the 5' ends of reads (see the section "5' adapters" in the documentation). These are less common, but we still want to handle them in Atropos.

In the Atropos paper, the focus is on the improved paired-end read trimming, as this is the major improvement from Cutadapt. Currently, only 3' end adapters are supported by the insert aligner. However, 5' adapters can still be handled (using -g/-G or -b/-B).

You are correct in points #2 and #3. However, I was partially wrong in what I said earlier - empty cycles won't occur at the 5' end of the read. The only reason for a 5' adapter not matching at the start of the 5' end is if the sequence is degraded. Thus, long runs of A's at the 5' end are not a signal of 5' adapters. Instead, I think the best that can be done is to build up a distribution of the positions of long runs of A's in a subsample of reads, and if it is shifted far enough to the 3' end, you can be pretty confident that you have 3' adapters. Otherwise you tell the user that the adapter type cannot be determined.

For test data, you can look at the datasets I use in the paper. If you look at paper/containers/data/rna/build.sh and /paper/containers/data/wgbs/build.sh you can see how to download these data from SRA. However, these data both use 3' adapters. I am not sure where to get data that uses 5' adapters - you might have to simulate it.

jdidion commented 6 years ago

@wckdouglas if you'd like to work on this issue (i.e. if it will help you in your own work) please, by all means. However, I think issue #65 is higher priority and would be more straight-forward to implement.

wckdouglas commented 6 years ago

Agree, this issue (#26) will require some careful thoughts and design. I will see what I can do with #65, which will likely be a very helpful feature for my own workflow as well.

jdidion commented 6 years ago

Excellent, thanks.