Automatically detect adapter sequences during trim

jdidion commented 7 years ago

This issue is to discuss the right way to implement automated adapter detection and trimming: https://github.com/jdidion/atropos/projects/1

jfear commented 7 years ago

Here I try to briefly describe my use case, which I think will become more common.

Motivation: With the explosion of sequencing, there is now a large number of public datasets. As new experiments are conducted, it is useful to look at any new data in light of previous results. For example, if I conduct a new experiment focusing on a particular tissue or cell type, it would be useful to compare my samples' expression properties with all other samples from the same tissue or cell type.

Approach: To do these types of analyses, it is important that the data have been treated similarly. That means all data of interest need to be downloaded from the SRA and processed using a similar workflow. For the project I am working on in particular, we are downloading tens of thousands of samples from SRA and remapping them to the same genome release. Unfortunately, technical metadata (i.e., information about the library and sequencing) are not always accurate or complete. To deal with these limitations, we are using properties of the data to inform processing decisions. One of these decisions is whether the data have enough adapter contamination to require trimming the reads. Because metadata describing these technical aspects may be missing or wrong, we don't know exactly what adapters to expect. Having a tool that could detect potential adapters and then trim them could be very useful.

While modern aligners use methods like soft clipping during alignment to reduce the need for pre-trimming, I have noticed that there are cases where read trimming does increase overall mappability by at least a few percent. However, I don't really want to pre-trim, because tools like HISAT2 can download accessions directly and map them. What I am interested in is getting metrics about potential adapter contaminants, so that I can then revisit particular samples if need be.
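
As a rough illustration of the kind of per-sample metric described above, here is a minimal sketch that reports the fraction of reads containing a few widely used adapter prefixes (the Illumina universal, Nextera, and small RNA prefixes that screening tools commonly check). The function name `adapter_fractions` and the toy reads are hypothetical; this is exact substring matching only, not how Atropos detects adapters.

```python
# Hypothetical helper for illustration; not part of Atropos or HISAT2.
# Commonly screened adapter prefixes (assumed candidate set).
CANDIDATE_ADAPTERS = {
    "illumina_universal": "AGATCGGAAGAGC",
    "nextera": "CTGTCTCTTATA",
    "small_rna": "TGGAATTCTCGG",
}


def adapter_fractions(reads, adapters=CANDIDATE_ADAPTERS):
    """Return the fraction of reads containing each adapter prefix.

    Uses exact substring matching, so reads with sequencing errors
    inside the adapter are missed; treat the result as a lower bound.
    """
    counts = {name: 0 for name in adapters}
    total = 0
    for seq in reads:
        total += 1
        for name, prefix in adapters.items():
            if prefix in seq:
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in adapters}
    return {name: n / total for name, n in counts.items()}


# Toy reads (made up for the example).
reads = [
    "ACGTACGTAGATCGGAAGAGCACACGT",  # contains the Illumina prefix
    "TTTTGGGGCCCCAAAATTTTGGGG",
    "GGGGCTGTCTCTTATACACATCT",      # contains the Nextera prefix
    "ACGTACGTACGTACGTACGT",
]
fractions = adapter_fractions(reads)
```

A sample whose fraction exceeds some chosen cutoff could then be flagged for a closer look; the cutoff itself would have to be picked per project.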

antonkulaga commented 7 years ago

Looks like adapter detection is quite fast even in heuristic mode, which means this feature will be quite useful.

jdidion commented 7 years ago

Great! Note that it’s only sampling 10k reads. If adapter contamination is rare, you may need to increase the sample size (using the --maxreads option).
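
To make the sample-size point concrete, here is a small sketch of the arithmetic, assuming each sampled read independently contains adapter with probability `rate` (a simple binomial model). The threshold `k` (how many adapter-containing reads are needed before a detector would notice) is an assumption for illustration; the thread does not state the actual criterion.

```python
import math


def expected_hits(n_reads, rate):
    """Expected number of adapter-containing reads in the sample."""
    return n_reads * rate


def prob_at_least(n_reads, rate, k):
    """P(X >= k) for X ~ Binomial(n_reads, rate).

    Intended for small k; very large k would overflow when the
    binomial coefficient is converted to a float.
    """
    below = sum(
        math.comb(n_reads, i) * rate**i * (1 - rate) ** (n_reads - i)
        for i in range(k)
    )
    return 1.0 - below


# Contamination in 1 of every 10,000 reads, with a 10k-read sample:
p_default = prob_at_least(10_000, 1e-4, 1)
# Same contamination rate with a 100k-read sample:
p_larger = prob_at_least(100_000, 1e-4, 1)
```

Under this model, at a rate of one contaminated read per 10,000, a 10k-read sample has only about a 63% chance of containing even one such read, whereas a 100k-read sample almost certainly contains several.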

antonkulaga commented 7 years ago

It would be nice if it detected multiplexing primers and other typical contaminants with default parameters.

jdidion commented 7 years ago

@jfear the 'sra' branch of Atropos now includes direct streaming from an SRA accession using the '-sra' option. You need to have ngs-python installed.

jfear commented 7 years ago

Thanks @jdidion, that will be super helpful!! Will let you know if I run into any problems.

antonkulaga commented 7 years ago

Any estimate of the 2.0 release time? I have a similar use case to @jfear, and this feature will be super helpful for our lab.

jdidion commented 7 years ago

Nope, no estimate. I've started a new job recently and the amount of free time I have for this and other projects is highly variable. Contributions are always welcome.

jdidion commented 6 years ago

This issue is partially duplicated by #60 and #65. @jfear @antonkulaga, if those two issues are implemented, do you think that will solve the use case for this issue?

jfear commented 6 years ago

Sounds interesting, I think it would address the use case. This will only work for paired-end data, correct?

jdidion commented 6 years ago

Yes, that’s right.
