jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)
Other
120 stars 15 forks source link

atropos does not seem to cut most of adapters and primers #22

Closed antonkulaga closed 7 years ago

antonkulaga commented 7 years ago

I've tried atropos in two modes ( with default downloadable adapters and with detection of adapters and using results of that run in trimming). It does not seem to clean at all. I compared it with sickle and the latest works much better, but still does not cut all Illumina primers. I enclose FASTQC reports. REPORTS.zip for original, atropos and sickle. The SRA that was used is SRR2040662

jdidion commented 7 years ago

Can you provide the atropos command line you used?

On May 31, 2017, at 3:30 PM, Anton Kulaga notifications@github.com wrote:

I've tried atropos in two modes ( with default downloadable adapters and with detection of adapters and using results of that run in trimming). It does not seem to clean at all. I compared it with sickle and the latest works much better, but still does not cut all Illumina primers. I enclose FASTQC reports. REPORTS.zip https://github.com/jdidion/atropos/files/1042693/REPORTS.zip for original, atropos and sickle. The SRA that was used is SRR2040662

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jdidion/atropos/issues/22, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHrntXsuDS72mAAmOhEVnRQWG4sAfPwks5r_b_ygaJpZM4NsFfK.

jdidion commented 7 years ago

FYI - the 'sra' branch has a new -sra option that streams reads directly from an SRA accession. This is based on my srastream library, which in turn requires the NCBI ngs-python library to be installed. This will be in Atropos 1.2 if not earlier.

antonkulaga commented 7 years ago

Here are my commands ($-rs are substituted by samples). For detection:

    /usr/local/bin/atropos detect -se ${file} -d heuristic -O fasta -o ${sampleName}_adapters.fasta

for triming:

/usr/local/bin/atropos trim --trim-n -se ${file} --known-adapters-file ${adapters} -o ${sampleName}"_trimmed.fastq.gz"

As known adapters I give the adapters detected in previous stage (I suspect that they should be combined with known adapters).

Note: I use latest master branch.

jdidion commented 7 years ago

Ah, ok. I think there's a misunderstanding here. A known adapters file simply provides a database from which you can access adapters by name; it does not search all your sequences for all of the adapters in the file. That would make trimming take much longer, since there is a linear time increase with each additional adapter searched. I have some ideas about how to make this faster, but that will be in a later version. Exhaustive search would also waste a lot of time since the known adapter file is a mix of forward and reverse adapter sequences.

For right now, you need to provide the adapters to trim in one of three ways:

The option would be -b instead of -a for 5' adapters.

Using the sra branch, I ran the following command:

atropos detect -sra SRR2040662 -d heuristic

I got back 6 sequences, all of which are at pretty low abundance (~0.1%). When adapters are this rare, sometimes the detect command can need more data, so I ran:

atropos detect -sra SRR2040662 --max-reads 50000 -d heuristic

Now Atropos picks up the most abundant adapter and matches it to a known sequence:

=======
Input 1
=======

File: SRR2040662
Detected 1 adapters/contaminants:
1. Longest kmer: CTGGAGTTCAGACGTGTGCTCT
   Longest matching sequence: CTGGAGTTCAGACGTGTGCTCTTCCGATCTAATTTT
   Name(s): IlluminaMultiplexingAdapter1
   Known sequence(s): GATCGGAAGAGCACACGTCT
   Known sequence K-mers that match detected contaminant: 100.00%
   Abundance (full-length) in 50000 reads: 52 (0.1%)
   Detected contaminant kmers that match known sequence: 36.00%
   Number of k-mer matches: 5592

Through some sleuthing, I figured out that the true adapter sequence for this dataset is GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC.

If we align the reported known sequence, the longest matching sequence, the sequence detected by FASTQC, and the true adapter sequence, we get:

       GATCGGAAGAGCACACGTCT
AAAATTAGATCGGAAGAGCACACGTCTGAACTCCAG
          CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG
       GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC

So these are all matching the same thing. The challenge is that the read lengths here (36bp) are shorter than the adapter size, so the full adapter is never observed. But FASTQC does do a better job of picking up more of the true adapter sequence, so I'm going to have to look at their code and see how I can improve the heuristic algorithm.

The other issue is that this is a 5' adapter, and the detect command was optimized for 3' end adapter detection, so I need to do some work to generalize it for either case. It's pretty easy to detect 5' vs 3' by just looking at what end you find long runs of As.

Anyway, I then trimmed using the known sequence:

atropos trim -T 4 -sra SRR2040662 -b CTGGAGTTCAGACGTGTGCTCT -o test.fq

The resulting FastQC report: test_fastqc.zip.

Notice that the second-most over-represented sequence in the original has reduced by ~60%. You could improve this further with some more parameter tuning, for example by enforcing a minimum read length (-m {min read length}). But if I understand what you're trying to do -- automated adapter trimming -- realize that it's a hard problem and that results will vary dramatically from dataset to dataset no matter what trimming tool you use.

jdidion commented 7 years ago

My plan is to close this issue and open others:

jdidion commented 7 years ago

The SRA streaming feature is included in v1.1.7.

antonkulaga commented 7 years ago

atropos trim -T 4 -sra SRR2040662 -b CTGGAGTTCAGACGTGTGCTCT -o test.fq The resulting FastQC report: test_fastqc.zip. Notice that the second-most over-represented sequence in the original has reduced by ~60%.

Nothing changed, all overrepresented sequences have same ammount of counts as in original, maybe you confused the file when uploading the report?

UPD: sorry, did not notice that it cut part of the sequences. The problem is that the number of overrepresented sequences remained the same (it could not cut TCCGATCT part)

antonkulaga commented 7 years ago

But if I understand what you're trying to do -- automated adapter trimming -- realize that it's a hard problem and that results will vary dramatically from dataset to dataset no matter what trimming tool you use.

We have to deal with a huge amount of different SRA-s in the lab and writing down adapters and primers manually for each set of samples is definitely an overkill for us. That means that I will be happy if it will detect (and then clean) at least those adapters and primers that I see in FASTQC. It is a bit surprising that FASTQC, a more general tool, does the job of detecting adapters/primers better than atropos, the specialized tool to delete adapters.

jdidion commented 7 years ago

I understand the use case -- that's why I implemented the 'detect' command. Just understand that it was very recently developed and optimized for long, paired-end sequences. FASTQC has been around forever, and was created back when 36 bp reads were the norm, so I'm not surprised that it's better at handling those.

Atropos will get better with help from users like you. One thing people can do to help is submit complete bug reports. Instead of just saying "it doesn't work," also upload all the relevant information - inputs, expected outputs, actual outputs, stack traces, etc.

In the interim, for datasets where Atropos doesn't perform as well as FASTQC (single-end, shorter reads) you can write a script to extract the over-represented sequences from the FASTQC report, save them to FASTA, and feed those to atropos.

antonkulaga commented 7 years ago

that's why I implemented the 'detect' command

In Detect command it will be good to differentiate adapters detected in 3prime and 5prime sites inside fasta output file, so I will not have to bother what should go to atropos A and atropos B

you can write a script to extract the over-represented sequences from the FASTQC report,

For them I do not know if they are A and B, should I put them both to A and B?

jdidion commented 7 years ago

In most datasets there will only one or the other. The detect command outputs one fasta file per input read (so one fasta for single-end, two for paired-end). The additional bit needed is to determine whether the sequencing protocol used 3' or 5' adapters.

I'm thinking about allowing a config file to supplement options passed on the command line, so a reasonable first step would be for detect to also output a config file that has the suggested -a/-b/-g options set. How does that sound?

On Jun 3, 2017, at 8:37 AM, Anton Kulaga notifications@github.com wrote:

that's why I implemented the 'detect' command

In Detect command it will be good to differentiate adapters detected in 3prime and 5prime sites inside fasta output file, so I will not have to bother what should go to atropos A and atropos B

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

jdidion commented 7 years ago

you can write a script to extract the over-represented sequences from the FASTQC report,

For them I do not know if they are A and B, should I put them both to A and B?

First, I made a mistake before when I said '-b' (which means to search for the adapter anywhere in the read) I meant '-g' (which means to search for it at the front of the read. For the temporary FASTQC solution, you can use the '-g' option.

jdidion commented 7 years ago

I just added some new issues to cover the enhancements proposed here, so I'm closing this one.