s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

SuperDeduper ignoring reads #238

Closed channsoden closed 2 years ago

channsoden commented 3 years ago

Testing SuperDeduper on 26bp single-end reads, it ignores all reads despite their high qualities and lack of Ns. Why is this?

samhunter commented 3 years ago

Are you running with any parameters?

Maybe try: -s 5 -l 20

This asks SuperDeduper to start at base 5 and consider a 20bp fragment.

If that doesn't give you the behavior you want, could you:

1) Provide the log file (e.g. `-L dedup.log`)
2) Provide an example file with the first ~100 reads

channsoden commented 3 years ago

Both with default start/length and 5/20.

Call: `hts_SuperDeduper -U DKR100_S42_L001_R1_001.fastq.gz -L hts_sd_5-20_test.stats.log -s 5 -l 20`

Attached: DKR100_S42_L001_R1_001.fastq.gz, hts_sd_5-20_test.stats.log, hts_sd_test.stats.log

samhunter commented 3 years ago

Hi @channsoden I have been able to replicate the behavior you reported, and I think I found the bug in SuperDeduper. Let me do a little bit of testing and get back to you.

channsoden commented 3 years ago

Thanks, @samhunter. Curious what is going on here.

samhunter commented 3 years ago

Hi again @channsoden. So it turns out this is actually not a "bug"; it's a "feature".

When you work with PE data, we require two pieces, one from each read, in order to form a unique key for the fragment: R1[start:length] + R2[start:length]

Our thought was that we should require the same level of evidence for an SE read duplicate, so we use: R1[start:length*2]

If you use -s 1 -l 12, it should work. You will actually be filtering for duplicates using 24bp of the read as a key.
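A minimal Python sketch of the key logic described above may make the behavior clearer. This is not HTStream's actual code; the function name and defaults are illustrative, and `start` is treated as 1-based to mirror the `-s` option:

```python
def dedup_key(r1, r2=None, start=1, length=10):
    """Return the dedup key for a fragment, or None if the reads are too short
    (in which case SuperDeduper would ignore the read)."""
    s = start - 1  # convert 1-based CLI start to 0-based index
    if r2 is not None:
        # PE: one piece from each read forms the key
        if len(r1) < s + length or len(r2) < s + length:
            return None
        return r1[s:s + length] + r2[s:s + length]
    # SE: require the same amount of evidence from a single read
    if len(r1) < s + 2 * length:
        return None
    return r1[s:s + 2 * length]

# A 26bp single-end read:
read = "ACGT" * 6 + "AC"  # 26 bases

# -s 5 -l 20 would need bases 5..44 (40bp of sequence) -> read ignored
print(dedup_key(read, start=5, length=20))  # None

# -s 1 -l 12 needs only bases 1..24 -> a 24bp key
print(dedup_key(read, start=1, length=12))
```

This shows why the 26bp reads were ignored with the earlier settings: the SE path consumes `2 * length` bases starting at `start`, which exceeded the read length.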

We will discuss whether this behavior is what we want to keep going forward (with better documentation), or change it somehow.

channsoden commented 3 years ago

That makes sense. Thank you very much, @samhunter.

I think documenting the behavior, and maybe raising an exception or warning for length overruns, would suffice.