channsoden opened this issue (closed 2 years ago):

Testing SuperDeduper on 26 bp single-end reads, it ignores all reads despite high qualities and a lack of Ns. Why is this?
Are you running with any parameters?
Maybe try: -s 5 -l 20
to ask SuperDeduper to start at base 5 and use a 20 bp fragment.
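(For illustration only, here is a minimal Python sketch of what a start/length window is intended to select; the function name and the 1-based start convention are my assumptions, not htstream code.)

```python
def key_window(seq, start=5, length=20):
    # Take `length` bases beginning at 1-based position `start`;
    # this window is what the duplicate key is meant to be built from.
    return seq[start - 1 : start - 1 + length]

print(key_window("ACGTACGTACGTACGTACGTACGTAC"))  # bases 5-24 of a 26 bp read
```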
If that doesn't give you the behavior you want, could you:
1) Provide the log file (e.g. -L dedup.log)
2) Provide an example file with the first ~100 reads
Both with default start/length and 5/20.
Call:
$ hts_SuperDeduper -U DKR100_S42_L001_R1_001.fastq.gz -L hts_sd_5-20_test.stats.log -s 5 -l 20

Attached: DKR100_S42_L001_R1_001.fastq.gz, hts_sd_5-20_test.stats.log, hts_sd_test.stats.log
Hi @channsoden I have been able to replicate the behavior you reported, and I think I found the bug in SuperDeduper. Let me do a little bit of testing and get back to you.
Thanks, @samhunter. Curious what is going on here.
Hi again @channsoden. So it turns out this is actually not a "bug", it's a "feature".
When you work with PE data, we require two pieces, one from each read, to form a unique key for the fragment: R1[start:length] + R2[start:length]
Our thought was that we should require the same level of evidence for an SE read duplicate, so we use: R1[start:length*2]
With a 26 bp read, -s 5 -l 20 would need bases 5 through 44 for the SE key, which is longer than the read, so every read is skipped. If you use -s 1 -l 12 it should work; you will actually be filtering for duplicates using 24 bp of the read as the key.
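To make the arithmetic concrete, here is a rough Python sketch of the key construction described above (the actual SuperDeduper is C++; the function names and the 1-based start handling are assumptions for illustration):

```python
def pe_key(r1, r2, start, length):
    # PE: one piece from each read forms the fragment key.
    return r1[start - 1 : start - 1 + length] + r2[start - 1 : start - 1 + length]

def se_key(r1, start, length):
    # SE: the same amount of evidence is taken from the single read.
    return r1[start - 1 : start - 1 + 2 * length]

read = "A" * 26  # a 26 bp single-end read

# -s 5 -l 20: the SE key would need bases 5-44, longer than the read,
# so only 22 bases are available instead of the required 40.
print(len(se_key(read, 5, 20)))  # 22

# -s 1 -l 12: the SE key uses bases 1-24, which fits within 26 bp.
print(len(se_key(read, 1, 12)))  # 24
```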
We will discuss whether this behavior is what we want to keep going forward (with better documentation), or change it somehow.
That makes sense. Thank you very much, @samhunter.
I think documenting the behavior, and maybe adding an exception or warning for length overruns, would suffice.
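Purely as a sketch of the kind of check that warning could be (made-up names, not htstream code):

```python
def warn_if_key_overruns(read_len, start, length, paired_end):
    # The SE key spans 2 * length bases (see discussion above),
    # so a short read can never yield a complete key.
    needed = (start - 1) + (length if paired_end else 2 * length)
    if read_len < needed:
        raise ValueError(
            f"read length {read_len} bp is shorter than the {needed} bp needed "
            f"for the duplicate key (start={start}, length={length}, PE={paired_end})"
        )

# A 26 bp SE read with -s 5 -l 20 would raise here
# instead of being silently skipped.
try:
    warn_if_key_overruns(26, 5, 20, paired_end=False)
except ValueError as err:
    print(err)
```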