refresh-bio / DSRC

DSRC - DNA Sequence Reads Compressor
http://sun.aei.polsl.pl/dsrc/

Data corruption #28

Open i-strielkov opened 2 years ago

i-strielkov commented 2 years ago

Hi, we have been using your great tool for several years and it has saved us a lot of disk space! However, we have recently encountered an error that appears during DSRC encoding. The algorithm occasionally skips a number of reads at a seemingly random position and then continues. The resulting file contains artifacts like this:

@L183:321:CAFVJANXX:6:2213:18462:88964 3:N:0:0
TATAAATGGATTCTCTTTGTCCATGATCACAAAATAAGAAT@L183:321:CAFVJANXX:6:2213:5699:93216 3:N:0:0

Renaming the reads solves the problem. Do you happen to know what may cause such issues?
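For anyone wanting to locate such artifacts in a decoded file, here is a minimal Python sketch. It is only a heuristic, not part of DSRC: it flags any line that starts with a run of nucleotide letters immediately followed by `@` and further text, which is the pattern shown above. The function name `find_fused_lines` is made up for illustration, and a quality line that happens to start with the same characters would be a false positive.

```python
import re

# Heuristic (assumption): a sequence line fused with the next header looks like
# nucleotide letters, then '@', then at least one non-space character.
FUSED = re.compile(r"^[ACGTN]+@\S")

def find_fused_lines(lines):
    """Return 1-based numbers of lines that look like sequence + header fused."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if FUSED.match(line.rstrip("\n")):
            hits.append(number)
    return hits

# The two lines quoted in the report above.
sample = [
    "@L183:321:CAFVJANXX:6:2213:18462:88964 3:N:0:0\n",
    "TATAAATGGATTCTCTTTGTCCATGATCACAAAATAAGAAT"
    "@L183:321:CAFVJANXX:6:2213:5699:93216 3:N:0:0\n",
]
fused = find_fused_lines(sample)  # only the second line is flagged
```

The function accepts any iterable of lines, so an open file handle of the decoded FASTQ can be passed directly.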

The problems were encountered with this public dataset: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-10175/
In particular, the problem can be reproduced with this file: http://ftp.sra.ebi.ac.uk/vol1/run/ERR539/ERR5396174/AML_low_input_AAAACT_r2.fq.gz

Many thanks in advance for any information. Best, Ievgen

earonesty commented 2 years ago

if you have reads that are named the same they might be seen as dups, right? not sure the algo takes dup-read inputs well (which should never happen)

ggoussarov-evotec commented 2 years ago

Hi, I have been working with @i-strielkov on this, using the fastq files that were linked. After poking at the settings for a while, I have found that using a non-default buffer size (I tried -b11 and -b12) reproducibly changes which lines get corrupted, and that setting the buffer size larger than the file size (I tried -b500) removes the corrupted output. To me, this indicates that the issue is probably related to buffer handling rather than to the file itself.
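To see which lines change under each buffer setting, one option is to compare the original FASTQ with the compressed-then-decompressed output line by line. The Python sketch below does only the comparison and does not invoke DSRC; the function name `first_mismatch` is hypothetical, and both inputs are assumed to be plain-text line iterables (for example, open file handles).

```python
from itertools import zip_longest

def first_mismatch(original, roundtrip):
    """Return (line_number, original_line, roundtrip_line) for the first
    differing line, or None if both inputs are identical.
    A missing line (one input is shorter) is reported as None."""
    pairs = zip_longest(original, roundtrip)
    for number, (expected, actual) in enumerate(pairs, start=1):
        if expected != actual:
            return number, expected, actual
    return None
```

Running this once per buffer setting should show the first mismatch moving between settings and disappearing when the buffer is larger than the file, if the behaviour described above holds.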

In addition, I have also verified that all read names are indeed unique, since this was proposed as a potential cause.
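A uniqueness check of this kind can be sketched in a few lines of Python. This is an illustration, not necessarily the method used here: it assumes a well-formed, four-lines-per-record FASTQ and treats the first whitespace-delimited token of each header as the read name. The function name `duplicate_read_names` is made up for the example.

```python
from collections import Counter

def duplicate_read_names(lines):
    """Return {read_name: count} for names seen more than once.
    Assumes strict 4-line FASTQ records (header, sequence, '+', quality)."""
    names = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 0:  # header line of each record
            header = line.strip()
            if header:
                # Keep only the read ID, dropping the '@' and any comment field.
                names[header.split()[0].lstrip("@")] += 1
    return {name: count for name, count in names.items() if count > 1}
```

An empty result means every read name occurred exactly once. For very large files this keeps all names in memory, which may or may not be acceptable.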