swarris / Pacasus

Correction of palindromes in long reads from PacBio and Nanopore
MIT License

Missing sequences in output file #16

Open Mailinnia opened 4 years ago

Mailinnia commented 4 years ago

Hi,

I have been running Pacasus on some of my data, and some sequences that appear in the input file are missing from the output file. I can see in the log that Pacasus adds these sequences as hits, but no 'formatting hit' line ever appears for them.

image

In contrast, here is a sequence that appears in both the input and the output file:

image

How do I solve this issue?

swarris commented 4 years ago

I'm not completely sure why this happens. It could be a bug in the filter step, in which reads are removed because they are too short. Could you share the reads for which this happens, so I can investigate further?

Thanks!

Mailinnia commented 4 years ago

Sure. How do I send you the files? I can send you both the missing reads (fastq) and the log report. We tried running Pacasus on the missing reads again, but there was no output file.

We have run Pacasus on more of our data, and we see the same pattern with reads going missing.
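For anyone who wants to list the affected reads before sharing them, here is a minimal sketch in Python. The helper names `fastq_ids` and `missing_reads` are hypothetical and not part of Pacasus. It assumes plain 4-line FASTQ records, and it assumes that an output read ID is either identical to the input ID or starts with it (for example when a read is split). That naming convention is an assumption and should be checked against your own output.

```python
def fastq_ids(path):
    """Yield the read ID of each record, assuming 4-line FASTQ records."""
    with open(path) as handle:
        # Ignore blank lines so a trailing newline does not shift the record count.
        lines = (line for line in handle if line.strip())
        for i, line in enumerate(lines):
            if i % 4 == 0:
                # Header line: '@<id> <optional description>'
                yield line[1:].split()[0]


def missing_reads(input_fastq, output_fastq):
    """Return input read IDs that have no matching record in the output.

    ASSUMPTION: an output ID matches if it equals the input ID or begins
    with it. Prefix matching can give false matches (e.g. 'read1' vs
    'read10'), so tighten this rule if your IDs overlap.
    """
    out_ids = set(fastq_ids(output_fastq))
    return [
        rid
        for rid in fastq_ids(input_fastq)
        if not any(out_id.startswith(rid) for out_id in out_ids)
    ]
```

Calling `missing_reads("input.fastq", "output.fastq")` (file names are placeholders) returns the IDs to extract and share. The prefix scan is quadratic, which is fine for a quick check but slow on very large files.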

swarris commented 4 years ago

Cool. You can either post a download link here (Dropbox, etc.) or send the files or a download link to: s.warris@gmail.com

ericsong commented 4 years ago

I was seeing this as well, and I was able to recover the missing reads by adding --minimum_read_length=0.

The interesting thing is that it doesn't seem to be simply dropping reads below a certain length. In one of my cases, a 257 bp read was split into reads of lengths [5, 6, 9, 10, 14, 20, 96, 97], but the 96 bp read was the only one that was in the original output.
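To make that observation concrete, here is a small Python sketch that checks whether any plain "keep reads at least N long" threshold could produce the observed result. It is not Pacasus's actual filter code, and the function names `surviving_lengths` and `threshold_explains` are made up for illustration. The only inputs taken from this thread are the split lengths and the fact that only the 96 bp read was kept.

```python
def surviving_lengths(lengths, minimum_read_length):
    """Keep the lengths that pass a simple minimum-length threshold."""
    return [n for n in lengths if n >= minimum_read_length]


def threshold_explains(lengths, kept):
    """Return True if some threshold keeps exactly the reads in `kept`."""
    # Thresholds at each distinct length, plus one below and one above
    # the range, cover every distinct outcome of a threshold filter.
    candidates = [0] + sorted(set(lengths)) + [max(lengths) + 1]
    return any(surviving_lengths(lengths, t) == kept for t in candidates)


split_lengths = [5, 6, 9, 10, 14, 20, 96, 97]

# The pieces add up to the original 257 bp read.
assert sum(split_lengths) == 257

# No single threshold keeps the 96 bp piece while dropping the 97 bp one.
assert threshold_explains(split_lengths, [96]) is False
```

Any threshold that lets the 96 bp piece through also lets the 97 bp piece through, so the missing reads cannot be explained by a length cutoff alone.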

swarris commented 4 years ago

Thanks for the update. I have not had the time to debug. This information will definitely help in tracking down the issue.