open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License

Fix corrupt dedup output for inputs with quotes #194

Closed hkariti closed 10 months ago

hkariti commented 11 months ago

When running dedup on a pairsam file that includes quotes in the QUAL field, the result is a corrupt file. The to_csv method wraps the entire field in quotes and also escapes the inner quote with a second quote. This results in a file where QUAL and SEQ have different lengths. To fix this, we ask DataFrame.to_csv to never quote the output.
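A minimal sketch of the failure mode, using only the standard `csv` module (which, as far as I know, is what `DataFrame.to_csv` writes through). The read below is made up for illustration, and `write_row` is a hypothetical helper, not part of pairtools. It compares the default quoting with quoting disabled:

```python
import csv
import io

# Made-up read for illustration: SEQ and QUAL must have the same length.
seq = "ACGTAC"
qual = 'II"III'  # '"' is ASCII 34, a legal Phred+33 quality character


def write_row(fields, **writer_kwargs):
    """Serialize one tab-separated row and return it without the newline."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n", **writer_kwargs)
    writer.writerow(fields)
    return buf.getvalue().rstrip("\n")


# Default (csv.QUOTE_MINIMAL): a field containing the quote character is
# wrapped in quotes and the inner quote is doubled.
default_row = write_row([seq, qual])

# Quoting disabled: the field is written verbatim. quotechar=None tells the
# writer that '"' is an ordinary character, so no escape character is needed.
raw_row = write_row([seq, qual], quoting=csv.QUOTE_NONE, quotechar=None)

default_qual = default_row.split("\t")[1]
raw_qual = raw_row.split("\t")[1]

print(repr(default_qual), len(default_qual), "vs SEQ length", len(seq))
print(repr(raw_qual), len(raw_qual), "vs SEQ length", len(seq))
```

With pandas, the equivalent of the second call is passing `quoting=csv.QUOTE_NONE` to `DataFrame.to_csv`.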

Phlya commented 10 months ago

Looks good, thank you for the fix! I am curious - where do you encounter such files though?

hkariti commented 10 months ago

Heh. It was a regular sequencing output, nothing fancy. It was done by an external company and I think they're using some new machine. Maybe it decided to utilize the full range of QUAL values for this one :)