tecangenomics / nudup

NuDup -- Marks/removes duplicate molecules based on the molecular tagging technology used in Tecan products.
http://www.tecangenomics.com
GNU Lesser General Public License v3.0

Not valid Read FASTQ File, ill-formatted Index sequences, possibly spaces #16

Closed: sklages closed this issue 6 years ago

sklages commented 6 years ago

I have an RRBS dataset from two different runs with different run lengths (SR50 and SR75). Index + UMI = 6 + 6. The run data have been demultiplexed so that the UMI is located in R2 as a read ('i6y6n').

Data from both runs have been merged: the two R1 files into one R1 file, and the two R2 files into one R2 file.

The merged R1 file has been mapped to mm9 using bsmap and should now be deduplicated with nugentechnologies-nudup-468c62e/nudup.py.

The error I get is:

2018-03-21 09:25:27,646 [     INFO] - Deduplicating NuGEN single end reads...
sed: couldn't write 193 items to stdout: Broken pipe
2018-03-21 09:25:28,554 [     INFO] - Not valid Read FASTQ File, ill-formatted Index sequences, possibly spaces
2018-03-21 09:25:28,555 [    ERROR] - No Valid molecular tag sequence information found in FASTQ header name, please provide a valid Index FASTQ file

The actual error results from nudup erroneously looking in the FASTQ header for the UMI. The error messages are not very helpful :-(

So there is probably something I missed, since it works fine with the unmerged data. Any idea where to start looking for the problem?

sklages commented 6 years ago

It is so simple ... I just needed to extend the allowed fastq extension settings: ALLOWED_FASTQ = ['.fq','.fastq.gz','fq.gz'].

So the message "Not valid Read FASTQ File, ill-formatted Index sequences" refers to the filename, not the content of the files!
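For anyone hitting the same thing, here is a minimal sketch of what a filename check with a more explicit message could look like. This is not nudup's actual code: the function name check_fastq_extension and the extension list below are assumptions made for illustration, and only the variable name ALLOWED_FASTQ is taken from the setting mentioned above.

```python
import os

# Hypothetical extension list for illustration; the list shipped in nudup.py may differ.
ALLOWED_FASTQ = ['.fq', '.fastq', '.fq.gz', '.fastq.gz']


def check_fastq_extension(path, allowed=ALLOWED_FASTQ):
    """Return path unchanged if its filename ends with an allowed extension.

    Raises ValueError naming the offending file and the accepted extensions,
    so a filename problem is not mistaken for a problem with the file content.
    """
    name = os.path.basename(path).lower()
    if not any(name.endswith(ext) for ext in allowed):
        raise ValueError(
            "Unsupported FASTQ filename %r: expected one of %s"
            % (path, ", ".join(allowed))
        )
    return path
```

Calling check_fastq_extension("merged_R2.fastq.gz") returns the path unchanged, while a name such as "merged_R2.txt" raises a ValueError that mentions the filename and the accepted extensions.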

You should really work on more meaningful error messages ... :unamused:

mlovci commented 6 years ago

Thank you for reporting this @sklages!