koelling / dnacol

Color DNA/RNA bases in terminal output
MIT License

First line in FASTQ file coloured different from rest #1

Open klmr opened 7 years ago

klmr commented 7 years ago

In the following screenshot, produced by gunzip -c file.fastq.gz | dnacol, the first ID line contains highlighted fragments (the N just before the end, as well as the barcode). The subsequent ID lines, by contrast, aren’t highlighted. Why is that?

[Screenshot taken 2017-08-03 at 12:43:54]

(dnacol 0.3.2)

koelling commented 7 years ago

This is due to the way the auto-detection algorithm works: it only recognizes the FASTQ format once it has read the first four lines. Until then it operates in text mode, which means it colors any string of DNA that it finds.
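To illustrate the behavior described above, here is a minimal Python sketch. It is not dnacol's actual code: the names `DNA_RUN`, `looks_like_fastq` and `stream_modes`, the four-base threshold, and the FASTQ check itself are all assumptions made for illustration. It shows why the first record is handled in text mode when every line is emitted as soon as it is read.

```python
import re

# Assumption: "a string of DNA" means a run of at least four base letters.
DNA_RUN = re.compile(r"[ACGTN]{4,}")


def looks_like_fastq(lines):
    """Rough check on one 4-line record: '@' header, '+' separator,
    and sequence and quality lines of equal length."""
    return (
        len(lines) == 4
        and lines[0].startswith("@")
        and lines[2].startswith("+")
        and len(lines[1]) == len(lines[3])
    )


def stream_modes(lines):
    """Yield (mode, line) pairs, emitting each line immediately.

    The mode can only switch to 'fastq' after the fourth line has
    already been emitted, so the first record is always in text mode.
    """
    head = []
    mode = "text"
    for line in lines:
        yield mode, line
        if len(head) < 4:
            head.append(line)
            if len(head) == 4 and looks_like_fastq(head):
                mode = "fastq"


record = ["@SEQ1 1:N:0:ACGTAC", "GATTACAGATTACA", "+", "IIIIIIIIIIIIII"]
modes = [mode for mode, _ in stream_modes(record * 2)]
```

In this sketch the first ID line is still in text mode, so the barcode `ACGTAC` matches `DNA_RUN` and would be colored, while the second record's ID line is emitted in FASTQ mode.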

I could fix this by reading the first four lines into a buffer and only outputting them once I know what the format is. However, that might lead to very high memory usage if the lines are very long. It would also mean that the first three lines would only be written to the screen once a fourth line has been read, which might cause weird behavior in some edge cases.
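The buffering idea can be sketched as follows. This is again a hypothetical illustration, not dnacol's implementation: `buffered_modes` and `looks_like_fastq` are made-up names, and the FASTQ check is an assumed simplification.

```python
import itertools


def looks_like_fastq(lines):
    """Rough check on one 4-line record: '@' header, '+' separator,
    and sequence and quality lines of equal length."""
    return (
        len(lines) == 4
        and lines[0].startswith("@")
        and lines[2].startswith("+")
        and len(lines[1]) == len(lines[3])
    )


def buffered_modes(lines):
    """Hold back up to four lines, decide the format, then emit.

    Nothing is yielded until four lines have been read or the input
    ends, and those lines are all held in memory in the meantime.
    """
    it = iter(lines)
    head = list(itertools.islice(it, 4))
    mode = "fastq" if looks_like_fastq(head) else "text"
    # Replay the buffered lines first, then continue with the rest.
    for line in itertools.chain(head, it):
        yield mode, line


record = ["@SEQ1 1:N:0:ACGTAC", "GATTACAGATTACA", "+", "IIIIIIIIIIIIII"]
buffered = list(buffered_modes(record * 2))
```

Here every line, including the first ID line, is emitted in FASTQ mode. The cost is the one mentioned above: the first three lines stay in the buffer until a fourth line arrives, which matters for very long lines or for input that trickles in slowly.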

So I think for the moment it might be best to leave this as it is, but I'll keep the issue open for now. You can always avoid this by specifying --format=fastq!