ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de

Reading in fastq.gz files #115

Closed luispedro closed 4 years ago

luispedro commented 5 years ago

Originally posted on https://github.com/ngless-toolkit/ngless2018benchmark/issues/2

NGLess's load_mocat_sample does not ignore empty lines at the end of a fastq.gz file. I am not sure whether this is a bug or simply something it was never meant to handle.

This happens on both version 1.0.1 and version 0.7.1.

Exiting after fatal error while loading and running script
Data Error (the input data did not conform to ngless' expectations)
Number of input lines in FastQ file is not a multiple of 4
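
The error above points at the line count, so one rough way to inspect a file independently of NGLess is to count its lines and see how many empty ones sit at the very end. The sketch below is only a diagnostic under that assumption; `summarize_fastq_lines` is a hypothetical helper name and is not part of NGLess.

```python
import gzip


def summarize_fastq_lines(path):
    """Return (total_lines, trailing_empty_lines) for a gzip-compressed FastQ file.

    Hypothetical diagnostic helper, not part of NGLess.
    """
    total = 0
    trailing_empty = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            total += 1
            if line.strip() == "":
                # Keep counting while we are inside a run of empty lines.
                trailing_empty += 1
            else:
                # A non-empty line means the previous empty run was not trailing.
                trailing_empty = 0
    return total, trailing_empty
```

A result where `total % 4 == 1` and `trailing_empty == 1` would be consistent with a single stray empty line at the end of the file.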

luispedro commented 5 years ago

My first impression is that NGLess is correct here. A FastQ file that contains empty lines should be considered malformed.

Is it normal to see this "in the wild"? Is it really just the special case that lines are present at the very end of the file?

unode commented 5 years ago

I've seen cases where empty lines appear in the middle of the file if the sequence has length 0. Not really common (or useful) practice but some software can produce this as part of quality control/trimming.

The (python) FastQ libraries I used at the time managed to parse this as an empty sequence.

luispedro commented 5 years ago

@unode: Good point! As long as you still have a header, then an empty line, then +, then another empty line, this should be a well-formed empty sequence. A bit strange, but I can see how it would emerge, and it makes sense to parse it correctly.
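
To make the record layout described here concrete, the following is a minimal sketch of a four-line FastQ parser that accepts a zero-length sequence as long as the header, `+` line, and (empty) quality line are all present. It assumes the simple unwrapped four-line layout; `parse_fastq` is a hypothetical name and this is not NGLess's actual parser.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FastQ lines.

    Assumes the simple four-line layout (no wrapped sequences).
    An empty sequence with an empty quality string is accepted.
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("number of lines is not a multiple of 4")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1}: header must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"line {i + 3}: separator must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"line {i + 4}: quality length differs from sequence length")
        yield header[1:], seq, qual
```

With this sketch, the four lines `@r2`, empty, `+`, empty parse to the record `("r2", "", "")` rather than raising an error.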

@waakanni: is this the issue you are observing?

waakanni commented 5 years ago

@luispedro Yes. The case I observed, however, involves an empty line at the very end of the fastq.gz file.

luispedro commented 5 years ago

OK, is it just one empty line? Where do these samples come from?

I'm not against special casing this particular thing and issuing just a warning.
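
A special case of this kind could look roughly like the sketch below: drop exactly one empty line at the end of the input and emit a warning instead of failing. The function name and the warning text are hypothetical; this only illustrates the idea and does not reflect anything NGLess actually does.

```python
import warnings


def drop_trailing_empty_line(lines):
    """Return the lines without a single stray empty line at the end.

    The last line is only dropped when it is empty AND the total line count
    is one more than a multiple of 4, so a legitimate empty quality line
    (the final line of a zero-length record) is left alone.
    """
    lines = list(lines)
    if lines and lines[-1].strip() == "" and len(lines) % 4 == 1:
        warnings.warn("ignoring one empty line at the end of FastQ input")
        return lines[:-1]
    return lines
```

Checking the line count before dropping anything keeps this compatible with well-formed empty sequences, whose last line is also empty.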

waakanni commented 5 years ago

For my samples it is just the single line at the end of the file.

I can't say for definite the origin of the samples as I am not responsible for them.

waakanni commented 5 years ago

Sorry for the delayed response. To follow up on your question, @luispedro:

I have confirmed with Micheal that these files were generated a long time ago by MOCAT and that we don't really expect more files with the empty line in the future.

Thank you.

luispedro commented 4 years ago

I think this can be closed. It's not clear that we should really consider it a bug.