ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de

Add interleaved FastQ support #55

Closed unode closed 5 years ago

unode commented 6 years ago

The format is not formally described but is used in the wild. A quick search turned up no mention of how 'singles' are handled. Possibilities include:

  1. Output .1 followed by .2 and add singles at the end of the file.
  2. Tolerate unpaired singles in the middle of the file.

The second variant is more versatile (e.g. for filter()) as it doesn't require a second file to hold reads as they are being processed.
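As a rough illustration of variant 2, here is a minimal Python sketch of a reader that tolerates singles anywhere in the stream. It assumes mates are named with the `.1`/`.2` suffixes used in this thread and that a pair is always two adjacent records. The names `parse_fastq`, `split_mate` and `group_reads` are hypothetical and are not part of NGLess.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str, str]  # (name without '@', sequence, qualities)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence, qualities) from 4-line FastQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        rest = [line.rstrip("\n") for line in islice(it, 3)]
        if len(rest) != 3:
            raise ValueError(f"truncated record: {header!r}")
        seq, plus, qual = rest
        if not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual


def split_mate(name: str) -> Tuple[str, Optional[str]]:
    """Split 'A.1' into ('A', '1'); assumed naming convention, not a standard."""
    base, dot, mate = name.rpartition(".")
    if dot and mate in ("1", "2"):
        return base, mate
    return name, None


def group_reads(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield [mate1, mate2] for adjacent mates and [read] for singles."""
    pending: Optional[Record] = None
    for rec in records:
        base, mate = split_mate(rec[0])
        if pending is not None:
            p_base, p_mate = split_mate(pending[0])
            if p_mate == "1" and mate == "2" and p_base == base:
                yield [pending, rec]
                pending = None
                continue
            yield [pending]  # the previous read had no adjacent mate
        pending = rec
    if pending is not None:
        yield [pending]
```

Only one record is held back at a time, so a filter can emit pairs and singles in the order it sees them without a second file.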

luispedro commented 6 years ago

Yeah, these "vaguely defined" file formats are a PITA.

In NGLess, though, I always want to err on the side of strict output at the cost of computational efficiency (better to wait a few more minutes than to waste a week debugging a weird file format error), so IMHO format variant 1 is better.

unode commented 6 years ago

Just to clarify, variant 1 actually means:

@A.1
@A.2
@B.1
@B.2
...

and not

@A.1
@B.1
...
@A.2
@B.2

I have seen the last variant (concatenated) but it's a PITA to work with if you actually want to extract information from pairs.
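To make the difference between the two orderings concrete, here is a small Python sketch. It assumes both mate lists are complete and in the same order; `interleave` and `concatenated_to_interleaved` are hypothetical helper names. Note that the conversion from the concatenated layout needs the whole first half available before any pair can be emitted, which is part of what makes that layout awkward.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def interleave(mates1: Iterable[T], mates2: Iterable[T]) -> Iterator[T]:
    """Yield A.1, A.2, B.1, B.2, ... from two equally long mate streams."""
    missing = object()
    for r1, r2 in zip_longest(mates1, mates2, fillvalue=missing):
        if r1 is missing or r2 is missing:
            raise ValueError("mate streams differ in length")
        yield r1
        yield r2


def concatenated_to_interleaved(reads: Sequence[T]) -> List[T]:
    """Reorder A.1, B.1, ..., A.2, B.2, ... into A.1, A.2, B.1, B.2, ...

    Assumes the first half holds all first mates and the second half
    holds the matching second mates in the same order.
    """
    half, remainder = divmod(len(reads), 2)
    if remainder:
        raise ValueError("concatenated input must hold an even number of reads")
    return list(interleave(reads[:half], reads[half:]))
```

For example, `concatenated_to_interleaved(["A.1", "B.1", "A.2", "B.2"])` puts each pair back together as adjacent entries.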

luispedro commented 6 years ago

bwa now supports this as an input format, so if we used it internally when calling bwa, it could save having to make two calls (which can have IO costs, as it implies that the databases are loaded twice).
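A sketch of what a single call could look like from Python, assuming `bwa mem` with its `-p` option (which tells it to treat the first reads file as interleaved paired-end). The helper name and file names are hypothetical, the command is only built here and not executed, and how reads without an adjacent mate are handled depends on the bwa version, so that is not assumed.

```python
import shlex
from typing import List


def bwa_mem_interleaved_cmd(index_prefix: str, interleaved_fq: str,
                            threads: int = 1) -> List[str]:
    """Build an argv list for one `bwa mem` run on an interleaved FastQ file."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "bwa", "mem",
        "-p",                 # first reads file is interleaved paired-end
        "-t", str(threads),   # number of threads
        index_prefix,
        interleaved_fq,
    ]


cmd = bwa_mem_interleaved_cmd("ref.fa", "reads.interleaved.fq", threads=4)
# Quote each argument so the command can be logged or pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, so the index is loaded once per sample rather than once per input file.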

luispedro commented 6 years ago

Not closing yet, as I think that fully reaping the benefits would mean using the format when calling bwa and when calling external modules as well.