ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de

Docs for load_mocat_sample #122

Closed mkuhn closed 5 years ago

mkuhn commented 5 years ago

The documentation for load_mocat_sample in stdlib.md doesn't explain how single files should be named. From reading the source code, I think they have to be named

x.pair.1.fq.gz
x.pair.2.fq.gz
x.singles.fq.gz

... but I'm not sure if this is set in stone. (This might also explain the reason behind #120.)

unode commented 5 years ago

With:

sample
├── sample.fq.bz2
├── sample.pair.1.fq.bz2
├── sample.pair.2.fq.bz2
└── sample.singles.fq.bz2

load_mocat_sample gives:

load_mocat_sample found paired-end sample 'sample/sample.pair.1.fq.bz2' - 'sample/sample.pair.2.fq.bz2'
load_mocat_sample found single-end sample 'sample/sample.fq.bz2'
load_mocat_sample found single-end sample 'sample/sample.singles.fq.bz2'

whereas with:

sample
├── sample.fq.bz2
├── sample.pair.1.fq.bz2
├── sample.pair.2.fq.bz2
└── sample.single.fq.bz2  (note: "single" here, not "singles")

gives:

load_mocat_sample found paired-end sample 'sample/sample.pair.1.fq.bz2' - 'sample/sample.pair.2.fq.bz2' with singles file 'sample/sample.single.fq.bz2'
load_mocat_sample found single-end sample 'sample/sample.fq.bz2'

So the correct name is x.single.fq.gz (or .bz2, as in this example), not x.singles.fq.gz.

That said (and correct me if I'm wrong, @luispedro), this shouldn't make much of a difference in practice.

The only case where this may make a difference is if using load_mocat_sample and, directly after, using map. Here in the first case the mapper (bwa/minimap2) would be called 3 times, and in the second case only 2.

If preprocess() is called after load_mocat_sample, all pairs and singles should be merged into three files. With ngless 0.11.0 or above, the number of mapper calls is reduced to 1, thanks to https://github.com/ngless-toolkit/ngless/commit/412531775d15a05e70bc7ffc29f53f3419484af9.
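
For illustration, the grouping seen in the two log outputs above can be modelled with a short Python sketch. This is not NGLess's implementation. The function name group_sample_files, the regular expressions, and the rule that a ".single." file attaches to the pair with the same base name are assumptions inferred only from the output shown in this thread.

```python
import re
from collections import defaultdict

# Hypothetical model of the grouping seen in the log output above.
# This is NOT NGLess's actual code; the patterns are assumptions.
EXT = r"(?P<ext>fq(?:\.(?:gz|bz2))?)"
PAIR_RE = re.compile(r"^(?P<base>.+)\.pair\.(?P<mate>[12])\." + EXT + r"$")
SINGLE_RE = re.compile(r"^(?P<base>.+)\.single\." + EXT + r"$")


def group_sample_files(filenames):
    """Group file names into paired-end and single-end entries."""
    pairs = defaultdict(dict)   # (base, ext) -> {"1": file, "2": file}
    singles = {}                # (base, ext) -> '.single.' file
    others = []                 # anything else is treated as single-end
    for name in sorted(filenames):
        m = PAIR_RE.match(name)
        if m:
            pairs[(m["base"], m["ext"])][m["mate"]] = name
            continue
        m = SINGLE_RE.match(name)
        if m:
            singles[(m["base"], m["ext"])] = name
            continue
        others.append(name)

    groups = []
    for key, mates in sorted(pairs.items()):
        if "1" in mates and "2" in mates:
            # A '.single.' file with the same base name joins the pair.
            groups.append(("paired", mates["1"], mates["2"], singles.pop(key, None)))
        else:
            # An incomplete pair falls back to single-end entries.
            others.extend(mates.values())
    # Unattached '.single.' files are also treated as single-end.
    others.extend(singles.values())
    groups.extend(("single", name) for name in sorted(others))
    return groups


with_singles = group_sample_files([
    "sample.fq.bz2", "sample.pair.1.fq.bz2",
    "sample.pair.2.fq.bz2", "sample.singles.fq.bz2",
])
with_single = group_sample_files([
    "sample.fq.bz2", "sample.pair.1.fq.bz2",
    "sample.pair.2.fq.bz2", "sample.single.fq.bz2",
])
print(len(with_singles), len(with_single))
```

In this model the ".singles." layout yields three groups and the ".single." layout yields two, which matches the three versus two mapper calls described above.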

luispedro commented 5 years ago

> The only case where this may make a difference is if using load_mocat_sample and, directly after, using map. Here in the first case the mapper (bwa/minimap2) would be called 3 times, and in the second case only 2.

The mapper is now called only once in all cases, as NGLess takes care of streaming the reads uncompressed and in interleaved format.