ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de
Other
142 stars 24 forks source link

Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

Open unode opened 6 years ago

unode commented 6 years ago

This is related to #74 and to some degree with #73 .

The current formats ({fastq}, {mapping}) are somewhat hard use. Both formats leak internal information (e.g. 1:file preproc.lno10.pairs.1) and require contextual information for interpretation (ref. to line numbers in script.ngl or the filenames).

The fastq format is particularly hairy when mixing samples with a variable number of input files. Since stats are computed per-file, you will end up with a variable number of pair and or single files. Once collect()ed, zero lines will be added to ensure alignment.

The parsing and zero-filling problem could be addressed by using a long format or some other form of tidy data. These are popular with Python and R users that use pandas and tidyverse frameworks.

To address the information leak problem and avoid contextual information dependency, I believe an explicit function call is required. This could be use to define explicit requirements for computing stats but also to provide additional metadata.

A proposal: instead of implicitly calculating statistics on select() #65 , preprocess(), as_reads(), ... one could:

...
mapped = map(reads, reference='hg19')

mappingstats(mapped, name="Human reads mapped")          <----

mapped = select(mapped) using |mr|:
    if mr.flag({unmapped}):
        discard

mappingstats(mapped, name="Human reads kept")            <----

reads = as_reads(mapped)
reads = preprocess(reads, keep_singles=True) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

fastqstats(reads, name="Human reads after strict QC")    <----

collect(qcstats({stats}),                                <----
        ofile='outputs/qcstats.tsv',
        current=sample, allneeded=samples)

this could in turn produce outputs/qcstats.tsv (tab-delimited - visually aligned here):

sample     steptype   step                          identifier     feature                     value
SAMPLE_A   mapping    Human reads mapped                   sam   reference                      hg19  (if internal reference, /path otherwise)
SAMPLE_A   mapping    Human reads mapped                   sam       total                   8348246
SAMPLE_A   mapping    Human reads mapped                   sam     aligned                   5678246
SAMPLE_A   mapping    Human reads mapped                   sam        file                     /path  (/path or 'step + identifier' if exists)
SAMPLE_A   mapping    Human reads mapped                   sam         ...                       ...  (other fields in mapping output)
SAMPLE_A   mapping    Human reads kept                     sam       total                   5678246
SAMPLE_A   mapping    Human reads kept                     sam     aligned                   5678246
SAMPLE_A   mapping    Human reads kept                     sam        file    Human reads mapped-sam  (/path or 'step + identifier' if exists)
SAMPLE_A   mapping    Human reads kept                     sam         ...                       ...
SAMPLE_A   fastq      Human reads after strict QC       pair.1        file      Human reads kept-sam
SAMPLE_A   fastq      Human reads after strict QC       pair.1     numSeqs                   2839123
SAMPLE_A   fastq      Human reads after strict QC       pair.2     numSeqs                   2839123
SAMPLE_A   fastq      Human reads after strict QC       single     numSeqs                         0  (always present even if zero or non-existing)
SAMPLE_A   fastq      Human reads after strict QC          ...         ...                       ...
SAMPLE_B   mapping    Human reads mapped                   sam       total                   6412432
SAMPLE_B   mapping    Human reads mapped                   sam     aligned                   4327644
SAMPLE_B   mapping    Human reads mapped                   sam         ...                       ...
SAMPLE_B   mapping    Human reads kept                     sam       total                   4327644
SAMPLE_B   mapping    Human reads kept                     sam     aligned                   4327644
SAMPLE_B   mapping    Human reads kept                     sam         ...                       ...
SAMPLE_B   fastq      Human reads after strict QC       pair.1        file      Human reads kept-sam
SAMPLE_B   fastq      Human reads after strict QC       pair.1     numSeqs                         0  (always present)
SAMPLE_B   fastq      Human reads after strict QC       pair.2     numSeqs                         0  (always present)
SAMPLE_B   fastq      Human reads after strict QC       single     numSeqs                   4327644  (all single end)
SAMPLE_B   fastq      Human reads after strict QC          ...         ...                       ...

See notes on the rightmost side of above table.

Aspects not discussed: 1) implicit dependency between *stats() functions and qcstats({stats}) - should this be explicit too? 2) impact on internal hashing - dependency on user specified name= for qcstats().

unode commented 6 years ago

Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.

Passing this argument would also explicitly specify which functions should or shouldn't produce statistics.

luispedro commented 6 years ago

Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.

I like this idea, it could be made more general, namely label as it could potentially be used elsewhere.

But, I would always produce statistics. This would just add extra convenience.