Open unode opened 6 years ago
Another perhaps simpler alternative is to add a stats_label="Human reads kept"
argument to all functions that can generate statistics.
Passing this argument would also explicitly specify which functions should or shouldn't produce statistics.
> Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.
I like this idea. It could be made more general, namely `label`, as it could potentially be used elsewhere.
However, I would always produce statistics. The argument would just add extra convenience.
This is related to #74 and, to some degree, to #73.
The current formats (`{fastq}`, `{mapping}`) are somewhat hard to use. Both formats leak internal information (e.g. `1:file preproc.lno10.pairs.1`) and require contextual information for interpretation (ref. to line numbers in `script.ngl` or the filenames).

The `fastq` format is particularly hairy when mixing samples with a variable number of input files. Since stats are computed per-file, you will end up with a variable number of `pair` and/or `single` files. Once `collect()`ed, zero lines will be added to ensure alignment.

The parsing and zero-filling problem could be addressed by using a long format or some other form of tidy data. These are popular with Python and R users that use the `pandas` and `tidyverse` frameworks.

To address the information leak problem and avoid the dependency on contextual information, I believe an explicit function call is required. This could be used to define explicit requirements for computing stats but also to provide additional metadata.
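To make the long-format idea concrete, here is a minimal Python sketch. It is not NGLess output or an NGLess API: the sample names, file labels, statistic names, values and column headers are all made up for illustration. It only shows that one row per (sample, file, stat) needs no zero-filling when samples have a different number of input files.

```python
import csv
import io

# Hypothetical per-file statistics for two samples with a different number
# of input files. All names and values are invented for this example.
STATS = {
    "sampleA": {
        "pair.1": {"numSeqs": 100, "maxSeqLen": 150},
        "pair.2": {"numSeqs": 100, "maxSeqLen": 150},
        "single": {"numSeqs": 20, "maxSeqLen": 148},
    },
    "sampleB": {
        "pair.1": {"numSeqs": 80, "maxSeqLen": 150},
    },
}


def to_long(stats):
    """Flatten nested stats into (sample, file, stat, value) rows."""
    rows = []
    for sample, files in stats.items():
        for file_label, values in files.items():
            for stat, value in values.items():
                rows.append((sample, file_label, stat, value))
    return rows


def write_tsv(rows):
    """Serialize long-format rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "file", "stat", "value"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(write_tsv(to_long(STATS)), end="")
```

A file in this shape should load directly with `pandas.read_csv(path, sep="\t")` or `readr::read_tsv()`. A sample that lacks a `single` file simply contributes no rows for it, so no padding is required.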
A proposal: instead of implicitly calculating statistics on `select()` #65, `preprocess()`, `as_reads()`, ... one could:

this could in turn produce `outputs/qcstats.tsv` (tab-delimited - visually aligned here):

See the notes on the rightmost side of the table above.
Aspects not discussed: 1) implicit dependency between `*stats()` functions and `qcstats({stats})` - should this be explicit too? 2) impact on internal hashing - dependency on user specified `name=` for `qcstats()`.