Closed olgabot closed 4 years ago
More reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6720031/
example:
# calculate the signature
sourmash compute -k 21,31,51 \
--scaled 2000 \
--track-abundance \
-o GCF_000005845.2_ASM584v2_genomic.sig \
GCF_000005845.2_ASM584v2_genomic.fna.gz
Relatedly, it would be great to get this added as well, since with --scaled, the number of hashes will change for every sample since they all have different sequencing depths: https://github.com/nf-core/kmermaid/issues/14
So, when the flag is scaled_sketch_log2, do we have to do a calculation before we run sourmash compute to get that value, since --scaled accepts only integers?
I think it can be like here:
https://github.com/nf-core/kmermaid/blob/4094711d73b19b978ae4ce37e8b339c178079bc7/main.nf#L887
Where the value is fed into bash to do the exponentiation so we don't have to do it ourselves. Or, I'm sure there's probably a way to do it with Nextflow/groovy, I just don't know it.
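A minimal sketch of letting the shell do the exponentiation, assuming the log2 value arrives as a plain integer. The function name `scaled_from_log2` is hypothetical and is not taken from main.nf; a left shift is used because it equals 2 to the power n and works in plain sh as well as bash (bash also accepts `2 ** n`).

```shell
# Hypothetical helper: turn a log2 exponent into the integer that --scaled expects.
# 1 << n is the same as 2 to the power n for non-negative integer n.
scaled_from_log2() {
  echo $(( 1 << $1 ))
}

# Example: a log2 value of 11 gives a --scaled value of 2048
scaled=$(scaled_from_log2 11)
echo "--scaled ${scaled}"
```

The result could then be interpolated into the sourmash compute command in place of a hard-coded --scaled value.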
solved in #81
Right now there's only a "flat rate" option that sketches every sample with the same fixed number of hashes, but there should be an option to use --scaled, which keeps roughly 1 in every N hashes (e.g. 1 in 1000 with --scaled 1000) instead of the same number for all samples. This accounts for sequencing depth.
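To make the difference concrete, here is a rough back-of-the-envelope sketch. It assumes the sketch size under --scaled N is approximately the number of distinct k-mers divided by N; the k-mer counts are made-up illustration values, and `expected_hashes` is a hypothetical helper, not part of sourmash.

```shell
# Hypothetical helper: approximate number of hashes kept for a given --scaled value.
# $1 = distinct k-mers in the sample (made-up numbers below), $2 = scaled value
expected_hashes() {
  echo $(( $1 / $2 ))
}

# A shallow and a deep sample, both sketched with --scaled 1000
expected_hashes 1000000 1000    # prints 1000
expected_hashes 50000000 1000   # prints 50000
```

With a flat number of hashes both samples would get the same sketch size regardless of depth, whereas here the deeper sample keeps proportionally more hashes.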
Personally, I like using log2 scaling because I think life is log2 scaled and it's much easier to increase to a "natural" amount that's larger. But I know others (@bluegenes included) prefer a little more control, e.g. --scaled 500 or --scaled 1. So we'll probably need to support both.
Possible parameter names:
--scaled_sketch
--scaled_sketch_log2
... there are probably better options
While we're at it, should there be a non-log2 option for --log2sketchsize? And maybe separated by underscores?