Add --scaled option for sketches

olgabot commented 4 years ago

Right now there's only a "flat rate" option that sketches every sample with the same fixed number of hashes, but there should be an option to use --scaled, which keeps e.g. 1 in every 1000 hashes instead of the same number for all samples. This accounts for sequencing depth.

Personally, I like using log2 scaling because I think life is log2 scaled and it's much easier to increase to a "natural" amount that's larger. But I know others (@bluegenes included) prefer a little more control, e.g. --scaled 500 or --scaled 1. So we'll probably need to support both.

Possible parameter names:

--scaled_sketch
--scaled_sketch_log2

... there are probably better options

While we're at it, should there be a non-log2 option for --log2sketchsize? And maybe separated by underscores?

olgabot commented 4 years ago

More reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6720031/

example:

# calculate the signature
sourmash compute -k 21,31,51 \
--scaled 2000 \
--track-abundance \
-o GCF_000005845.2_ASM584v2_genomic.sig \
GCF_000005845.2_ASM584v2_genomic.fna.gz

olgabot commented 4 years ago

Relatedly, it would be great to get this added as well, since with --scaled, the number of hashes will change for every sample since they all have different sequencing depths: https://github.com/nf-core/kmermaid/issues/14

pranathivemuri commented 4 years ago

So, when the flag is scaled_sketch_log2, do we have to do a calculation before running sourmash compute to get that value, since --scaled only accepts integers?

olgabot commented 4 years ago

I think it can be like here:

https://github.com/nf-core/kmermaid/blob/4094711d73b19b978ae4ce37e8b339c178079bc7/main.nf#L887

Where the value is fed into bash to do the exponentiation so we don't have to do it ourselves. Or, I'm sure there's probably a way to do it with Nextflow/groovy, I just don't know it.
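For illustration, a minimal sketch of doing that conversion in the shell. The function name scaled_from_log2 is hypothetical and not taken from the pipeline. It uses a left shift so it also runs under plain POSIX sh; in bash, $(( 2 ** n )) should give the same result.

```shell
#!/bin/sh
# Hypothetical helper (not from kmermaid): turn a log2 exponent into the
# integer value that --scaled expects.
scaled_from_log2() {
  # Shifting 1 left by n bits computes 2^n using integer arithmetic only.
  echo $(( 1 << $1 ))
}

# Example: a log2 value of 10 becomes a scaled value of 1024.
scaled=$(scaled_from_log2 10)
echo "--scaled ${scaled}"
```

The resulting integer could then be substituted into the sourmash compute call in place of a hand-written --scaled value.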

pranathivemuri commented 4 years ago

solved in #81

nf-core / kmermaid

Add --scaled option for sketches #78