Closed ctb closed 3 years ago
sourmash sketch - the first argument (dna, protein, translate) tells sourmash what kind of INPUT to expect and what kind of SKETCH to create.
(the below text is now part of https://github.com/dib-lab/sourmash/pull/1283, under doc/sourmash-sketch.md)
sourmash sketch documentation
Most of the commands in sourmash work with signatures, which contain information about genomic or proteomic sequences. Each signature contains one or more sketches, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways.
To create a signature with one or more sketches, you use the sourmash sketch command. There are three main commands:
sourmash sketch dna
sourmash sketch protein
sourmash sketch translate
The sketch dna command reads in DNA sequences and outputs DNA sketches.
The sketch protein command reads in protein sequences and outputs protein sketches.
The sketch translate command reads in DNA sequences, translates them in all six frames, and outputs protein sketches.
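To make "all six frames" concrete, here is a minimal Python sketch of six-frame translation: three reading-frame offsets on the forward strand and three on the reverse complement. It uses the standard genetic code and is only an illustration; the function names are hypothetical and this is not sourmash's own implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate one reading frame; unrecognized codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return three forward-strand frames followed by three reverse-strand frames."""
    forward = [translate_frame(dna, offset) for offset in range(3)]
    rc = reverse_complement(dna)
    reverse = [translate_frame(rc, offset) for offset in range(3)]
    return forward + reverse


frames = six_frame_translation("ATGGCC")
```

Each of the six protein strings would then be broken into protein k-mers for sketching.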
To compute a DNA sketch for a genome, run:
sourmash sketch dna genome.fna
This will create an output file genome.fna.sig in the current directory, containing a single DNA signature for the entire genome, calculated using the default parameters.
Sourmash can work with unassembled reads; run
sourmash sketch dna -p k=21,k=31,k=51,abund metagenome.fq.gz
to compute three abundance-weighted sketches at k=21, 31, and 51, for the given FASTQ file.
Likewise,
sourmash sketch translate genome.fna
will output a protein sketch in ./genome.fna.sig, calculated by translating the genome sequence in all six frames and then using the default protein sketch parameters.
And
sourmash sketch protein -p k=25,scaled=500 -p k=27,scaled=250 genome.faa
outputs two protein sketches to ./genome.faa.sig, one calculated with k=25 and scaled=500, the other calculated with k=27 and scaled=250.
If you want to use different encodings, you can specify them in a few ways; here is a parameter string that specifies a dayhoff encoding for the k-mers:
sourmash sketch protein -p k=25,scaled=500,dayhoff genome.faa
sourmash sketch auto-detects and reads FASTQ or FASTA files, either uncompressed or compressed with gzip or bzip2. The filename doesn't matter; sourmash sketch will figure out the format from the file contents.
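As a rough illustration of what detecting the format from file contents can look like, the Python sketch below checks the well-known gzip and bzip2 magic bytes and then the first character of the first record. This is an assumption about the general approach, with hypothetical function names; it is not how sourmash itself is implemented.

```python
import bz2
import gzip


def decompress_if_needed(data: bytes) -> bytes:
    """Decompress gzip or bzip2 data, identified by magic bytes, else return as-is."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data)
    if data[:3] == b"BZh":  # bzip2 magic number
        return bz2.decompress(data)
    return data


def sniff_format(data: bytes) -> str:
    """Guess 'fasta' or 'fastq' from the first non-whitespace byte."""
    text = decompress_if_needed(data).lstrip()
    if text.startswith(b">"):
        return "fasta"
    if text.startswith(b"@"):
        return "fastq"
    raise ValueError("input does not look like FASTA or FASTQ")


fasta_gz = gzip.compress(b">seq1\nACGT\n")
fastq_bz2 = bz2.compress(b"@read1\nACGT\n+\nIIII\n")
detected = (sniff_format(fasta_gz), sniff_format(fastq_bz2))
```

Because the decision depends only on the bytes, a file named reads.txt or data streamed over stdin would be handled the same way.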
You can also stream any of these formats into sourmash sketch via stdin by using - as the input filename.
By default, sourmash sketch will produce one signature for each input file. If the file contains multiple FASTA/FASTQ records, these records will be merged into the output signature.
If you specify --singleton, sourmash sketch will produce a signature for each record.
If you specify --merge <name>, sourmash sketch will combine all input files into a single signature.
The output signature(s) will be saved in locations that depend on your input parameters. By default, sourmash sketch will put the signatures in the current directory, in a file named for the input file with a .sig suffix. If you specify -o, all of the signatures will be placed in that file.
sourmash sketch protein and sourmash sketch translate output protein sketches by default, but can also use the dayhoff and hp encodings. The Dayhoff encoding collapses multiple amino acids into a smaller alphabet so that amino acids that share biochemical properties map to the same character. The hp encoding divides amino acids into hydrophobic and polar (hydrophilic) amino acids, collapsing amino acids with hydrophobic side chains together and doing the same for polar amino acids.
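The two reduced alphabets can be illustrated with a small Python sketch. The groupings below are the conventional six-class Dayhoff alphabet and a commonly used hydrophobic/polar split; they are assumptions for illustration, and the exact tables sourmash uses internally may differ in detail.

```python
# Conventional Dayhoff classes: each key is the reduced-alphabet letter,
# each value lists the amino acids collapsed into it (assumed grouping).
DAYHOFF_GROUPS = {
    "a": "C",
    "b": "AGPST",
    "c": "DENQ",
    "d": "HKR",
    "e": "ILMV",
    "f": "FWY",
}

# A common hydrophobic (h) / polar (p) split (assumed grouping).
HP_GROUPS = {
    "h": "AFGILMPVWY",
    "p": "CDEHKNQRST",
}


def build_table(groups):
    """Invert a {letter: amino_acids} mapping into {amino_acid: letter}."""
    return {aa: letter for letter, members in groups.items() for aa in members}


DAYHOFF = build_table(DAYHOFF_GROUPS)
HP = build_table(HP_GROUPS)


def encode(protein, table):
    """Re-encode a protein string; residues not in the table become 'X'."""
    return "".join(table.get(aa, "X") for aa in protein.upper())


dayhoff_encoded = encode("MKVLI", DAYHOFF)
hp_encoded = encode("MKVLI", HP)
```

Because several amino acids collapse to one letter, two diverged proteins can still share encoded k-mers, which is why these encodings are paired with different k-mer sizes than the plain protein alphabet.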
We are still in the process of benchmarking these encodings; ask on the issue tracker if you are interested in updates.
The -p argument to sourmash sketch provides parameter strings to sourmash, and these control what signatures and sketches are calculated and output. Zero or more parameter strings can be given to sourmash. Each parameter string produces at least one sketch.
A parameter string is a single command-line argument made up of one or more comma-separated fields; separate parameter strings are given with separate -p options. The available fields are:
k=<ksize> - compute a sketch at this k-mer size; can be provided more than once in a parameter string. Typically ksize is between 4 and 100.
scaled=<int> - create a scaled MinHash with k-mers sampled deterministically at 1 per <scaled> value. This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. scaled is incompatible with num. See our guide to signature resolution for more information.
num=<int> - create a standard MinHash with no more than <num> k-mers kept. This will produce sketches identical to mash sketches. num is incompatible with scaled. See our guide to signature resolution for more information.
abund / noabund - create abundance-weighted (or not) sketches. See Classify signatures: Abundance Weighting for details of how this works.
dna, protein, dayhoff, hp - create this kind of sketch. Note that sourmash sketch dna -p protein and sourmash sketch protein -p dna are invalid; please use sourmash sketch translate for the former.
For all field names but k, if a field is provided more than once in a parameter string, the last one encountered overrides the previous values. For k, if multiple ksizes are specified in a single parameter string, sketches for all ksizes specified are computed.
If a field isn't specified, then the default value for that sketch type is used; so, for example, sourmash sketch dna -p abund would calculate a sketch with k=31,scaled=1000,abund. See below for the defaults.
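The field rules above (k accumulates, other fields are last-one-wins, scaled and num are mutually exclusive, and unspecified fields fall back to defaults) can be modeled with a short Python sketch. This is an illustrative model with a hypothetical function name, not sourmash's own parser; the default values are the ones listed in this document, and the moltype label attached to each is an assumption.

```python
# Illustrative model only; this is not sourmash's own parser.
# Default k and scaled per sketch type (moltype labels are an assumption).
DEFAULTS = {
    "dna": {"k": 31, "scaled": 1000},
    "protein": {"k": 10, "scaled": 200},
    "dayhoff": {"k": 16, "scaled": 200},
    "hp": {"k": 42, "scaled": 200},
}


def parse_param_string(param_string, default_moltype="dna"):
    """Parse one comma-separated parameter string into a settings dict."""
    ksizes = []
    scaled = None
    num = None
    abund = False
    moltype = default_moltype
    for field in param_string.split(","):
        field = field.strip()
        if not field:
            continue
        if field.startswith("k="):
            ksizes.append(int(field[2:]))  # every k is kept
        elif field.startswith("scaled="):
            scaled = int(field[len("scaled="):])  # last value wins
        elif field.startswith("num="):
            num = int(field[len("num="):])  # last value wins
        elif field in ("abund", "noabund"):
            abund = field == "abund"  # last value wins
        elif field in DEFAULTS:
            moltype = field  # last value wins
        else:
            raise ValueError(f"unknown field: {field!r}")
    if scaled is not None and num is not None:
        raise ValueError("scaled and num are incompatible")
    defaults = DEFAULTS[moltype]
    if scaled is None and num is None:
        scaled = defaults["scaled"]
    return {
        "moltype": moltype,
        "ksizes": ksizes or [defaults["k"]],
        "scaled": scaled if scaled is not None else 0,
        "num": num if num is not None else 0,
        "abund": abund,
    }


example = parse_param_string("abund")
```

In this model, the string "abund" on its own resolves to one DNA sketch with k=31, scaled=1000, and abundance tracking, matching the example in the text.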
The default parameters for sketches are as follows:
dna: k=31,scaled=1000,noabund
protein: k=10,scaled=200,noabund
dayhoff: k=16,scaled=200,noabund
hp: k=42,scaled=200,noabund
These were chosen by a committee of PhDs as being good defaults for an initial analysis, so, beware :).
More seriously, the DNA parameters were chosen based on the analyses done by Koslicki and Falush in MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation.
The protein, dayhoff, and hp parameters were selected based on unpublished research results and/or magic formulas. We are working on publishing the results! Please ask on the issue tracker if you are curious.
Below are some more complicated sourmash sketch command lines:
sourmash sketch dna -p k=51 - default to a scaled=1000 and noabund for a k-mer size of 51 (based on moltype/command)
sourmash sketch dna -p k=31,k=51,k=21 - compute multiple ksizes, using the defaults otherwise
sourmash sketch translate -p k=20,num=500,protein -p k=19,num=400,dayhoff,abund -p k=30,scaled=200,hp - compute multiple ksizes, moltypes, and scaled/num.
Signature files can contain multiple signatures and sketches. Use sourmash sig describe to get details on the contents of a file.
You can use -o <filename> to specify a file output location for all the output signatures; -o - means stdout. This does not merge signatures unless --merge is provided.
Specify --outdir to put all the signatures in a specific directory.
Calculating signatures is probably the most time-consuming part of using sourmash, and it is the only part that requires access to the raw data. Moreover, the output signatures are generally much smaller than the input data. So, we generally suggest calculating a large set of signatures once.
To support this, sourmash can do two kinds of signature conversion without going back to the raw data.
First, you can downsample num and scaled signatures using sourmash sig downsample. For any sketch calculated with the num parameter, you can decrease that num. And, for any sketch calculated with the scaled parameter, you can increase the scaled. This will decrease the size of the sketch accordingly; for example, going from a num of 5000 to a num of 1000 will decrease the sketch size by a factor of 5, and going from a scaled of 1000 to a scaled of 10000 will decrease the sketch size by a factor of 10.
(Note that decreasing num or increasing scaled will increase calculation speed and lower the accuracy of your results.)
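The reason downsampling works without the raw data can be shown with a small Python model. It assumes the usual construction: a scaled sketch keeps every 64-bit hash below roughly 2^64 / scaled, and a num sketch keeps the num smallest hashes. The function names are hypothetical, random integers stand in for real k-mer hashes, and sourmash's exact threshold rounding may differ.

```python
import random

HASH_SPACE = 2**64  # size of the 64-bit hash space assumed here


def max_hash_for_scaled(scaled):
    """Threshold below which hashes are kept for a given scaled value."""
    return HASH_SPACE // scaled


def downsample_scaled(hashes, new_scaled):
    """Keep only the hashes that a sketch at new_scaled would have kept."""
    limit = max_hash_for_scaled(new_scaled)
    return {h for h in hashes if h < limit}


def downsample_num(hashes, new_num):
    """Keep only the new_num smallest hashes."""
    return set(sorted(hashes)[:new_num])


def expected_scaled_size(n_distinct_kmers, scaled):
    """Approximate number of hashes retained at a given scaled value."""
    return n_distinct_kmers // scaled


# Stand-in for the hashes of every distinct k-mer in some input.
rng = random.Random(42)
all_hashes = {rng.randrange(HASH_SPACE) for _ in range(200_000)}

sketch_1000 = downsample_scaled(all_hashes, 1000)
sketch_10000 = downsample_scaled(sketch_1000, 10000)
smallest_1000 = downsample_num(all_hashes, 1000)
```

In this model, downsampling the scaled=1000 sketch to scaled=10000 gives the same set as sketching the original hashes directly at scaled=10000. Going the other way, to a smaller scaled or a larger num, is not possible because the discarded hashes are gone.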
Second, you can flatten abundances using sourmash sig flatten. For any sketch calculated with abund, you can convert it to a noabund sketch. This will decrease the sketch size, although not necessarily by a lot.
Unfortunately, changing the k-mer size or using different DNA/protein encodings cannot be done on a sketch, and you need to calculate new signatures from the raw data for that.
sourmash sketch
You can use sourmash sig describe to get detailed information about the contents of a signature file. This can help if you want to see exactly what a particular sourmash sketch command does!
We try to provide good documentation and error messages, but may not succeed in answering all your questions! So we're happy to help out!
Please post questions on the sourmash issue tracker. If you find something confusing or buggy about the documentation or about sourmash, we'd love to fix it -- for you and for everyone else!
per #1159