marbl / meryl

A genomic k-mer counter (and sequence utility) with nice features.

Use sequence to query meryl db #47

Open Adamtaranto opened 1 month ago

Adamtaranto commented 1 month ago

I want to build a database of all k-mers and their counts for a reference genome using meryl count. Then, for thousands of small (~1-5 kbp) sequences, I want to extract every k-mer and look up its count in the genome k-mer database.

Is there a way to provide a short sequence as an argument to meryl to query its kmers against an existing db?

Running meryl count on every short sequence, and cleaning up the .meryl files between queries, seems like it would be inefficient.

brianwalenz commented 1 month ago

That sounds like a job for meryl-lookup:

usage: meryl-lookup <report-type> \
         -sequence <input1.fasta> [<input2.fasta>] \
         -output   <output1>      [<output2>] \
         -mers     <input1.meryl> [<input2.meryl>] [...] [-estimate] \
         -labels   <input1name>   [<input2name>]   [...]

  Compare kmers in input sequences against kmers in input meryl databases.

  Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or
  compressed with gzip, xz, or bzip2.

  To compute and report only estimated memory usage, add option '-estimate'.

  Report types:
    Run `meryl-lookup <report-type> -help` for details on each method.

  -bed:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.
     Each kmer is reported in a separate bed record.

  -bed-runs:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.
     Overlapping kmers are combined into a single bed record.

  -wig-count:
     Generate a WIGGLE format file showing the multiplicity of the
     kmer starting at each position in the sequence, if it exists in
     an input kmer database.

  -wig-depth:
     Generate a WIGGLE format file showing the number of kmers in
     any input database that cover each position in the sequence.

  -existence:
     Generate a tab-delimited line for each input sequence with the
     number of kmers in the sequence, in the database and common to both.

  -include:
  -exclude:
     Copy sequences from 'input1.fasta' (and 'input2.fasta') to the
     corresponding output file if the sequence has at least one kmer
     present (include) or no kmers present (exclude) in 'input1.meryl'.
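
For the use case in the question, a minimal Python sketch of driving meryl-lookup could look like the following. It uses only the flags shown in the usage text above (`-wig-count`, `-sequence`, `-output`, `-mers`). The file names `queries.fasta`, `queries.wig`, and `genome.meryl` and the helper `lookup_command` are hypothetical, and it assumes that all of the short sequences are placed in one multi-FASTA file so that a single run covers every query.

```python
import shlex
import subprocess
from pathlib import Path


def lookup_command(report, sequence, output, mers):
    """Build a meryl-lookup argument list in the order shown in the usage text.

    `report` is a report type without the leading dash, e.g. "wig-count".
    """
    return [
        "meryl-lookup", f"-{report}",
        "-sequence", str(sequence),
        "-output", str(output),
        "-mers", str(mers),
    ]


# Hypothetical file names: one FASTA holding all short query sequences, and a
# genome database assumed to have been built beforehand with meryl count.
cmd = lookup_command("wig-count", "queries.fasta", "queries.wig", "genome.meryl")
print(shlex.join(cmd))

# Only invoke the tool when the inputs are actually present.
if Path("queries.fasta").exists() and Path("genome.meryl").exists():
    subprocess.run(cmd, check=True)
```

Because the reports are described as per-sequence, one invocation over a combined FASTA should avoid creating and cleaning up a separate .meryl database for each query. Swapping `"wig-count"` for `"existence"` would request the tab-delimited per-sequence summary instead.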