marbl / meryl

A genomic k-mer counter (and sequence utility) with nice features.

database file parameters used in count output are too large #30

Open brianwalenz opened 1 year ago

brianwalenz commented 1 year ago

The prefixSize used when writing count output is too large when the inputs are also large.

https://github.com/marbl/meryl/blob/master/src/meryl/merylOp-countThreads.C#L404

That line sets the output prefix size from the 'optimal' prefix size used for counting. It works fine for moderate k-mer sizes (e.g., 22), but for larger ones (e.g., 28) the database chunks are too big for merging.

Example:

prefix     # of   struct   kmers/    segs/      min     data    total
  bits   prefix   memory   prefix   prefix   memory   memory   memory
------  -------  -------  -------  -------  -------  -------  -------
    14    16 kP    66 MB    98 kM   130  S    64 MB  8320 MB  8386 MB
    15    32 kP   117 MB    49 kM    64  S   128 MB  8192 MB  8309 MB
    16    64 kP   217 MB    24 kM    31  S   256 MB  7936 MB  8153 MB  Best Value!
    17   128 kP   420 MB    12 kM    16  S   512 MB  8192 MB  8612 MB
    18   256 kP   824 MB  6314  M     8  S  1024 MB  8192 MB  9016 MB
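
For reference, the columns of this table appear to be related by simple arithmetic: data memory looks like segs/prefix times min memory, total memory looks like struct memory plus data memory, and min memory matches 4 KB per prefix. These relations are inferred from the numbers shown, not taken from the meryl source, so treat the sketch below as a consistency check under that assumption. The function names are made up for illustration.

```python
# Rows copied from the table above:
# (prefix bits, struct memory MB, segs/prefix, min memory MB, total memory MB)
ROWS = [
    (14,  66, 130,   64, 8386),
    (15, 117,  64,  128, 8309),
    (16, 217,  31,  256, 8153),
    (17, 420,  16,  512, 8612),
    (18, 824,   8, 1024, 9016),
]


def min_mb(prefix_bits):
    # Assumption inferred from the table: 4 KB per prefix, reported in MB.
    return (1 << prefix_bits) * 4 // 1024


def data_mb(segs_per_prefix, min_memory_mb):
    # Assumption inferred from the table: data = segs/prefix * min memory.
    return segs_per_prefix * min_memory_mb


def total_mb(struct_mb, segs_per_prefix, min_memory_mb):
    # Assumption inferred from the table: total = struct + data.
    return struct_mb + data_mb(segs_per_prefix, min_memory_mb)


def best_prefix_bits(rows):
    # The row flagged "Best Value!" is the one with the smallest total memory.
    return min(rows, key=lambda r: total_mb(r[1], r[2], r[3]))[0]


if __name__ == "__main__":
    for bits, struct, segs, min_memory, total in ROWS:
        print(bits, data_mb(segs, min_memory), total_mb(struct, segs, min_memory))
    print("best prefix bits:", best_prefix_bits(ROWS))
```

Under these assumptions the recomputed totals match the table, and the minimum falls at 16 prefix bits, which is the prefixSize that then shows up in the count output.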

> meryl dumpIndex 001.meryl
Opened '001.meryl'.
  magic          0x646e496c7972656d33302e765f5f7865 'merylIndex__v.03'
  prefixSize     16
  suffixSize     40
  numFilesBits   6 (64 files)
  numBlocksBits  10 (1024 blocks)

But after merging, the prefix size is more reasonable (though this is, iirc, a fixed hardcoded size). Merging seems to want to use around 1 GB per input database; not sure why.

> meryl dumpIndex 00x.meryl/
Opened '00x.meryl/'.
  magic          0x646e496c7972656d33302e765f5f7865 'merylIndex__v.03'
  prefixSize     12
  suffixSize     44
  numFilesBits   6 (64 files)
  numBlocksBits  6 (64 blocks)
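
To compare the two index dumps side by side, here is a small sketch of the arithmetic behind the fields. It assumes k=28 (the "larger" example above) and 2 bits per base, which fits both dumps since prefixSize + suffixSize is 56 in each. It also assumes the block count is per file. In both dumps numFilesBits + numBlocksBits happens to equal prefixSize; that is an observation from these two outputs, not a documented rule. `describe_index` is a hypothetical helper, not part of meryl.

```python
def describe_index(k, prefix_size, suffix_size, num_files_bits, num_blocks_bits):
    """Derive sizes from the fields printed by 'meryl dumpIndex'.

    Assumes 2 bits per base and that numBlocksBits counts blocks per file.
    """
    return {
        "kmer_bits": 2 * k,
        "prefix_plus_suffix": prefix_size + suffix_size,
        "num_prefixes": 1 << prefix_size,
        "num_files": 1 << num_files_bits,
        "blocks_per_file": 1 << num_blocks_bits,
        "total_blocks": 1 << (num_files_bits + num_blocks_bits),
    }


if __name__ == "__main__":
    # Field values copied from the two dumps above; k=28 is an assumption.
    count_output = describe_index(28, 16, 40, 6, 10)
    merged_output = describe_index(28, 12, 44, 6, 6)

    for name, info in (("count", count_output), ("merged", merged_output)):
        print(name, info)

    # Ratio of distinct prefixes between the two databases.
    print(count_output["num_prefixes"] // merged_output["num_prefixes"])
```

With these inputs the count output has 65536 prefixes against 4096 after merging, so each count database is split 16 times more finely than the merged one.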