tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0

kmtricks fails when only one path by sample is provided #5

Closed sam217pa closed 3 years ago

sam217pa commented 3 years ago

Hi,

I was tempted into trying kmtricks after the nice talk of @pierrepeterlongo at DSB2021.

My use case is the following. I'm trying to find patterns of shared k-mers in a fairly large genomic project: 204 GB of uncompressed FASTA files corresponding to ~480 scaffolded assemblies of butterflies, wasps and flies, about 199 Gbp in total. I don't know the number of unique k-mers yet, but it is probably smaller than for the whole Tara Oceans project, so I guess kmtricks can do the job ;)

I have only one file per specimen, so my file of files looks like:

# fof.txt
sample1 : sample1.fna ! 1
sample2 : sample2.fna ! 1

This triggers the following error:

Traceback (most recent call last):
  File "kmtricks.py", line 1080, in <module>
    main()
  File "kmtricks.py", line 1072, in main
    pool.exec()
  File "kmtricks.py", line 866, in exec
    self.run_ready()
  File "kmtricks.py", line 929, in run_ready
    cmd.run()
  File "kmtricks.py", line 353, in run
    self.preprocess()
  File "kmtricks.py", line 586, in preprocess
    raise FileExistsError(f'{repart_file} doesn\'t exists.')
FileExistsError: kmdir/storage/partition_storage_gatb/minimRepart.minimRepart doesn't exists.

(I tried with both conda installed kmtricks and compiled from source.)

It works if I trick it into parsing the same file twice:

# fof.txt
sample1 : sample1.fna ; sample1.fna ! 1
sample2 : sample2.fna ; sample2.fna ! 1

The command I ran was taken from your benchmarks here:

set -euo pipefail
rm -rf kmdir

kmtricks.py --verbose --debug run \
           --file fof.txt \
           --run-dir kmdir \
           --kmer-size 20 \
           --nb-cores 8 \
           --nb-partitions 1 \
           --count-abundance-min 1 \
           --recurrence-min 1 \
           --mode bf_trp \
           --hasher sabuhash \
           --max-hash 1000000 \
           --split howde \
           --lz4 \
           --max-count 256 \
           --max-memory 8000 \
           --log-files repart,superk,count,merge,split

Thanks for kmtricks anyway, it looks promising!

tlemane commented 3 years ago

Hi,

Thank you for trying kmtricks.

It seems that the minimizer repartition file is missing. I can't reproduce this bug using the same command line on small inputs. After a conda installation, are you able to run example1.sh or example2.sh in tests/kmtricks?

Unrelated to the bug, a few words about the parameters: --mode bf_trp triggers hash counting mode in order to build Bloom filters. If I understood correctly, you need a k-mer matrix instead? If so, I suggest something like this:

kmtricks.py run \
           --file fof.txt \
           --run-dir kmdir \
           --kmer-size 20 \
           --nb-cores 8 \
           --count-abundance-min 1 \
           --recurrence-min 1 \
           --mode ascii \
           --lz4 \
           --log-files repart,superk,count,merge,split

sam217pa commented 3 years ago

Thanks, that was quick 👍

Thank you for pointing me to your test files; I should have run them beforehand. I actually managed to get it running. It turns out the culprit was my file names: I replaced them with "sample1.fna" above for clarity, but they actually contain - characters (05-SRNP-56838: 05-SRNP-56838.fna).

Changing - to _ fixed it.
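For anyone stuck on a version without the fix, the renaming workaround can be scripted. This is only a rough sketch: `sanitize_fof_line` is a hypothetical helper, not part of kmtricks, and it assumes the `id : file ; file ! min` layout shown above. It replaces `-` everywhere in each path string, so directory names would be affected too.

```python
def sanitize_fof_line(line):
    """Replace '-' with '_' in the sample ID and file paths of one fof line.

    Returns (new_line, renames), where renames maps each original path to
    its sanitized path so the files (or symlinks) can be renamed to match.
    Hypothetical helper: not part of kmtricks.
    """
    stripped = line.strip()
    # Leave blank lines and '#' comment lines untouched.
    if not stripped or stripped.startswith("#"):
        return line, {}
    sample, colon, rest = stripped.partition(":")
    if not colon:
        return line, {}
    files, bang, min_ab = rest.partition("!")
    renames = {}
    new_paths = []
    for path in (p.strip() for p in files.split(";")):
        if not path:
            continue
        new = path.replace("-", "_")
        if new != path:
            renames[path] = new
        new_paths.append(new)
    new_line = f"{sample.strip().replace('-', '_')} : {' ; '.join(new_paths)}"
    if bang:
        # Keep the optional per-sample minimum abundance.
        new_line += f" ! {min_ab.strip()}"
    return new_line, renames
```

The returned `renames` mapping can then be used to rename or symlink the actual files before running kmtricks on the rewritten fof.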

My guess is that the second group in the regex: https://github.com/tlemane/kmtricks/blob/021bf9eb3270d100c504a53d818bd607d0750e1b/kmtricks.py#L772 could include \-?

I'll let you know how my kmtricks venture goes ;)

EDIT: changing the regex does fix the problem.
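To illustrate the kind of failure I mean, here is a minimal sketch. Both patterns are hypothetical and written only for this example; they are not copied from kmtricks.py, and the real regex may differ. The only point is that a character class without `-` silently rejects hyphenated names, while adding `\-` accepts them.

```python
import re

# Hypothetical patterns for one fof line: "<id> : <file> [; <file> ...] [! <min>]".
# Illustrative only: these are NOT the patterns used in kmtricks.py.
STRICT = re.compile(r"^\s*([\w.]+)\s*:\s*([\w./;\s]+?)\s*(?:!\s*(\d+))?\s*$")
RELAXED = re.compile(r"^\s*([\w.\-]+)\s*:\s*([\w.\-/;\s]+?)\s*(?:!\s*(\d+))?\s*$")


def parse(line, pattern):
    """Return (sample_id, [paths], min_abundance or None), or None if no match."""
    m = pattern.match(line)
    if m is None:
        return None
    sample, files, min_ab = m.groups()
    paths = [p.strip() for p in files.split(";") if p.strip()]
    return sample, paths, (int(min_ab) if min_ab else None)


hyphenated = "05-SRNP-56838: 05-SRNP-56838.fna ! 1"
print(parse(hyphenated, STRICT))   # no match, because '-' is not in the class
print(parse(hyphenated, RELAXED))  # matches once '\-' is allowed
```

With a pattern like `STRICT`, a line with hyphenated names is simply not recognised, which would explain a missing input downstream rather than an explicit parse error.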

tlemane commented 3 years ago

Thanks for the fix, your PR is now merged and will be included in the next release. Feel free to reopen the issue if you encounter any problems. Feedback is highly appreciated :+1: