ratschlab / metagraph

Scalable annotated de Bruijn graphs for DNA indexing, alignment, and assembly
http://metagraph.ethz.ch
GNU General Public License v3.0

Using cuttlefish for preprocessing #444

Closed shokrof closed 1 year ago

shokrof commented 1 year ago

Hey, I have been using your great software for almost a year now. I used it to index the k-mers and their counts from 1000 human WGS samples, and I am also going to use it to index cattle WGS samples. Following the advice of @karasikov, I made a Snakemake script of the tutorial (https://github.com/ratschlab/counting_dbg/blob/master/scripts.md#prepare-and-build-graph).

Snippet from the tutorial:

bsub -J "build_single_${WINDOW_SIZE}[1-2652]%500" \
     -w "filter*" \
     -o $DIR/logs/build_single.lsf \
     -W 4:00 \
     -n 1 -R "rusage[mem=20000] span[hosts=1] select[model==XeonGold_6140]" \
        "id=\\\$(sed -n \${LSB_JOBINDEX}p $DIR/../kingsford.txt); \
        file=$DIR/../kmc_21_filtered/\\\${id}.kmc_suf; \
        /usr/bin/time -v $METAGRAPH build -v \
            -k 21 \
            --mode canonical \
            --count-kmers --count-width 32 \
            --mem-cap-gb 8 \
            --disk-swap ~/metagenome/scratch/nobackup/stripe_1 \
            -p 2 \
            -o $DIR/unitigs/\\\$(basename \\\${file%.kmc_suf}) \
            \\\$file; \
        /usr/bin/time -v $METAGRAPH clean -v \
            --to-fasta --primary-kmers \
            --smoothing-window ${WINDOW_SIZE} \
            -p 2 \
            -o $DIR/unitigs/\\\$(basename \\\${file%.kmc_suf}) \
            $DIR/unitigs/\\\$(basename \\\${file%.kmc_suf}).dbg; \
        rm $DIR/unitigs/\\\$(basename \\\${file%.kmc_suf}).dbg*"

I have a question: I am thinking of using cuttlefish2 to build the individual graphs and convert them to unitigs.fasta and k-mer counts. cuttlefish2 is extremely fast and will help me scale my workflows. I want to make sure that this won't affect the correctness of the resulting Metagraph index in any way.

Thanks, Moustafa

karasikov commented 1 year ago

Hey Moustafa!

Great to hear from you, hope you're doing great!

Metagraph indexes input sequences (with or without counts), so the result will be correct as long as the input is correct.

Another question is in what format you're going to pass the input to Metagraph. Right now, it only supports fasta/fastq files and KMC counters. I guess cuttlefish can convert to fasta, but how does it store the counts? Or, can it convert to KMC?

shokrof commented 1 year ago

Hey Mikhail, I am planning to generate a FASTA file and k-mer counts in your custom format (kmer_counts.gz). My plan is to have cuttlefish compute the unitigs (FASTA) and run KMC to get the k-mer counts. Then I will write a custom script to convert the KMC counts to your k-mer count format (k-mer counts in the order the k-mers appear in the FASTA file).
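The conversion step described above could look roughly like the sketch below. It is only an illustration under several assumptions that are not confirmed in this thread: the KMC counts have already been dumped to a text file of `KMER<TAB>COUNT` lines holding canonical k-mers, the unitig FASTA has one sequence per line, and the output layout (a header line followed by space-separated counts) is a guess that should be checked against a kmer_counts.gz file written by Metagraph itself. The function name `counts_in_fasta_order` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print per-k-mer counts in the order the k-mers occur in a unitig FASTA.
# Assumptions: counts.tsv has "KMER<TAB>COUNT" lines with canonical k-mers,
# the FASTA is unwrapped, and the output layout is hypothetical.
counts_in_fasta_order() {
    local k="$1" counts_tsv="$2" fasta="$3"
    awk -v k="$k" '
        BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
        # Reverse complement of an upper-case DNA string (unknown bases become N).
        function revcomp(s,    i, c, rc, out) {
            out = ""
            for (i = length(s); i >= 1; i--) {
                c = substr(s, i, 1)
                rc = (c in comp) ? comp[c] : "N"
                out = out rc
            }
            return out
        }
        # Canonical form: the smaller of a k-mer and its reverse complement.
        function canonical(s,    r) {
            r = revcomp(s)
            return (s < r) ? s : r
        }
        # First file: load the count table.
        NR == FNR { count[$1] = $2; next }
        # Second file: copy FASTA headers through unchanged.
        /^>/ { print; next }
        # Sequence line: look up every k-mer, left to right; missing k-mers get 0.
        {
            seq = toupper($0)
            line = ""
            for (i = 1; i + k - 1 <= length(seq); i++) {
                kmer = canonical(substr(seq, i, k))
                c = (kmer in count) ? count[kmer] : 0
                line = (i == 1) ? c : line " " c
            }
            print line
        }
    ' "$counts_tsv" "$fasta"
}
```

A call such as `counts_in_fasta_order 21 counts.tsv unitigs.fasta | gzip > unitigs.kmer_counts.gz` would then produce the compressed file. Looking k-mers up in canonical form matters here because the same k-mer can appear in a unitig in either orientation.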

best, Moustafa

karasikov commented 1 year ago

Sounds good. Though, if you have KMC counts, you can easily convert that to Metagraph's contigs by building a weighted graph from KMC counts (see https://metagraph.ethz.ch/static/docs/quick_start.html#construct-weighted-graph) and pulling unitigs (metagraph clean ..., see https://metagraph.ethz.ch/static/docs/quick_start.html#transform-to-sequences).
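For a single sample on a local machine, the two steps suggested here might be sketched as follows. The sketch reuses only flags that appear in the tutorial snippet quoted earlier in this thread (including k = 21 and a 32-bit count width); the `METAGRAPH` variable and the function name `kmc_to_unitigs` are placeholders, and the linked quick start pages remain the authoritative reference for the options.

```shell
#!/usr/bin/env bash
# Sketch: KMC counts -> weighted graph -> unitigs, for one sample.
# METAGRAPH and kmc_to_unitigs are placeholder names; the flags are copied
# from the tutorial snippet quoted earlier in this thread.
METAGRAPH="${METAGRAPH:-metagraph}"

kmc_to_unitigs() {
    local kmc_file="$1" out_prefix="$2"

    # Step 1: build a canonical weighted graph (k-mer counts kept) from KMC.
    "$METAGRAPH" build -v \
        -k 21 \
        --mode canonical \
        --count-kmers --count-width 32 \
        -o "$out_prefix" \
        "$kmc_file" || return 1

    # Step 2: pull the unitigs out of the graph as FASTA.
    "$METAGRAPH" clean -v \
        --to-fasta --primary-kmers \
        -o "$out_prefix" \
        "${out_prefix}.dbg" || return 1
}
```

Usage would be along the lines of `kmc_to_unitigs sample.kmc_suf unitigs/sample`. The intermediate `.dbg` file can be removed afterwards, as the tutorial snippet does.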