sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
464 stars 78 forks source link

If there is a command or way to annotate individual contigs without creating many files? #2592

Open yuzie0314 opened 1 year ago

yuzie0314 commented 1 year ago

Hi SourMash authors,

We recently used your tool to annotate our reads, assembled contigs, and assembled genomes, and found that if we pool all contigs into one fasta, we cannot see the annotate results from each contig.

The following is the commands we was using under nextflow environment.

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund ${contigs} -o ${id}_fa.sig.gz --name ${id}_fa

sourmash gather ${contigs_sig} ${gtdbtk_db} -o ${id}_fa.x.gtdb.csv --save-matches ${id}_fa_matches.zip

sourmash tax metagenome \\
    --gather-csv ${gtdb_matches} \\
    --taxonomy ${gtdbtk_list} \\
    --output-format krona \\
    --rank species > ${id}_fa.krona

    sed -i -e '/d__/d;/p__/d;/c__/d;/o__/d;/f__/d;/g__/d;/s__/d' ${meta}_${params.mode}.krona

thanks in advance, Yuzie

ctb commented 1 year ago

hi @yuzie0314,

the short answer is that you can add --singleton to the sketch dna command and it will create a single file containing many sketches. If you use -o ${contigs}.zip you'll also have a file that is more convenient for sourmash to load in a few ways (same functionality, just might be faster).

However, you'll also need to change gather to multigather and use --query ${contigs_file} and --db ${gtdbtk_db}, because gather only works on one input sketch.

You will get many different output files, I'm afraid.

I hope this helps - we don't have a super convenient way to do exactly what you want done, unfortunately!

ctb commented 1 year ago

ref https://github.com/sourmash-bio/sourmash/issues/2564

yuzie0314 commented 10 months ago

Hi there @ctb the sourmash author, Sorry for the late response. Our team tried to test on your commands so created the following snippet of codes, but the results are not we expected.

# generate signature for each contig
sourmash sketch dna \
    -p k=21,k=31,k=51,scaled=1000,abund \
    ${contig} \
    -o ${id}_fa.sig.zip \
    --name ${id}_fa \
    --singleton

# use multigather to see the results from sourmash
mkdir -p ${id}_test_contig
sourmash multigather \
    --query ${id}_fa.sig.zip \
    --ksize 31 \
    --db ${gtdbtk_db} \
    --output-dir ${id}_test_contig \
    --estimate-ani-ci \
    --dna

Actually, we expected that if there is an efficient to annotate sequence at contig level. Namely, contig_1 : bac1, contig_2 : bac2, and their corresponding abundance evaluation or information like the output csv file from gather or multigather commands. We don't want to produce one fasta for one contig from a genome, and run your sketch and gather commands, which might take very long time when scanning reference database, any ideas??

thanks in advance, Yuzie

ctb commented 10 months ago

NOTE: I posted a recipe here - https://github.com/sourmash-bio/sourmash/issues/2816#issuecomment-1770755051 - will respond to above question soon, I hope.