sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0
15 stars 2 forks source link

fastgather scaled size does not use --scaled but scale of the signature #441

Open AnneliektH opened 2 months ago

AnneliektH commented 2 months ago

Hey, I am running into some differences when running gather v fastgather, followed by tax metagenome. I suspect the issue is that fastgather is not using the --scaled parameter to do the calculations, but instead using the scale of the signature it is running on.

All was run in /group/ctbrowngrp2/scratch/annie/2024-pigparadigm/results/sourmash The comparisons were ran on the same metagenome, against gtdb-214-reps, with a scale of 1000 and a k 21.

# gather
sourmash gather ./sketches/read_s100/ERR3211750.sig.gz \
/group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip \
-k 21 --scaled 1000 -o ERR3211750.gather.csv

sourmash tax metagenome -g ERR3211750.gather.csv \
-t /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.lineages.csv \
-o ERR3211750.tax.g.csv

# fastgather
sourmash scripts fastgather \
./sketches/read_s100/ERR3211750.sig.gz \
/group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip
-k 21 --scaled 1000 -c 24 -o ERR3211750.fg.csv

sourmash tax metagenome -g ERR3211750.fg.csv \
-t /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.lineages.csv \
-o ERR3211750.tax.fg.csv

Gather/fastgather outputs are almost identical, except for the scaled value and thus also the total_weighted_hashes, which is 10x larger in the fastgather output

out:

# gather
intersect_bp    f_orig_query    f_match f_unique_to_query   f_unique_weighted   average_abund   median_abund    std_abund   filename    name    md5 f_match_orig    unique_intersect_bp gather_result_rank  remaining_bp    query_filename  query_name  query_md5   query_bp    ksize   moltype scaled  query_n_hashes  query_abundance query_containment_ani   match_containment_ani   average_containment_ani max_containment_ani potential_false_negative    n_unique_weighted_found sum_weighted_found  total_weighted_hashes
2978000 0.004141869 0.927436936 0.004141869 0.027999557 16.03324379 12  50.99034564 /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k21.zip    GCF_000834015.1 Prevotella pectinovora strain=P4-76, ASM83401v1 1afcad67735bd1d8888bcb7af73eda3c    0.927436936 2978000 0   716021000   ../atlas/atlas_ERR3211750/ERR3211750/sequence_quality_control/ERR3211750_QC_R1.fastq.gz ERR3211750  c96db165    718999000   21  DNA 1000    718999  TRUE    0.770075244 0.99641926  0.883247252 0.99641926  FALSE   47747   47747   1705277

# fastgather
intersect_bp    f_orig_query    f_match f_unique_to_query   f_unique_weighted   average_abund   median_abund    std_abund   filename    name    md5 f_match_orig    unique_intersect_bp gather_result_rank  remaining_bp    query_filename  query_name  query_md5   query_bp    ksize   moltype scaled  query_n_hashes  query_abundance query_containment_ani   match_containment_ani   average_containment_ani max_containment_ani potential_false_negative    n_unique_weighted_found sum_weighted_found  total_weighted_hashes
2978000 0.004141869 0.927436936 0.004141869 0.002789549 16.03324379 12  50.99034564 signatures/1afcad67735bd1d8888bcb7af73eda3c.sig.gz  GCF_000834015.1 Prevotella pectinovora strain=P4-76, ASM83401v1 1afcad67735bd1d8888bcb7af73eda3c    0.927436936 2978000 0   716021000   ../atlas/atlas_ERR3211750/ERR3211750/sequence_quality_control/ERR3211750_QC_R1.fastq.gz ERR3211750  752364cae42925c75cfb235c1f543ef7    719175300   21  DNA 100 7191753 TRUE    0.770075244 0.99641926  0.883247252 0.99641926  47747   47747   17116384        

Anyway, after running tax metagenome, there is a 10x difference at f_weigthed_at_rank. I think the one from gather is the correct one. I'll multiply the fastgather f_weigthed_at_rank by 10 for now, but I thought that maybe this is something that can be fixed:

# gather
query_name  rank    fraction    lineage query_md5   query_filename  f_weighted_at_rank  bp_match_at_rank    query_ani_at_rank   total_weighted_hashes
ERR3211750  superkingdom    0.323352327 d__Bacteria c96db165    ../atlas/atlas_ERR3211750/ERR3211750/sequence_quality_control/ERR3211750_QC_R1.fastq.gz 0.507099433 232490000   0.947657137 1705277

# fastgather
query_name  rank    fraction    lineage query_md5   query_filename  f_weighted_at_rank  bp_match_at_rank    query_ani_at_rank   total_weighted_hashes
ERR3211750  superkingdom    0.323352327 d__Bacteria 752364cae42925c75cfb235c1f543ef7    ../atlas/atlas_ERR3211750/ERR3211750/sequence_quality_control/ERR3211750_QC_R1.fastq.gz 0.050521477 232490000   0.947657137 17116384
ctb commented 2 months ago

thanks for reporting! I'll look into it when I get a chance :)

ctb commented 3 weeks ago

side note - please try to put these in the branchwater plugin repo instead of the branchwater repo, please! I only found this because I keep these in my inbox until I get a chance to look at them :)

ctb commented 3 weeks ago

(you can also just put 'em in the sourmash issue tracker. doesn't need to be the branchwater plugin repo.)