liaoherui / StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers
https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01615-w
MIT License
32 stars 4 forks source link

Bugs with database construction, strain scanning #1

Closed jdwinkler-lanzatech closed 1 year ago

jdwinkler-lanzatech commented 1 year ago

Hi,

Thank you for your work on this tool, it looks very interesting. I have been trying it out and noticed a few problems:

  1. When generating a database of very similar strains (differing only by SNPs) with a .fasta extension, I get this error:
Traceback (most recent call last):
  File "/opt/conda/envs/strainscan/bin/strainscan_build", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
    Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
  File "/opt/conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 266, in build_tree
    cls_dist, mapping, tree, depths, depths_mapping = hierarchy(fna_mapping, dist)
  File "/opt/conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 71, in hierarchy
    for i in tree_relationship[parent]:
KeyError: 1
  1. It looks like StrainScan only supports uncompressed reads because of JellyFish not being able to natively handle .gz files. Based on the following lines of code:

    os.system(dir_jf + " count -m 31 -s 100M -t 8 --if " + db_dir+"/kmer.fa -o temp_"+uid+".jf " + fq_path)
    os.system(dir_jf+" dump -c temp_"+uid+".jf > temp_"+uid+".fa")
    os.system("rm temp_"+uid+".jf")

It looks like you could just check if the input data has a ".gz" extension, then zcat the data into jellyfish to get the kmer counts. If that is too bothersome, an CLI option to provide the kmer counts would also work perfectly.

Thanks for any feedback you might have!

liaoherui commented 1 year ago

Hi, sorry for the late reply.

  1. I have fixed the first problem. You can try the newest version of StrainScan on Github.

  2. We are still working on the second suggestion and will upload a new version to support the compressed reads.

BTW, when strains are highly similar (only several SNPs), you may need to consider optimizing the parameter (e.g "-s") to achieve ideal performance.

jdwinkler-lanzatech commented 1 year ago

Thank you!