liaoherui / StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers
https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01615-w
MIT License

strainscan_build failing due to hclsMap_95.txt file having extra lines #24

Open kheber opened 3 weeks ago

kheber commented 3 weeks ago

While creating a custom database with strainscan_build version 1.0.14 from bioconda, I get the following error:

2024-09-21 18:22:29,037 - Constructing matrix with dashing (jaccard index)
2024-09-21 18:22:33,708 - Hierarchical clustering
Traceback (most recent call last):
  File "/data/shared_resources/conda_local/envs/strainscan/bin/strainscan_build", line 10, in <module>
    sys.exit(main())
  File "/data/shared_resources/conda_local/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 117, in main
    cls_file, cls_res)
  File "/data/shared_resources/conda_local/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/select_rep.py", line 44, in pick_rep
    clsa.append(int(ele[0]))
ValueError: invalid literal for int() with base 10: 'WARNING:'

Looking at the tail of hclsMap_95.txt, I see the following:

1   1   MIKI-NS13
2   1   MIKI-NS15
WARNING:    0   
ignoring    0   
environment 0   
value   0   
of  0   
R_HOME  0

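I think these extra lines are causing the problem: pick_rep tries to convert the first field of each line to an integer and fails on 'WARNING:'.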
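The traceback and the file tail fit together: `pick_rep` calls `int()` on the first field of a line, so a line that does not start with an integer raises this `ValueError`. A minimal Python sketch for listing such lines follows. The helper name `find_bad_lines` is hypothetical and not part of StrainScan, and whitespace-separated fields are an assumption based on the tail shown above.

```python
def find_bad_lines(lines):
    """Return (line_number, text) for lines whose first field is not an integer."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: fields are separated by tabs or spaces
        if not fields:
            continue  # skip blank lines
        try:
            int(fields[0])
        except ValueError:
            bad.append((number, line.strip()))
    return bad


# Sample data copied from the tail of hclsMap_95.txt shown in this issue.
sample = [
    "1\t1\tMIKI-NS13\n",
    "2\t1\tMIKI-NS15\n",
    "WARNING:\t0\t\n",
    "ignoring\t0\t\n",
    "environment\t0\t\n",
    "value\t0\t\n",
    "of\t0\t\n",
    "R_HOME\t0\n",
]

bad_lines = find_bad_lines(sample)
for number, text in bad_lines:
    print(f"line {number}: {text!r}")
```

To check a real file, pass an open file handle instead of `sample`, for example `find_bad_lines(open("hclsMap_95.txt"))`.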

liaoherui commented 3 weeks ago

Hi, thanks for using StrainScan!

This issue might be related to a problematic filename. Could you share the filename list with me? Alternatively, you can send some of your input genomes for debugging, and I'll test the code to find a solution.

kheber commented 3 weeks ago

I have attached the list of genome filenames. They come from the CAMI challenge "strain-madness" dataset, which I downloaded from here.

I did manage to find a temporary fix: I deleted the problematic lines from hclsMap_95.txt and passed the cleaned file with the -c option. The values in the second column still added up to the number of genomes I had provided, so this seemed safe. Do you think it is valid to go forward with results produced this way?
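The manual cleanup and the sanity check on the second column can be sketched in Python as follows. This is an illustration only: `clean_cluster_map` is a hypothetical helper, and the assumed layout (integer cluster ID, integer cluster size, then member names) is inferred from the two valid rows shown in this issue, not from StrainScan's documentation.

```python
def clean_cluster_map(lines):
    """Keep rows that look like 'id  size  members' and sum the size column."""
    kept = []
    total = 0
    for line in lines:
        fields = line.split()
        # Assumption: a valid row has an integer ID, an integer size,
        # and at least one member name.
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            kept.append(line if line.endswith("\n") else line + "\n")
            total += int(fields[1])
    return kept, total


# Sample data copied from the tail of hclsMap_95.txt shown in this issue.
sample = [
    "1\t1\tMIKI-NS13\n",
    "2\t1\tMIKI-NS15\n",
    "WARNING:\t0\t\n",
    "ignoring\t0\t\n",
    "environment\t0\t\n",
    "value\t0\t\n",
    "of\t0\t\n",
    "R_HOME\t0\n",
]

kept, total = clean_cluster_map(sample)
expected_genomes = 2  # number of genomes covered by the sample rows

if total != expected_genomes:
    raise SystemExit(f"size column sums to {total}, expected {expected_genomes}")
print(f"kept {len(kept)} rows covering {total} genomes")
```

For the full file, read the lines from hclsMap_95.txt, set `expected_genomes` to the number of input genomes, and write `kept` to a new file to pass with the -c option.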

genome_filenames.txt

liaoherui commented 3 weeks ago

I think you can try. If it completes without errors, it should be valid. I'm still wondering why these lines appear in your hclsMap_95.txt file:

WARNING:    0
ignoring    0   
environment 0   
value   0   
of  0   
R_HOME  0
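One possible explanation, not confirmed in this thread: read in order, the stray tokens spell out "WARNING: ignoring environment value of R_HOME", which is the message R's launcher script prints when an R_HOME environment variable is already set to a different location. If the clustering step calls R and that message gets mixed into the parsed output, running the build with R_HOME unset might avoid it. The Python sketch below shows one way to do that for a single command; the helper names are hypothetical, and whether this actually removes the extra lines is untested here.

```python
import os
import subprocess


def env_without(name, base=None):
    """Return a copy of an environment mapping with `name` removed."""
    env = dict(os.environ if base is None else base)
    env.pop(name, None)
    return env


def run_without_r_home(cmd):
    """Run `cmd` (a list of arguments) with R_HOME unset for the child process only."""
    return subprocess.run(cmd, env=env_without("R_HOME"), check=True)


# Demonstration on a made-up environment, so nothing is executed here.
demo = env_without("R_HOME", {"R_HOME": "/opt/R", "PATH": "/usr/bin"})
print(demo)
```

To try it, call `run_without_r_home` with the same strainscan_build command line that produced the error, given as a list of arguments. The parent shell's environment is left unchanged.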