liaoherui / StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers
https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01615-w
MIT License
32 stars 4 forks source link

How to create custom databases #11

Open NamenloseHeldin opened 9 months ago

NamenloseHeldin commented 9 months ago

Hi! I would like to use your tool with my own reference database.

I installed using option 2 after reading through #10

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

However, when I try to run the StrainScan_build.py command with your test data, I get the following error: ~/StrainScan$ python StrainScan_build.py -i Test_genomes -o DB_Small

2023-10-06 12:14:33,705 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:14:33,822 - Hierarchical clustering
2023-10-06 12:14:34,117 - Constructing the tree
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 127, in main
    31, params])
  File "/home/StrainScan/library/Build_tree.py", line 385, in build_tree
    kmer_index = extract_kmers(fna_mapping[i.identifier], fna_path, ksize, kmer_index_dict, kmer_index, Lv, spec, tree_dir, alpha_ratio, i.identifier)
  File "/home/StrainScan/library/Build_tree.py", line 115, in extract_kmers
    reverse = seqpy.revcomp(forward)
AttributeError: module 'seqpy' has no attribute 'revcomp'

When I try to use it with my own genomes, I get the following different error: python StrainScan_build.py -i /path/fasta/ -o /StrainScan_DB

2023-10-06 12:23:53,950 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:23:54,539 - Hierarchical clustering
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 116, in main
    cls_file, cls_res)
  File "/home/StrainScan/library/select_rep.py", line 46, in pick_rep
    final_res[ele[2]]=int(ele[0])
IndexError: list index out of range

Any ideas?

liaoherui commented 9 months ago

Hi, thanks for using StrainScan!

Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version. conda install -c bioconda strainscan

The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.

For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!

dugala239 commented 1 month ago

Hi, thanks for using StrainScan!

Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version. conda install -c bioconda strainscan

The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.

For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!

I have the same error when I used ~2000 genomes to build a customized database. Is it a problem when the genome number is too high?

liaoherui commented 1 month ago

Hi,

It's hard to say. Have you tried using StrainScan to build a database with a subset of your 2000 genomes? (e.g. with 100 genomes)? If it works, then it may be a result of large number genomes. Otherwise, it may be caused by other reasons.

However, according to my experience, the large number of genomes usually lead to memory issue rather than "index" error. Need to check. Sorry for the inconvenience.

dugala239 commented 1 month ago

Hey,

Thanks for your reply! ! A subset of genomes worked. I am using genome & MAGs for the database building. Do you think the poor quality of MAGs can cause this "index" problem? I did not encounter memory problems using 80 CPU and 600GB memory.

liaoherui commented 1 month ago

Hi,

Sorry for late reply. Can you send me your hclsMap_95.txt file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!

dugala239 commented 1 month ago

Hi,

Sorry for late reply. Can you send me your hclsMap_95.txt file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!

Dear Ray,

I have attached the "hclsMap_95.txt" file. Thanks for checking!

hclsMap_95.txt

liaoherui commented 1 month ago

Hi,

I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!

dugala239 commented 1 month ago

Hi,

I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!

This is the error log I got "2024-05-13 11:58:13,080 - Constructing matrix with dashing (jaccard index) 2024-05-13 11:58:51,360 - Hierarchical clustering 2024-05-13 11:59:00,113 - Constructing the tree Traceback (most recent call last): File "/lisc/user/ye/.conda/envs/strainscan/bin/strainscan_build", line 10, in sys.exit(main()) File "/lisc/user/ye/.conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 128, in main 31, params]) File "/lisc/user/ye/.conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 281, in build_tree for i in temp[2].split(","): IndexError: list index out of range strainscan_build-1505346.err (END)"

liaoherui commented 1 month ago

I see... It seems that this error is caused by hclsMap_95_recls.txt (the same folder with hclsMap_95.txt) rather than hclsMap_95.txt. Can you send me that file, too? Thanks!

dugala239 commented 1 month ago

I see... It seems that this error is caused by hclsMap_95_recls.txt (the same folder with hclsMap_95.txt) rather than hclsMap_95.txt. Can you send me that file, too? Thanks!

hey, I have attached the file. Thanks hclsMap_95_recls.txt hclsMap_95_Rep.txt

liaoherui commented 1 month ago

I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!

image

dugala239 commented 1 month ago

I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!

image

Dear Ray,

Thank you for your immediate assistance with troubleshooting.

My genome names are in "*bin.number.fna" format and the pipeline seems to cut off whatever is after bin, so some MAGs ended up with the same strain name causing the issue as mentioned above. The database is working at the moment. Nice pipeline!

liaoherui commented 1 month ago

Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!

dugala239 commented 1 month ago

Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!

Dear Ray,

I would like to hear your opinion about the analysis. Since I have ~2000 genomes, it can end up with many clusters. So I decided to define the clusters first and build a customized database. Can I compare the relative abundance of defined clusters in a metagenomic sample? The Strainscan output sums up the abundance of all strains to 1. Does it mean that I can only compare the composition of all clusters in a sample, but not the relative abundance.

liaoherui commented 1 month ago

"Can I compare the relative abundance of defined clusters in a metagenomic sample?"

Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.

For example, you got something like ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr 1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452 2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876 3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506 4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706

Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).

In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)

dugala239 commented 1 month ago

"Can I compare the relative abundance of defined clusters in a metagenomic sample?"

Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.

For example, you got something like ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr 1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452 2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876 3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506 4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706

Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).

In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)

Hey Ray,

Thanks for the detailed explanation.

dugala239 commented 1 month ago

Dear Ray,

One more question popped up. I ended up with C146, C154, ... C2225, which are not continuous cluster numbers. Did the pipeline filter out C1, C2, C3 etc.? And how?

Another thing is that I used ~2400 genomes and got 2200 clusters according to the the database. Does it make sense?

Thanks! hclsMap_95_recls.txt

liaoherui commented 1 month ago

Hi,

For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number.

For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high.

dugala239 commented 1 month ago

Dear Ray,

Thanks for your help again!

On Mon, May 20, 2024 at 3:07 AM Ray @.***> wrote:

Hi,

For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number.

For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high.

— Reply to this email directly, view it on GitHub https://github.com/liaoherui/StrainScan/issues/11#issuecomment-2119540818, or unsubscribe https://github.com/notifications/unsubscribe-auth/A4XEWANJH6LLK622L3XKNKDZDFLHFAVCNFSM6AAAAAA5VX25HSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJZGU2DAOBRHA . You are receiving this because you commented.Message ID: @.***>