Open NamenloseHeldin opened 9 months ago
Hi, thanks for using StrainScan!
Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version.
conda install -c bioconda strainscan
The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.
For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!
Hi, thanks for using StrainScan!
Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version.
conda install -c bioconda strainscan
The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.
For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!
I have the same error when I used ~2000 genomes to build a customized database. Is it a problem when the genome number is too high?
Hi,
It's hard to say. Have you tried using StrainScan to build a database with a subset of your 2000 genomes? (e.g. with 100 genomes)? If it works, then it may be a result of large number genomes. Otherwise, it may be caused by other reasons.
However, according to my experience, the large number of genomes usually lead to memory issue rather than "index" error. Need to check. Sorry for the inconvenience.
Hey,
Thanks for your reply! ! A subset of genomes worked. I am using genome & MAGs for the database building. Do you think the poor quality of MAGs can cause this "index" problem? I did not encounter memory problems using 80 CPU and 600GB memory.
Hi,
Sorry for late reply. Can you send me your hclsMap_95.txt
file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!
Hi,
Sorry for late reply. Can you send me your
hclsMap_95.txt
file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!
Dear Ray,
I have attached the "hclsMap_95.txt" file. Thanks for checking!
Hi,
I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt
) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!
Hi,
I tested the code (line 46, index error part) with the provided file (
hclsMap_95.txt
) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!
This is the error log I got
"2024-05-13 11:58:13,080 - Constructing matrix with dashing (jaccard index)
2024-05-13 11:58:51,360 - Hierarchical clustering
2024-05-13 11:59:00,113 - Constructing the tree
Traceback (most recent call last):
File "/lisc/user/ye/.conda/envs/strainscan/bin/strainscan_build", line 10, in
I see... It seems that this error is caused by hclsMap_95_recls.txt
(the same folder with hclsMap_95.txt
) rather than hclsMap_95.txt
. Can you send me that file, too? Thanks!
I see... It seems that this error is caused by
hclsMap_95_recls.txt
(the same folder withhclsMap_95.txt
) rather thanhclsMap_95.txt
. Can you send me that file, too? Thanks!
hey, I have attached the file. Thanks hclsMap_95_recls.txt hclsMap_95_Rep.txt
I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!
I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!
Dear Ray,
Thank you for your immediate assistance with troubleshooting.
My genome names are in "*bin.number.fna" format and the pipeline seems to cut off whatever is after bin, so some MAGs ended up with the same strain name causing the issue as mentioned above. The database is working at the moment. Nice pipeline!
Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!
Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!
Dear Ray,
I would like to hear your opinion about the analysis. Since I have ~2000 genomes, it can end up with many clusters. So I decided to define the clusters first and build a customized database. Can I compare the relative abundance of defined clusters in a metagenomic sample? The Strainscan output sums up the abundance of all strains to 1. Does it mean that I can only compare the composition of all clusters in a sample, but not the relative abundance.
"Can I compare the relative abundance of defined clusters in a metagenomic sample?"
Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.
For example, you got something like ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr 1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452 2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876 3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506 4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706
Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).
In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)
"Can I compare the relative abundance of defined clusters in a metagenomic sample?"
Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.
For example, you got something like ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr 1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452 2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876 3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506 4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706
Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).
In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)
Hey Ray,
Thanks for the detailed explanation.
Dear Ray,
One more question popped up. I ended up with C146, C154, ... C2225, which are not continuous cluster numbers. Did the pipeline filter out C1, C2, C3 etc.? And how?
Another thing is that I used ~2400 genomes and got 2200 clusters according to the the database. Does it make sense?
Thanks! hclsMap_95_recls.txt
Hi,
For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number.
For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high.
Dear Ray,
Thanks for your help again!
On Mon, May 20, 2024 at 3:07 AM Ray @.***> wrote:
Hi,
For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number.
For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high.
— Reply to this email directly, view it on GitHub https://github.com/liaoherui/StrainScan/issues/11#issuecomment-2119540818, or unsubscribe https://github.com/notifications/unsubscribe-auth/A4XEWANJH6LLK622L3XKNKDZDFLHFAVCNFSM6AAAAAA5VX25HSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJZGU2DAOBRHA . You are receiving this because you commented.Message ID: @.***>
Hi! I would like to use your tool with my own reference database.
I installed using option 2 after reading through #10
However, when I try to run the StrainScan_build.py command with your test data, I get the following error:
~/StrainScan$ python StrainScan_build.py -i Test_genomes -o DB_Small
When I try to use it with my own genomes, I get the following different error:
python StrainScan_build.py -i /path/fasta/ -o /StrainScan_DB
Any ideas?