Closed Rohit-Satyam closed 7 months ago
Hi, thanks for using VirStrain!
As you know, it's Chinese New Year, and I am on the vacation this week. I will check this issue once I go back to work next week! Apologize for the inconvenience caused by this.
No issues. Please enjoy your holiday. Happy New Year to you and your family.
I will keep posting my queries and you can attend them at your convenience. This is a really nice tool and I really want to use it in our two viral sequencing projects related to DENV and FMDV.
One more observation, since the tool create file called Tem_Vs
I am not able to run multiple samples using GNU parallel since this file is overwritten or deleted when multiple jobs are fired from the same directory as can be seen below when I ran virstrain separately on two terminal but from same directory.
rm: cannot remove 'Tem_Vs*': No such file or directory
rm: cannot remove 'Tem_VS*': No such file or directory
It would be helpful if this file is named randomly so that when multiple jobs are fired it doesn't get overwritten or deleted.
Finally for DENV and FMDV do you recommend using --high_mutation_virus
option?
Hi, sorry for the late reply. For the contig-mode of VirStrain, the current version only supports the pre-built database we provided. If you want to use it on your custom database, then you need to copy the "Pos-snp-kmer-all.fa" to the same dir of the DB and modify its headline. For example:
I have a database named "Adv" (see below): Then, I need to rebuild a new database called Adv_contig (see below): where "Adv" is the raw database, and "merge_db.fa" is a modified version of "Pos-snp-kmer-all.fa". The main modification is that its headline contains the "_Adv" information. (as shown below) Note, the name after the dash must be the same as the name of raw database.
Then, you may be able to use this database for the contig-level viral strain identification.
For the suggestion about "Tem_Vs", I will update the program acoording to the requirement asap.
--high_mutation_virus
is not recommended for DENV and FMDV.
Sorry for the inconvenience caused by the problem. Any further problems, please let me know. Thanks!
Thanks @liaoherui. Last week I checked how Virstrain performed in the serotype assignment and subclade assignment and I was glad to see it was able to accurately identify both with 100% accuracy for Illumina and MinIon datasets. Thanks for developing and maintaining this tool. Your tool is a savior.
For the suggestion about "Tem_Vs", I will update the program according to the requirement asap.
Very many thanks for keeping this tool updated. Yes, we have more than 1K dengue samples and parallization can help.
For assembly-level comparison, I will test your recommendations and get back.
Hi, Rohit,
I have updated the tool to support the random name of Tem_Vs files (only GitHub version). And you can try the new version now.
For contig-level identification, I would say this tool is originally designed for reads, so I am not sure how good the performance will be at the contig level. But you can give it a try!
Any other problems, please let me know!
Hi @liaoherui
I was testing the Virstrain and thanks for the changes you made. I am facing the following error:
Traceback (most recent call last):
File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/VirStrain/S4_Plot_strain_cov.py", line 12, in <module>
sfl=pd.read_csv('Mps_ps_depth.csv')
File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
The files I have now are Minion files (in-house) and some files run fine while the rest throw the error above. I tested your tool on the publically available Minion dataset and it works well. I think it's a memory issue. Is there a way to deal with this memory issue?
Hi, @Rohit-Satyam
This problem is caused by S4_Plot_strain_cov.py. I have added one parameter "-f" to turn off figure generation. In you case, you can add the parameter -f 1
in your command to avoid this problem.
I tested this and now I get the following message:
/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
I hope this is just a warning (because I have results for all the samples) and can be either silenced or can be ignored by me right? Besides, for one last sample, I get the following error:
bin/S3_Strain_pred_My_Method_V0819_Val.py:170: RuntimeWarning: invalid value encountered in true_divide
weighted_freq_arr=freq_arr/np.sum(freq_arr)
Traceback (most recent call last):
File "bin/S3_Strain_pred_My_Method_V0819_Val.py", line 328, in <module>
pre_freq_arr=pre_freq_arr[keep]
IndexError: list index out of range
I think this error is because of poor genomic coverage maybe (haven't looked at the genomic coverage of other samples that ran successfully) because this sample covers only 38% of genome when quickly analysed by genomedetective. I will still attach the fastq file just in case if you want to confirm its not a code issue.
Hi, Rohit,
Usually, the warning information can be ignored.
For the error, it is very likely that low sequencing depth is causing the problem. When the depth is pretty low, there could not be enough/valid k-mers for identifying viral strains. And the error above could happen.
For the data provided, would you mind also provide the pre-built database you used to analyze the data? In this case, I can repeat the error to confirm the real reason. Thanks!
Sure. Here is the link to the database (am using free service so file will disappear after 5 days): https://fastupload.io/W9zM2TQL1LwjIRy/file
Hi, Rohit,
I have checked the data. The problem is really caused by low coverage. In this case, I recommend you add -s 0.8
or -s 0.9
when you use VirStrain_build.py to build the database. By doing this, the database can contain more genome sites and it's very likely you can also obtain result for this data.
However, I will say the low-depth data is still a limitation of VirStrain. I will consider extending the program to handle this problem in the future. Sorry for the inconvenience caused by this. Further problems or results, please let me know! Thanks!
Thanks for your response @liaoherui. One last thing before I close this issue. Can you update the VirStrain_contig
help section and push the changes you made so far to the conda package as well? That way I can easily include it in my nextflow pipeline.
Very many thanks for being such a wonderful developer and helping me out.
Hi, Rohit,
Sure. I will update the help section and the tool on Bioconda when I am free. This process may take about 1 week. Sorry for the inconvenience caused by this.
Hi, Rohit,
I have updated the package to V1.17. Now you can try the latest version via both GitHub and Conda. Furthermore, I have added one new script (VirStrain_contigDB_merge.py
or virstrain_merge
(if you install virstrain using Conda)) to help users to convert read-level databases to contig-level databases. Any related problems, please let me know. Again, thanks for using VirStrain and your valuable suggestions!
Hi @liaoherui
I downloaded virstrain using conda and it appears that
VirStrain_contig
is missing from the conda version. So I clone your repository to use it but I see that the help section of this script is same asvirstrain.py
I anyway ran the script but then got the following error:
Note: I created my database from 4.7K complete sequences as follows:
These genomes have at least 60% of genome similarity when checked with pyANI so is it valid to set
-s 0.6
i.e use at least 60% of the total useful sites?