liaoherui / VirStrain

An RNA virus strain-level identification tool for short reads.
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02609-x
MIT License

revision required for help section of VirStrain_contig and error resolution for "Could not open read file "../database//merge_db.fa" for reading; skipping" #14

Closed Rohit-Satyam closed 7 months ago

Rohit-Satyam commented 9 months ago

Hi @liaoherui

I downloaded VirStrain using conda, and it appears that VirStrain_contig is missing from the conda version. So I cloned your repository to use it, but I see that the help section of this script is the same as that of VirStrain.py:

>python VirStrain_contig.py -h
usage: VirStrain.py [-h] -i INPUT_READS [-p INPUT_READS2] -d DB_DIR
                    [-o OUT_DIR] [-c SF_CUTOFF] [-v CV_MODE] [-s RK_SITE]
                    [-m [HM_VIRUS]]

VirStrain - An RNA virus strain-level identification tool for short reads.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_READS, --input_reads INPUT_READS
                        Input fastq data --- Required
  -p INPUT_READS2, --input_reads2 INPUT_READS2
                        Input fastq data for PE reads.
  -d DB_DIR, --database_dir DB_DIR
                        Database dir --- Required
  -o OUT_DIR, --output_dir OUT_DIR
                        Output dir (default: current dir/VirStrain_Out)
  -c SF_CUTOFF, --site_filter_cutoff SF_CUTOFF
                        The cutoff of filtering one site (default: 0.05)
  -v CV_MODE, --comprehensive_mode CV_MODE
                        If set to 1, then VirStrain will identify viral
                        strains for input contigs in a more comprehensive
                        database. (default: 0)
  -s RK_SITE, --rank_by_sites RK_SITE
                        If set to 1, then VirStrain will sort the most
                        possible strain by matches to the sites. (default: 0)
  -m [HM_VIRUS], --high_mutation_virus [HM_VIRUS]
                        If the virus has high mutation rate, use this option.
                        (default: Not use)

I anyway ran the script but then got the following error:

python VirStrain_contig.py -i ../test.fasta  -d ../database/ -o test_assembly -v 1
Settings:
  Output files: "600bb1aecab911ee9eb43ceceff62d3e.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../test.fasta
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 2676
Using parameters --bmax 2007 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 2007 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 10707 (target: 2006)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 10707 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 10708 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 3425
fchr[G]: 5643
fchr[T]: 8426
fchr[$]: 10707
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4198060 bytes to primary EBWT file: 600bb1aecab911ee9eb43ceceff62d3e.1.bt2
Wrote 2684 bytes to secondary EBWT file: 600bb1aecab911ee9eb43ceceff62d3e.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 10707
    bwtLen: 10708
    sz: 2677
    bwtSz: 2677
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 670
    offsSz: 2680
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 56
    numLines: 56
    ebwtTotLen: 3584
    ebwtTotSz: 3584
    color: 0
    reverse: 0
Total time for call to driver() for forward index: 00:00:00
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 2676
Using parameters --bmax 2007 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 2007 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 10707 (target: 2006)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 10707 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 10708 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 3425
fchr[G]: 5643
fchr[T]: 8426
fchr[$]: 10707
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4198060 bytes to primary EBWT file: 600bb1aecab911ee9eb43ceceff62d3e.rev.1.bt2
Wrote 2684 bytes to secondary EBWT file: 600bb1aecab911ee9eb43ceceff62d3e.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 10707
    bwtLen: 10708
    sz: 2677
    bwtSz: 2677
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 670
    offsSz: 2680
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 56
    numLines: 56
    ebwtTotLen: 3584
    ebwtTotSz: 3584
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:00
stat: Bad file descriptor
Warning: Could not open read file "../database//merge_db.fa" for reading; skipping...
Error: No input read files were valid
(ERR): bowtie2-align exited with value 1
Traceback (most recent call last):
  File "VirStrain_contig.py", line 153, in <module>
    sys.exit(main())
  File "VirStrain_contig.py", line 101, in main
    contigs,uid=split_contig(in_read1,db_dir+'/merge_db.fa',o)
  File "VirStrain_contig.py", line 16, in split_contig
    f=open(uid1+'.sam','r')
FileNotFoundError: [Errno 2] No such file or directory: '600bb1aecab911ee9eb43ceceff62d3e.sam'

Note: I created my database from 4.7K complete sequences as follows:

mafft --auto --thread 10 temp.fasta > mafft.aln
virstrain_build -d database -s 0.6 -i mafft.aln

These genomes share at least 60% genome similarity when checked with pyANI, so is it valid to set -s 0.6, i.e. to use at least 60% of the total useful sites?

liaoherui commented 9 months ago

Hi, thanks for using VirStrain!

As you know, it's Chinese New Year, and I am on vacation this week. I will check this issue once I am back at work next week! Apologies for any inconvenience this causes.

Rohit-Satyam commented 9 months ago

No issues. Please enjoy your holiday. Happy New Year to you and your family.

I will keep posting my queries, and you can attend to them at your convenience. This is a really nice tool, and I want to use it in two of our viral sequencing projects related to DENV and FMDV.

One more observation: since the tool creates a file called Tem_Vs, I am not able to run multiple samples using GNU parallel, because this file is overwritten or deleted when multiple jobs are fired from the same directory. This can be seen below, where I ran VirStrain separately in two terminals from the same directory.

rm: cannot remove 'Tem_Vs*': No such file or directory
rm: cannot remove 'Tem_VS*': No such file or directory

It would be helpful if this file were given a random name, so that it is not overwritten or deleted when multiple jobs are fired.
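The suggestion above can be sketched as follows. This is a hypothetical helper for illustration only (`unique_tem_vs_name` is not part of VirStrain): each run derives its own scratch-file name from a UUID, so concurrent jobs launched from the same directory never collide.

```python
import uuid

def unique_tem_vs_name(prefix="Tem_VS"):
    """Return a per-run scratch-file name so concurrent jobs never collide.

    Hypothetical helper for illustration; VirStrain's actual fix may differ.
    """
    return f"{prefix}_{uuid.uuid4().hex}"

# Two jobs started from the same directory get distinct scratch files:
name_a = unique_tem_vs_name()
name_b = unique_tem_vs_name()
```

With names like these, cleanup (`rm Tem_VS_<uuid>*`) only ever touches the current job's files.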

Finally, for DENV and FMDV, do you recommend using the --high_mutation_virus option?

liaoherui commented 8 months ago

Hi, sorry for the late reply. For the contig mode of VirStrain, the current version only supports the pre-built databases we provide. If you want to use it on your custom database, you need to copy "Pos-snp-kmer-all.fa" into the same directory as the DB and modify its header line. For example:

I have a database named "Adv" (see screenshot). Then, I need to build a new database called Adv_contig (see screenshot), where "Adv" is the raw database and "merge_db.fa" is a modified version of "Pos-snp-kmer-all.fa". The main modification is that its header line contains the "_Adv" information (as shown in the screenshot). Note: the name after the underscore must be the same as the name of the raw database.

Then, you may be able to use this database for the contig-level viral strain identification.
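As a rough sketch of the header edit described above, the following assumes the database-name tag is simply appended to the end of each FASTA header line; the exact convention VirStrain expects is shown in the maintainer's screenshots, so treat this as illustrative only (`tag_fasta_headers` is a hypothetical helper).

```python
def tag_fasta_headers(in_path, out_path, db_name):
    """Copy a FASTA file, appending '_<db_name>' to every header line.

    Illustrative sketch only: the precise header format VirStrain expects
    is shown in the maintainer's screenshots, not reproduced here.
    Sequence lines are copied through unchanged.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(line.rstrip("\n") + "_" + db_name + "\n")
            else:
                fout.write(line)
```

Under that assumption, `tag_fasta_headers("Pos-snp-kmer-all.fa", "merge_db.fa", "Adv")` would produce headers ending in `_Adv`.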

For the suggestion about "Tem_Vs", I will update the program according to your request as soon as possible.

--high_mutation_virus is not recommended for DENV and FMDV.

Sorry for the inconvenience caused by the problem. Any further problems, please let me know. Thanks!

Rohit-Satyam commented 8 months ago

Thanks @liaoherui. Last week I checked how VirStrain performed on serotype and subclade assignment, and I was glad to see it identified both with 100% accuracy for the Illumina and MinION datasets. Thanks for developing and maintaining this tool. Your tool is a savior.

For the suggestion about "Tem_Vs", I will update the program according to the requirement asap.

Many thanks for keeping this tool updated. Yes, we have more than 1K dengue samples, and parallelization will help.

For assembly-level comparison, I will test your recommendations and get back.

liaoherui commented 8 months ago

Hi, Rohit,

I have updated the tool so that the Tem_Vs files are given random names (GitHub version only). You can try the new version now.

For contig-level identification, I would say this tool was originally designed for reads, so I am not sure how good the performance will be at the contig level. But you can give it a try!

Any other problems, please let me know!

Rohit-Satyam commented 8 months ago

Hi @liaoherui

I was testing VirStrain, and thanks for the changes you made. However, I am now facing the following error:

Traceback (most recent call last):
  File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/VirStrain/S4_Plot_strain_cov.py", line 12, in <module>
    sfl=pd.read_csv('Mps_ps_depth.csv')
  File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

The files I have now are in-house MinION files; some of them run fine while the rest throw the error above. I tested your tool on a publicly available MinION dataset and it works well. I think it's a memory issue. Is there a way to deal with it?

liaoherui commented 8 months ago

Hi, @Rohit-Satyam

This problem is caused by S4_Plot_strain_cov.py. I have added a parameter "-f" to turn off figure generation. In your case, you can add -f 1 to your command to avoid this problem.

Rohit-Satyam commented 8 months ago

I tested this and now I get the following message:

/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/subudhak/miniconda3/envs/virstrain/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

I hope this is just a warning (I have results for all the samples) and can either be silenced or safely ignored, right? Besides, for one last sample, I get the following error:

bin/S3_Strain_pred_My_Method_V0819_Val.py:170: RuntimeWarning: invalid value encountered in true_divide
  weighted_freq_arr=freq_arr/np.sum(freq_arr)
Traceback (most recent call last):
  File "bin/S3_Strain_pred_My_Method_V0819_Val.py", line 328, in <module>
    pre_freq_arr=pre_freq_arr[keep]
IndexError: list index out of range

I think this error is caused by poor genomic coverage (I haven't checked the coverage of the samples that ran successfully), because this sample covers only 38% of the genome when quickly analysed by genomedetective. I will still attach the fastq file in case you want to confirm it's not a code issue.

MK-1-9_barcode17.zip

liaoherui commented 8 months ago

Hi, Rohit,

Usually, the warning information can be ignored.

For the error, it is very likely that low sequencing depth is causing the problem. When the depth is very low, there may not be enough valid k-mers for identifying viral strains, and the error above can occur.
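To illustrate the failure mode described above: when no informative sites are matched, the frequency array is empty, so `freq_arr/np.sum(freq_arr)` divides by zero (the `invalid value encountered in true_divide` warning) before the later indexing fails. A minimal guard could look like the following; this is a hypothetical sketch, not VirStrain's actual code.

```python
import numpy as np

def safe_weighted_freq(freq_arr):
    """Normalize a site-frequency array, returning None when there is
    nothing to normalize (empty input or all-zero counts).

    Hypothetical guard for illustration; not VirStrain's actual code.
    """
    freq_arr = np.asarray(freq_arr, dtype=float)
    total = freq_arr.sum()
    if freq_arr.size == 0 or total == 0:
        return None  # too few matched k-mers to call a strain
    return freq_arr / total
```

With such a guard, a low-depth sample would yield an explicit "no call" instead of NaN weights and an IndexError downstream.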

For the data provided, would you mind also providing the pre-built database you used to analyze the data? That way, I can reproduce the error and confirm the real cause. Thanks!

Rohit-Satyam commented 8 months ago

Sure. Here is the link to the database (I am using a free service, so the file will disappear after 5 days): https://fastupload.io/W9zM2TQL1LwjIRy/file

liaoherui commented 8 months ago

Hi, Rohit,

I have checked the data. The problem is indeed caused by low coverage. In this case, I recommend adding -s 0.8 or -s 0.9 when you use VirStrain_build.py to build the database. That way, the database contains more genome sites, and it is very likely you will also obtain a result for this data.

That said, low-depth data is still a limitation of VirStrain. I will consider extending the program to handle this problem in the future. Sorry for the inconvenience. If you have further problems or results, please let me know! Thanks!

Rohit-Satyam commented 8 months ago

Thanks for your response @liaoherui. One last thing before I close this issue: can you update the VirStrain_contig help section and push the changes you made so far to the conda package as well? That way, I can easily include it in my Nextflow pipeline.

Very many thanks for being such a wonderful developer and helping me out.

liaoherui commented 8 months ago

Hi, Rohit,

Sure. I will update the help section and the tool on Bioconda when I am free. This process may take about one week. Sorry for the inconvenience.

liaoherui commented 8 months ago

Hi, Rohit,

I have updated the package to V1.17. You can now try the latest version via both GitHub and Conda. Furthermore, I have added a new script (VirStrain_contigDB_merge.py, or virstrain_merge if you install VirStrain using Conda) to help users convert read-level databases to contig-level databases. If you run into any related problems, please let me know. Again, thanks for using VirStrain and for your valuable suggestions!