combine genomes - Githubissues

Biofarmer commented 2 years ago

Hi Jiarong,

I am using virsorter2 v2.2.3 to check thousands of genomes for virus sequences. I have a few questions before running:

1. Can I combine all genomes into one file, as long as the contig names are unique in the combined file?
2. Which is faster and better in your experience: running each genome individually or running the combined file?

Thanks Wang
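
A minimal sketch of how contig names could be kept unique when combining genomes, assuming one FASTA file per genome. The function names `prefix_contigs` and `combine_genomes` and the `__` separator are illustrative choices, not part of virsorter2.

```bash
# Prefix every FASTA header in one genome file with the file's base name
# (extension stripped), so contig names stay unique after concatenation.
prefix_contigs() {
  local fasta=$1
  local name
  name=$(basename "$fasta")
  name=${name%.*}
  awk -v p="$name" '/^>/ { print ">" p "__" substr($0, 2); next } { print }' "$fasta"
}

# Write all given genomes, with prefixed headers, to standard output.
combine_genomes() {
  local f
  for f in "$@"; do
    prefix_contigs "$f"
  done
}
```

For example, `combine_genomes genomes/*.fa > combined.fa` would write one combined file. This assumes the genome file names themselves are unique.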

jiarong commented 2 years ago

Hi, the two ways should give the same result. With thousands of genomes, running each genome separately in parallel on a computer cluster should be much faster.
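
One way to prepare per-genome runs is to generate one command line per genome and hand each line to the cluster scheduler as a separate task. The sketch below only prints commands and runs nothing. The function name `vs2_cmds` and the output directory layout are assumptions; the virsorter options are copied from the test command quoted later in this thread.

```bash
# Print (not run) one virsorter command per genome FASTA.
# Usage: vs2_cmds OUTROOT GENOME.fa [GENOME.fa ...]
vs2_cmds() {
  local outroot=$1
  shift
  local fasta name
  for fasta in "$@"; do
    name=$(basename "$fasta")
    name=${name%.*}
    # Paths are printed unquoted, so this assumes they contain no spaces.
    printf 'virsorter run -w %s/%s.out -i %s --min-length 1500 -j 4 all\n' \
      "$outroot" "$name" "$fasta"
  done
}
```

The printed lines could be saved to a file, for example `vs2_cmds out genomes/*.fa > commands.txt`, and each line run as one task of a job array, with the number of CPUs requested per task matching `-j`.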

Biofarmer commented 2 years ago

OK, thanks. If I run each genome individually, should I just use 4 threads (-j 4) per genome, since HMMSEARCH_THREADS is set to 4? With more threads (like -j 10), there would be no effect on the hmmsearch step, and the total time would be almost the same, right?

Biofarmer commented 2 years ago

@jiarong Thanks. May I ask a few more questions?

1. I have installed virsorter2, along with its databases and dependencies, by running virsorter setup -d db -j 4 on a computer cluster. The cluster is managed with Slurm, and no internet is available once a job is submitted. Is internet access needed while a job runs if the databases and dependencies have already been installed?
2. About manually installing the databases and dependencies: I found that the dependencies in conda_envs are not installed at first but are installed when the first sample is run, unlike virsorter setup -d db -j 4, which installs the dependencies in conda_envs along with downloading the databases. With a manual database installation, will the dependencies in conda_envs be installed again for every sample, or will all remaining samples skip this step once they are installed?
3. How many minutes does the test data take with 'virsorter run -w test.out -i test.fa --min-length 1500 -j 4 all'? I ran this command after setting 'virsorter config --set HMMSEARCH_THREADS=4', and it took 30 mins to finish. However, when inspecting the process with top, only one thread (100-150 in %CPU) seemed to be in use.

Thanks

jiarong commented 2 years ago

Right, -j 4 should be enough for a bacterial genome, and increasing threads won't improve speed much.

1&2: Once the db and dependencies are installed, an internet connection is NOT needed, and dependency installation is skipped.
3. This could happen. HMMER can be limited by things other than CPU. On computer clusters, it's likely the speed of reading the data over the network. It usually takes 10 - 20 mins on my server.

Biofarmer commented 2 years ago

@jiarong Thanks. May I confirm one more thing: when the databases are installed manually, I found that the dependencies in conda_envs are not installed at first but are installed when the first sample is run. Is that the expected way for the dependencies in conda_envs to be installed in a manual installation?

Thanks

jiarong commented 2 years ago

Dependencies can only be installed automatically. Usually, running the test example should have that done.

Biofarmer commented 2 years ago

Yes, I ran the test and found that the dependencies were installed automatically. But after running 'virsorter config --init-source --db-dir=./db' and before running the test example, conda_envs in the db directory is empty and the dependencies are not installed, right?

jiarong commented 2 years ago

Correct.
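
Since conda_envs stays empty until the first run, a quick check before submitting jobs to offline compute nodes might look like the sketch below. The function name `conda_envs_ready` is hypothetical, and a non-empty directory only suggests that the dependencies were installed; it does not prove the installation is complete.

```bash
# Return 0 if DB_DIR/conda_envs exists and contains at least one entry,
# non-zero otherwise.
conda_envs_ready() {
  local db_dir=$1
  [ -d "$db_dir/conda_envs" ] && [ -n "$(ls -A "$db_dir/conda_envs" 2>/dev/null)" ]
}
```

For example, `conda_envs_ready ./db || echo "run the test example once with internet access first"`.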

Biofarmer commented 2 years ago

Thanks, it is good to know more about virsorter2. May I also confirm whether the manually downloaded database (https://osf.io/v46sc/download) is up to date and exactly the same as the database installed by the 'virsorter setup' command (virsorter setup -d db -j 4)? Thanks

Biofarmer commented 2 years ago

@jiarong Sorry, a few follow-up questions:

1. A virus sequence is identified only within a single contig, not across contigs, right?
2. If --keep-original-seq is used, do I need to trim the sequences afterwards the way virsorter2 would, or will CheckV trim them so that they can be used directly in subsequent analyses?
3. The SOP (https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-5qpvoyqebg4o/v2?version_warning=no&step=3) says "A minimal length 5000 bp is chosen since it is the minimum required by downstream viral classification." Which step requires a length of 5000 bp?

Thanks

jiarong commented 2 years ago

Right, the manually downloaded database is the same as the one installed by virsorter setup.

1. Correct, NOT across contigs.
2. If you follow the SOP, the end trimming is done by CheckV.
3. Classification of viral sequences is not covered by the SOP, but it typically needs >5 kb for reliable taxonomy assignment.
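
For sequences that were produced with a shorter cutoff, a length filter along these lines could keep only those of at least 5 kb before classification. This is a sketch: `filter_min_length` is a hypothetical helper, not a virsorter2 or CheckV command, and within virsorter2 the `--min-length` option applies the cutoff at run time.

```bash
# Print only FASTA records whose sequence is at least MIN_LEN bp long.
# Sequences wrapped over several lines are joined onto one line.
filter_min_length() {
  local min_len=$1 fasta=$2
  awk -v min="$min_len" '
    function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$fasta"
}
```

For example, `filter_min_length 5000 viral.fa > viral.min5kb.fa`, where both file names are placeholders.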

Biofarmer commented 2 years ago

@jiarong Thank you very much!

jiarong / VirSorter2

combine genomes #113