jessieren / VirHostMatcher

VirHostMatcher: matching hosts of viruses based on oligonucleotide frequency (ONF) comparison

VirHostMatcher not producing an outfile #1

Closed grean326 closed 7 years ago

grean326 commented 7 years ago

We have compiled VirHostMatcher and run the test. The test produces an output file, but when we use our own data the program runs and fails to produce an output file. We don't have an annotation file for our host data because the hosts are also metagenomic assemblies. Is there a way to run VirHostMatcher without the host annotation file?

jessieren commented 7 years ago

Hi there,

Thank you for using VirHostMatcher.

VirHostMatcher should be able to run without the annotation file. Could you please check if there is a file in the /tmp directory called computeMeasureOut.log? Is there any content in that file? If not, could you please let me know how many viral contigs and host contigs are in your dataset. I will try to figure out the problem using the same size of the dataset as yours.

Thank you!

Jessie

grean326 commented 7 years ago

Hi Jessie,

The log file is empty. We are looking to identify the hosts of 261903 viral contigs against 683563 bacterial metagenome assemblies.

Thanks, Ann

jessieren commented 7 years ago

Hi Ann,

Thank you very much for your feedback. I fixed the problem of the empty log file. Please download the new version of the file "vhm.py" and try again. The log file should now be at /tmp/vhm.log.

The new program also prints the estimated time remaining (ETR) on the screen. The estimate becomes accurate once it stabilizes after a few iterations. The program has two steps: 1) counting k-mer frequencies and 2) computing the virus-host distances/dissimilarities. The time cost of counting k-mer frequencies is linear in the length of the input sequence. Computing the distances/dissimilarities for one virus-host pair takes ~0.016 seconds when only d2star is computed and ~0.060 seconds when all 11 distances/dissimilarities are computed. The estimated average time for counting k-mer frequencies of one virus/host sequence and the estimated average time for computing dissimilarities for one virus-host pair are also shown on the screen.

Suppose the input has N virus contigs and M host contigs, the average time for counting k-mer frequencies of one sequence is t1 seconds, and the average time for computing distances/dissimilarities of one virus-host pair is t2 seconds. Then the total time cost is (N+M)×t1 + N×M×t2. You may use this formula to roughly estimate the total running time of the program.
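As a rough illustration of the formula, here is a minimal Python sketch that plugs in the dataset size reported earlier in this thread. The function name `estimate_total_seconds` is hypothetical and not part of VirHostMatcher. The value of t1 is a placeholder assumption, since no per-sequence counting time is given here; replace it with the average that the program prints on screen.

```python
def estimate_total_seconds(n_virus, n_host, t1, t2):
    """Total time = (N + M) * t1 + N * M * t2, all times in seconds."""
    counting = (n_virus + n_host) * t1   # step 1: k-mer counting, once per sequence
    pairwise = n_virus * n_host * t2     # step 2: one computation per virus-host pair
    return counting + pairwise


if __name__ == "__main__":
    N, M = 261903, 683563   # contig counts reported in this thread
    T2_D2STAR = 0.016       # seconds per pair when only d2star is computed
    T1_PLACEHOLDER = 1.0    # ASSUMPTION: placeholder, use the value shown on screen

    total = estimate_total_seconds(N, M, T1_PLACEHOLDER, T2_D2STAR)
    print(f"Estimated total: {total:.3g} s ({total / 86400:.3g} days)")
```

For a dataset of this size the N×M term dominates: it covers on the order of 10^11 virus-host pairs, so at ~0.016 seconds per pair the pairwise step alone would take on the order of 10^9 seconds. Reducing the number of contigs shrinks that term proportionally.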

We strongly suggest using contigs of at least 5kb (>10kb is better) for predicting virus-host interactions, because our method maintains decent prediction accuracy for contigs >10kb, as shown in Fig 3A in the paper.
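One way to apply this length cutoff before running the program is to filter the FASTA input. The following is a minimal sketch using only the Python standard library; the helper names `read_fasta` and `filter_by_length` are hypothetical and not part of VirHostMatcher, and the sketch assumes plain, uncompressed FASTA text.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=5000):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


if __name__ == "__main__":
    # Tiny in-memory example; a real run would iterate over an open file.
    demo = [">short", "ACGT", ">long", "ACGT" * 2000]
    kept = filter_by_length(read_fasta(demo), min_length=5000)
    for header, seq in kept:
        print(header, len(seq))
```

Setting `min_length=10000` gives the stricter cutoff suggested above. Dropping short contigs also reduces N and M, which lowers the running time.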

Please let me know if there are other problems. Thank you!

Best wishes, Jessie