jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

How do I decide whether a contig is a putative viral contig based on the results of VirSorter2? #5

Open susutBu opened 4 years ago

susutBu commented 4 years ago

Dear jiarong: First of all, thank you very much for developing such wonderful software. It works really well. When I compared virsorter2 with virsorter and virfinder under their default parameters, I found that virsorter2 reported far more viral contigs, tens of times more, which made me both excited and worried. We know that virsorter and virfinder results both need some filtering to be reliable, such as "category 1 and 2" for virsorter and "score > 0.7 and p < 0.05" for virfinder. So do I need to filter the results of virsorter2 too? What are the rules and criteria for filtering?

These are the results of testing the three tools:

  • Total contigs (input): 52497
  • Viral contigs identified by virsorter2: 35750
  • Viral contigs identified by virsorter: 1942
  • Viral contigs identified by virfinder: 1943

It looks like virsorter's results are more similar to virfinder's.

Thanks again. Looking forward to your reply.

jiarong commented 4 years ago

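Right, filtering is recommended. The current default cutoff (0.5) was set based on simulated virome data from known genomes. With real metagenomic data, which contain more unknown sequences, the default cutoff tends to give high sensitivity at the cost of more false positives. Shorter sequences also tend to have a higher false positive rate, and a higher proportion of cellular sequences can increase it as well. So no single cutoff fits all cases, which is why we leave the choice to users.

From my experience so far, I recommend >0.75. There are a few options for filtering:

  • Filter by score using the final-viral-score.tsv table.
  • Use the --min-score option.
  • Use the --hallmark-required-on-short option, which requires sequences shorter than 5K to have a hallmark gene.
  • For the highest confidence, pick only sequences with hallmark genes, using the hallmark field in the FASTA header in final-viral-combined.fa.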

susutBu commented 4 years ago

Thank you for your detailed explanation. This information is very useful.

wfgui commented 4 years ago

I have a question about final-viral-score.tsv. Some sequences in final-viral-combined.fa have similar-looking, high-scoring FASTA headers, but not all of them appear in final-viral-score.tsv. For example, this one is in final-viral-score.tsv:

>NODE_224_length_4468_cov_41.505099||full shape:circular||start:97||end:2840||group:NCLDV||score:0.94||hallmark:0

This one is not in final-viral-score.tsv:

>NODE_255_length_3849_cov_51.199262||full  shape:circular||start:3223||end:3600||group:dsDNAphage||score:0.887||hallmark:0

I don't know what criteria this file uses for filtering. Thanks!
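A discrepancy like this can be checked systematically. The following is a minimal sketch, not part of VirSorter2: it lists the names found in final-viral-combined.fa that have no row in final-viral-score.tsv. It assumes the sequence name is the first whitespace-delimited token of each FASTA header, and that the score table is tab-separated with a name column called `seqname`; both the column name and the helper function names are assumptions.

```python
import csv


def fasta_names(path):
    """Return the set of sequence names in a FASTA file.

    The name is taken as the first whitespace-delimited token of each
    header line, e.g. 'NODE_224_length_4468_cov_41.505099||full'.
    """
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    names.add(tokens[0])
    return names


def score_table_names(path, name_column="seqname"):
    """Return the set of names in the score table.

    'seqname' is an assumed column name; adjust it to match your file.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[name_column] for row in reader}


def missing_from_score_table(fasta_path, score_path):
    """Names present in the FASTA file but absent from the score table."""
    return sorted(fasta_names(fasta_path) - score_table_names(score_path))
```

Calling `missing_from_score_table("final-viral-combined.fa", "final-viral-score.tsv")` returns a sorted list; an empty list means the two files agree on sequence names.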

jiarong commented 4 years ago

@hjdong, do you mean that both sequences are in final-viral-combined.fa, but you cannot find the second one in final-viral-score.tsv? If so, could you paste the sequence of >NODE_255_length_3849_cov_51.199262||full here? I need to look into it. Thanks.

wfgui commented 4 years ago

Yes. The two sequence names are >SRR9161490_NODE_224_length_4468_cov_41.505099||full and >SRR9161490_NODE_255_length_3849_cov_51.199262||full

>SRR9161490_NODE_216_length_4595_cov_145.989207||full  shape:circular||start:562||end:4021||group:NCLDV||score:1.0||hallmark:0
TGAATAATGTCTCATATGAGGCAATAAGACAACAAGTAAAAAGATACGAAGATGAACTTAACGGACATATTATTAAGCAAAATAGAACACAATTCTTAGATGATGTAGCAGTAGATATTTTAGATCAACACAGGAAAGAAAATCCTGTTGTAATTATCAATAAAGACACTGATTCTAGGTTAAAACAATTAGAAGATGAGAACAAAAATCTGTTAATTAAAGTTGCACAACAGGCAGATAAGATATCTCAATTGAATGAAGATTTAAAAAATAAAATAGAACAAATGACTTCATTATTGTTAGAAAATAATGAGAAAACACTTCTGCTAGAGCAAAAAAAAGACCAGGCAGAAGAAATAAATCAATTAAAAGAACAACTGGATCGAGAGAAAAAAGAACAGTTAAATGAGATAGAAATGTTAAAAAATGAATTAGAAAAAGAAAAAAATAAAGGTTTCTTTGCACGTTTATTTGGAAAATAAACTGACTTTATTACACACAATCAGTGTTTAAATTGAACACTGATTTTTAATTTTTCGATACATAAAAATGACAAAAATAAAAATTTTAAGCACCCATTTTTGAAAAAATTGAGCACTTAATAAATGCAAAAATGATTTTTTAGATCAACAATTTTTAATTAAAAATGGATTTTAAAATTGAACTTTTGACACAAAAAAATGGCAAAAACAAAAATTTTAAGCACTCATTTTTGAAAAAATTAGCATTTAATAAATGCAAAAATAATTTTTTTAGATCAACATTTTTAATTTAAAAATGCACTTTAAAAATGAAATTTTGACACATTAAAATGACAAAAATAAAAATTTTAAGCACCCATTTTTGAAAAATTGAGCACTAAATAAATGAAAAAATAATTTTTTAGATCAATATTTTTAATTTAAAAATGCACTTTAAAAATGAATTTTAGACACAATAAAAAAACGCAAAATTAGTAAAAATGCCATTTTCTTCAAAAACGGCGATTTTAGGTCATTTTTGGGGTTTTAAAAATAGCAAGTAAGGACTTTGTGCATACACTGGAATCCCCTAAGGACTTTGTGCATACAAAGTCCTTGCGATACCTGTGCATACAAAGTCAAAAATGTGTTGTACACACATTTTATTTTTTGAGGATTTCATCCCCAAACCCCTTCGTAAAATCAATGTCTACAGTTGTAGCAAGCAACGTCTTTGGATGTCCAAAGACAGACAATGTACACTTCAACAGTGATAAAAAAACTGTCCTTTTAGAGATACGGTTTGAAGAGAGTATAGCAATGCGTTGGGTTTTTAAGATACTAGATTATGATACAATCTTTTTGTTGCCTAGGCTCACCAACGCGAAGGAGGTGAGAACGTGGAATTTGTTTGCAACTTTATAGTTTCTGTCGGAGCAAGTGTGGTCGCCTACTACATTTGCAAGTGGCTTGACGGAAATGACTAGGCAGCAAAAGCACAAACTGAGTGGATTGACTACCCACTCTTTTTTTATGTAAAAAAAAGAAGAACCGAGTGACTAGTTCAGTTCTTCTTAACGTGGCTTTTGTTTGCAACTTTGCCTACTTGAATTATATCATGAACTTTTCAAAAGTAAACGTCCGAGACTTAGTATGCGCTAAGTACGAGGACACGTACTTATTGACTACCTTTAAATCGTATCATGTATTTTAGATCATTAGTAGACTGTGCTTTTAAATTTATACTTGTTAAAATCATTGTATAAAGAACACCTAGAGAATGAAAAAGGAGCTTTGATACTAATGGACGGTATAGTGGCTTTTATAGTTGGGACACTTATTGGAATAGCTATTCATAGCTACATAAAAAAAAGAAAGAAAAAGGAATAGGCCTAATGTCATTAAGTTAAGGACATTATATTGAGGCACCTATATTTTCATACAGGTGTCCTCTTTTTTACTCTAGCATTGTACAGAGAAGTATTTACCATAGTGTGCTAGAGTAAAGCAGTGACTGGACACCGATAAAAATCGGTGTTA
TATTTTTTAGTGAAATAGGGTCAATAACTTCATTGATAATATAACGTGACCTAGGGGCCTCTTGGCCTTTGATTCTTTTTTGTCCTTTTATGTTTGTTTTAGAGAGGTGTTCATAACAATGTTAAGTGATGTTATTAAGAAGTATGGTGGAGAGAAGATGGATCCACTGGATGTATATAAAGATATTTTTAGAATTGGAGAAGGATTCATTCAGAAGGAGTATGAAGATAGTGGAAGTTTCAAAGCAAATCCAATTGCCTATTATAAGAATGAGAACGAAGATCATGGCCATTTCAGAATTATGTTTGAAGATAAGTTTGAAGAAATCTATCGAAATGAGCTTGTGAATGCAGATTTCTGTGTAATGAATGGTTTAACTTACTTTGGTGCAAAATATACATCGGATAGAGCTTCTAAAATGTGTGCATTGATATTTGATATTGATGGTGTTACAGATAATAGTTTGAATAATTTTTTTTATGCTGCATTTAATAAAGAATTTGATTATTATCCATTACCAAATTACGTGGCTTTAAGTGGGCATGGAATACATTTATATTATGTTTTTGAAGAACCAGTACCATTGTTCCCTAATTTGAAGCTTCAATTAAAGGAATTTAAATACTCTTTAACTGAAAAAATGTGGAACAAAAATACTTCTGTTGATGAGAAAGTACAAAAACAAGGAATCAATCAGCCTTTTAGAATATTGGGTGGAAAATGTAAAAAGAATGCTCCACTGGATAGAGTGGAAGTGTATAGAGTAAATCAGCATCCAGTCAACATAGAGTATTTGAATCGTTTTGTTCCCACTAAAATTGAGATTGATGAAAAAAAATTATTCAAGGAAAGTAAATTAACACTGGATCAGGCAAAGGAAAAGTATCCGGAATGGTACGAAAATAAGGTTGTAAAGGGTATAAGAAGCTATTGGACAGTAAAACGTGATCTATACGACTGGTGGATCCAACAAATAAAAAAAGAAGAAAATGGAGCCAGTTATGGCCACAGATATTTTTGTATTATGACATTGGTGATTTATGGCATAAAATGTGGTTTATCTAAAGATGAGATAGAACAGGATGCAATTGATTTGATACCGTTTCTAAACGGTTTAAATGAAAAAGAACCATTTACAGAGGAAGATATTAAATCAGCTTTAGAGTGTTATGATGAACGATACAATACTTTTCCTTTAAAAGATATTGAGAAATTAACGAATATTCGAATCGAAAGAAATAAACGTAATGGTCGAAAACAAGATCAACATATAAAAATTATGAATGCGATTCGTGATATTGAACATCCAAATGGCTCATGGATTAATAAAGAAGGAGCTCCAACGAAGCAATCAATAGTTCAAAAATGGAAATTAGAAAATCCTGAAGGAACAAAATATCAATGCGTTAAAGATACAGGTTTATCAAAAAACACAGTGAAAAAATGGTGGAACAATTAAC
>SRR9161490_NODE_224_length_4468_cov_41.505099||full  shape:circular||start:97||end:2840||group:NCLDV||score:0.94||hallmark:0
TGAAATACACAATAACAGGCTTTTCAGAAAAATTCGATATTCCTAAGGATGAAGTAATTAAAAACCTAAATACGACTTATAAATCGTATGTAACGAAAGAAAGAGGAATAACTTATATTGATGAACAGGCGGCGCGGCAGAAACAGGAAGAAACAAAAGTAGAAGAAACTGTAAGCACTATAAGCGAAGAACAGAAAGAACTAAACAATAATAATGCGCTCATAGATGGATATAAGGCGCAAATAAGTGAGCTAAAGCAGGAATTAGCGAAGGAAAGAGAGAAAAACAGCGAAACAGAAGCAAAGCTATTAGAAATGATGGATAAGGTTATAAAACTAACAGAGAATACACAGATTCTAATGGCGCAGATTCAAAGCCAGCACCAGCTTTTAATAGAGAACAACAAAAAGAAGCGAACTATAAGAGAAGTGTTTAGCGACTTTATAAAAAAAGAGAAGCCGTAATTTACGGCTTCTTTTTCTTACCAGTCTTTAAGTTTATCAAAGATGCTGGGCTTTTTCGGTGCTTTTTCAACGCTCAAATCAATCTCATCCTTGTATTCTTCAAGGAAATGTTTTGTAGCTGTCATAACGCTTTTACGCATTTTAACCGTTCCATCTTCATTTGTTTCATAAACTATATTTCCGTTTTTATCTGTTACTGGCTGTTTATTTTCATCTTTCTTTTCAATAGGTTTAATTGCGGCATCCTTAAACTTCTTCTTATCTTCTTTTGTGTTATGATAGGTTTCAATATAATCTACCATATTCGCAATAGTAAGAGCTTTTCGCGCTTGTGCTTTATAGATATTCAAATTCTCAGAAGTGGAATTATTATTGATATACTCCATCCACTTATCTACGGGGTTATCATCGAAGAACTTTTTAATTTTTTCTTTTTGCTCTTTTGTCATTTTGCTTTTTTCATTGCTCATAGCTTTTAGCCCCCTATGAATTTATTACTAAGATTATAGCAGTATAAGGGGAGATAGTCAAGCTGTTTGTAAAGTGTTTATATTTCCTTGAATTAAGGACAGCTTCAAATCTGCGTTTATAAATCTCAGAAAATCACTTTACGAATATGTTCGTAAATTAAATAATATTTCTTGACACTTTTTTATTTGCAAGCTATAATAGTAACCGTAAGAGAGAGTAACGAAAACACAACGATTTGAAAGGAACTATAAGTAATGTCTAAGATAGCATATTTGAGGGTAAGCACCACACATCAGAACACAGCGCGGCAGGAATACGCAATGCCAGCTGATATTGATAAGGTGTTTGAAGATAAGGCGAGCGGGAAGGACACAGAGCGCCCAGAGTTTAAGAAGATGCTCGATTATGTGCGCGAAGGTGATATAGTCTACTTTGAGAGCTTTTCCCGCATAAGCCGCAGTTTGCCCGATTTACTCAATACTCTTGATTATTTCACGCAAAAGGGCGTTTCCTTTGTGTCGCTGAAAGAGAACATCGACACGACGGGAGCAACGGGAAAGCTTATTGTGTCGGTGCTGGGTGCTATAAGTGCCTATGAGCGGGAAATAAACGCAGAACGGCGGGAATATGGCTACCGCAAAGCCCTTAACGAAGGGAAGGTAGGACGACCCAAAGCCGAAGTAAGCGACAAACTAAGAGAAGCAGTAAAACGCTGGCGTGCGGGAGAGATTACAGCGACCGAAGCAATGAGAATCAGCGGCACAACGCGAACAACGTTCTACAAGCTGGTGAAGAAAGAGGGGCTATAACCCCTCTTTTTTTGATTCAGCAGAGTTTAGAGAGACAATAAACCCGCCTTTTTTCTCCCTCAAATCTCGGCTGGAGGATTCCTCCAGCCACGCTAAGTAGCCACTGGGGGGCTATATTCGCGGAAAATCAAAAAAAGCAAGGCTATAATGGGGTTTGCGTGGTTCAACATCCCGCTAAATTATGCTTTTTTTAGTTAATAAATTTTAGGACAGAGCAGGATAAAGCATATAGAAGAAAATTGAAAGAAAAATGGTC
GGGAATAATATACAGTAGTAATATCTAAGGGGAGTAATGAGAAATGAAAAGAACAGATAACTATACAGTAGTATCATTCAGAGTAGAAGAAGAATTAGCAGAGCAATTAAAGGCAGAAGCAAAGCGCAGGTATATGTCAGCGTCAGCTTATATAAGAAAACTATTAGTTTATGATTTGAAAGGGGAAAATAAGTAATGATAAGAAATAGCTTAGAAAATTTAATAAGTGAAGAAACAAAAAAAAGCCCAACATATTAGCTATACAAATAGCATAGAAATATGGTTGATAGCTTTTATCAAGAATATAAGCACCAAGCAGAACTAAAGCAAATGGAAGAAAATATTTATAATAGACTTATAAAAGATATAAATATTGAAATCGTAGATAAAGCAACGCCAGCTATTAAAGAATTAGATAAGCAAATAAGGGATATTTTCAAAAAATGAACGCTGGAGAAGTAGAAGCACAGAACTATTTGCTTAAAAATGGTTGGAAAGTAAAGAATCTAACGGCGTGTAAGGATTTTTTTAGTAAAGATATAGATTTCCTAATAGAGAGAGATTAGGAAAGATTTTATATAGAAGTCAAATGGGACACTAAAATTAAACATACTGGTAATATGTTTATAGAAGTTAGCGCAGATATAGAAAACAACAAAGACGGCTGGTATAATTATTGTGAAGCAGACTTTATTTTCTATGGAGATGCTTTGAATAAATTGTTTTATGTATTTAGATTATAGG
>SRR9161490_NODE_255_length_3849_cov_51.199262||full  shape:circular||start:3223||end:3600||group:dsDNAphage||score:0.887||hallmark:0
TAGTCCTCGTCTTCTATATCTAACATTTCCTCTATTTGAGTTATACGTTTTTCTAAGTCTAAAAGTTGTATTGCTTTGATATTTGCATTACTACAACTTAGTATTGCTTTTCCTTGTGAAGGTGTTACTGTTCCCTCTCTTATTTCCTTAGCAAGTTTCTCATTAGTTGCTAGTATCTGTTTTAAGTTTCTTATTGCTATGTAATCTGAATCTTTTACGAACTCCATTTCTGCCACCTCCTAACCTGTTTCATGTATTAGCTTTCTTATTAAAGCACTTTTGGTCATGTTATATTTATTTGCCAACTGTTCTAGTTTCTCTATATCTTTTTTGGATACTCTCACTTCTAGTCTTTTATCTTTTATATCTTTTTTCATA
>SRR9161490_NODE_293_length_3437_cov_5738.632170||full  shape:circular||start:747||end:3239||group:lavidaviridae||score:0.98||hallmark:0
TGGCACAAGCATCAATACACTTCGAGCCCGTCAAGGGCGGCAGCGAGGAACACAACAGACGTTTGAAGTTCCTCGATTACGTGCGACCTGATCGCACGCACCTCAACGACTATTGGGAGAGCGGAACGCAAAGCGATCGCCTTGCAAACATCACCCAAAATTTCCTCGAACATCACCCAACTCGCAAGAAGCTTCACGCAAAAGCAACCCCCATCAGGGAGGCAGTCGTGAACATCACCGAAGAAACCACGATGACCGACCTCTTGCGTTTGGGGTCACGGCTTAATGAACGTTTTGGCATAAGCATCTTCCAAATTGCCATTCACAAGGATGAGGGGTATTTCGGTTCAGATACCGACAAACTGAACCTTCATGCCCACCTCGTAGCAGATTGGACGAACCCAAGCAATGGCGAATCTATCAAACTCAATCGGCAAGACATGGCAGAGATGCAGACCATCACCGCAGAGGTTCTTGGGATGCAGCGAGGTGTTTCTTCTGATAAGAAGCATCTCACAGCTATGCAGTATAAAGAGCAGAAAGCACGTGAGGAAGCGGAGAAAGCAAAGCAAGAACAACTCAAAGCGGAATCCGCCCAGAGAGTTGCCGAGCGCAAAGCTGCTGAAGCCATGGAGAAGAAGAAAACGGCAGAAGCGGCAGCGGTGAGCGGCTTAGTTGTCGAAAGTACTAAGAAACTCGGCAATCTGCTCGGCTTTGGCAAGGAAGCAAAGGCACTGAAGGAACTACCTGCACAACTGGATGCCGCAAAAGCTGAAGGACGAGCGGAAGCGGTCGAAGAGGTTCTGAAGGGAGCAGGCATGAAGTACAACGATATGTCGAAAGTAACCCCCGAGAAGGTCGGAAAAGACTTGATGAACATAGTTCACAAGAATGCAGAAGCCGCACAAGAGGACACGAAGAAACTTAGAATCATACAGAATATGACAGAAGGAAATTACACTTATGATGCAGCTGCAAAGCTCATCAAAGAGAAATATGCCGATATGGCATACTACAAAGAAGCTTTTGGGTACGCAGGTAGTGCAGATGCATTCGACTTCAAAAAAGAAGTGTTCAACCCGCTCTGTGAGCGTCAGGGAGCGCACAACACTTCTAGTGATGAACACGCAATCGGTCGCAGAGAAATATGCGCACAGGGCATTGTATGCGCCTGCATCCGCTTCTTTGATTCTTTCAAACTCGATAAGATAGCAAAGACCCTTAAAGCGATGGCACGAGATTTCAGTCTCGCCGATTGGCGAAAGCAACAAGAATACTCTCGCCAACTCCAGGAGCAGAACCAGGAACGGAAGAACCAGGAGCAGAAGCAGGGAAGAGGATGGACTTTCAGAAGATAGCAACAAAAAGAGCACTCGAGGTTATCCCTGAGTGCTCTTTTTGTTGAATAGTTCTAACTCTTTTCCCTTCATCTTCTCGTGGTTGATGATTTCCATGAGGTGGTCTGGTGCTACCATTTCCGTTAGAATCTCTCCTGAATCTACATCATAATAGCCAAACTTATCTTCTATCTCTTTCCATGAGAATGTTTTCCAAGACATATCTTGCAGGAAAATAGCTGCTTCTTCTTCCGTCTCTTCTTCGTTGACTTCTCCGCACTTCCAACTTTCGTCCGAGCCGATGCGTTTATATACCTGCCTTCCGCTCATCGCTTCAAATATCACGTTCATGTGCTTTGCATCAAAGAAGCGTTTTCCGTTGCTGTCCTTTGCAACCAACTTCGTGAAGTACTTGAATATCTCAAGATATGCTTTTTTGTTCGTCACTTCCTGATAGTTTTGGGCTGCAGGTGACGACTCTGGGTTTATCTTGAGCCACTTTTCTATAATCTTCAAGGCTGCTTCCTTATTGTTCACTAGTATGTGAAAGTGCGGATGGTACGTGTCGTACCACTCATCTTTCTCAGATGTACGTTCGAATCTCTGTACCTTGTACCACTTTCCGTGTTCTTGCTTCCACTCAGGTGGCAGCTTCTTCAAAAC
ATATCTGTTTGCGTGATAGGTACACTCCAGTTTCTTGATGCCTATCATATTCTCGAGCACTGTTGCACGGCGAAACCACTTTGAACTTTTGATTAATTGCCACTTCTTGTTATAAGTTGCAATCTCTTCAGGAAGCTGCTCTGCTCGAACGTTCGGACGTGTCAGGGTGATGAAATATAACTCTTTCTCATCCTGCAATCGAGGTGCATACGCATTAATAAGCGTACCCATCCTGATGCGCTGACACTGCGGACACCACCTATTTTTGCAGTACTTTGCAGTAATTCTTCCGTTGCCTTGATACAACTTTTCACAGCAGTGGAAGGAGTTCTGATATCTAGTTCTAAGACTAGAATCAGGGTTCTGATAGTACAACATACTAGCGAGGTGATAGCCAAACCACCTGTATTTGTTCTTTTTTCGCAATGCGATAGTGACTTTTAACGCAGAATCTTTTGCGTTCTCGGAATTTTTATCTAACTTTGCACCCATG

final-viral-boundary.txt final-viral-combined.fa.txt final-viral-score.txt

Thanks!

jiarong commented 4 years ago

@hjdong, can you send the original sequence of >SRR9161490_NODE_255_length_3849_cov_51.199262 from the VirSorter2 input file?

>SRR9161490_NODE_255_length_3849_cov_51.199262||full in the output is very short (< 400 bp), so there is probably a bug in how such short sequences are handled. For practical purposes, you can simply remove such short sequences (shorter than 1 kb, or a larger cutoff). These short sequences are generally not reliable unless they have a hallmark gene.
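The length rule above can be sketched in a few lines of Python. This is an illustrative sketch, not VirSorter2 code: it assumes the output headers follow the `name||full  key:value||key:value` layout shown in this thread, and the function names and the 1000 bp default are choices made for the example.

```python
def parse_header(header):
    """Split a FASTA header into (name, fields).

    Example: '>x||full  shape:circular||score:0.887||hallmark:0'
    gives ('x||full', {'shape': 'circular', 'score': '0.887', 'hallmark': '0'}).
    """
    parts = header.lstrip(">").strip().split(None, 1)
    name = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    fields = {}
    for item in rest.split("||"):
        key, sep, value = item.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return name, fields


def read_fasta(path):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_record(header, sequence, min_length=1000):
    """Keep a record if it is long enough or carries at least one hallmark gene."""
    _, fields = parse_header(header)
    has_hallmark = int(fields.get("hallmark", "0")) > 0
    return len(sequence) >= min_length or has_hallmark
```

The length is measured on the sequence actually written to the output, not on the `length_` value embedded in the contig name, because the reported viral region can be much shorter than the original contig.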

wfgui commented 4 years ago

original sequence

Here is the original input sequence file for VirSorter2 (version 2.0.beta). Contigs shorter than 1 kb were discarded. My command:

virsorter run -w SRR9161490 -i SRR9161490.contigs.1k.fa.txt -j 12 -d vir2

SRR9161490.contigs.1k.fa.txt

jiarong commented 4 years ago

@hjdong, I have fixed the issue. Now the sequence names in final-viral-combined.fa and final-viral-score.tsv should be the same. I also added a few extra columns, so you can filter the score table by score, sequence length, hallmark gene count, viral gene %, and cellular gene %.
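As a rough illustration of filtering on those columns, here is a minimal Python sketch. The column names `seqname`, `max_score`, `length`, and `hallmark` are assumptions about the table layout and should be checked against the header row of your own final-viral-score.tsv; the thresholds are examples only, with 0.75 taken from the recommendation earlier in this thread.

```python
import csv


def filter_scores(path, min_score=0.75, min_length=1000, min_hallmark=0):
    """Return names of rows that pass all three thresholds.

    Column names are assumed, not confirmed by this thread:
    'seqname', 'max_score', 'length', 'hallmark'.
    """
    kept = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if (
                float(row["max_score"]) >= min_score
                and float(row["length"]) >= min_length
                and float(row["hallmark"]) >= min_hallmark
            ):
                kept.append(row["seqname"])
    return kept
```

Raising `min_hallmark` to 1 keeps only sequences with at least one hallmark gene, which corresponds to the strictest option discussed above.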

Thanks for providing the data for me to reproduce the issue.

msevi commented 4 years ago

Hi @jiarong, I downloaded the conda version, and that discrepancy doesn't seem to be solved.

Regards, Maria

Jiulong-Zhao commented 3 years ago

Right, filtering is recommended. The current default cutoff (0.5) was set based on simulated virome data from known genomes. With real metagenomic data, which contain more unknown sequences, the default cutoff tends to give high sensitivity at the cost of more false positives. Shorter sequences also tend to have a higher false positive rate, and a higher proportion of cellular sequences can increase it as well. So no single cutoff fits all cases, which is why we leave the choice to users.

From my experience so far, I recommend >0.75. There are a few options for filtering:

  • Filter by score using the final-viral-score.tsv table.
  • Use the --min-score option.
  • Use the --hallmark-required-on-short option, which requires sequences shorter than 5K to have a hallmark gene.
  • For the highest confidence, pick only sequences with hallmark genes, using the hallmark field in the FASTA header in final-viral-combined.fa.

Hi @jiarong, you mentioned that the "--hallmark-required-on-short" option requires sequences shorter than 5K to have a hallmark gene, but the --help output says "--hallmark-required-on-short require hallmark gene on short seqs (length cutoff as "short" were set by "MIN_SIZE_ALLOWED_WO_HALLMARK_GENE" in template-config.yaml file, default 3kbp); this can reduce false positives at reasonable cost of sensitivity [default: False]". So how can I change the length cutoff used by --hallmark-required-on-short? Is it something like "--hallmark-required-on-short 5000", or not? Also, I couldn't find the template-config.yaml file. How should I write the command line?

Thank you so much! Looking forward to your reply!

jiarong commented 3 years ago

@Jiulong-Zhao, the default setting of MIN_SIZE_ALLOWED_WO_HALLMARK_GENE has changed. You can modify it for each viral group in template-config.yaml. If you installed the development version, the file should be in the VirSorter2/virsorter directory. If you installed via bioconda, it is trickier to locate: you can update to the newest version and run virsorter config --show-source to track it down.

Jiulong-Zhao commented 3 years ago

@jiarong Thank you for your answer! I have found the template-config.yaml file in that directory! Thanks again!