omics-lab / VirusTaxo

Questions and suggestions about the program. #2

Open SergeyBaikal opened 1 year ago

SergeyBaikal commented 1 year ago
  1. I tested my own dataset with the RNA model, and the program assigned a genus even to bacterial contigs. It would be great to have an entropy cutoff setting to skip false positives (for example, less than 0.5). The command I ran:
     python3 predict.py --model_path /home/sergey/VirusTaxo/Dataset/vt_db_rna_virus_kmer_17.pkl --seq /home/VirusTaxo/My_Data/contigs.fasta > /home/VirusTaxo/My_Data/Results.txt
  2. Why not include the complete taxonomic lineage in the output file?
  3. Out of 15,000 sequences, I got a correct assignment for only one contig, which had an entropy value of -3.15E-12. Where the entropy was 0, the taxonomy assignment was not correct.

Dear authors, could you please clarify what I'm doing wrong?

Rashedul commented 3 days ago

Dear Sergey,

Thank you for testing the tool and sharing your valuable feedback! I’d like to address your observations and questions:

The tool employs a k-mer matching strategy, meaning that any random overlap of k-mers between the query sequence and the database can lead to a genus assignment, even when the query does not belong to the expected taxonomy (e.g., RNA viruses). To mitigate this, we’ve introduced a new metric called the "Enrichment Score," which helps reduce the likelihood of random k-mer matches affecting the predictions.
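
To illustrate why chance k-mer overlap can produce a genus hit, here is a toy sketch of k-mer matching. It is not VirusTaxo's actual implementation: the function names, the reference sequences, and the choice of k=4 are all made up for illustration (the RNA model file name above suggests k=17 in practice).

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of genera whose reference contains it."""
    index = {}
    for genus, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(genus)
    return index


def genus_hits(query, index, k):
    """Count, per genus, how many distinct query k-mers occur in the index."""
    hits = Counter()
    for km in kmers(query, k):
        for genus in index.get(km, ()):
            hits[genus] += 1
    return hits


# Hypothetical toy references; real databases hold many genomes per genus.
references = {
    "GenusA": "ACGTACGTTGCA",
    "GenusB": "TTTTGGGGCCCC",
}
index = build_index(references, k=4)

# An unrelated query that happens to share the single 4-mer "GGGG" with GenusB.
hits = genus_hits("GGGGAAAAAAAA", index, k=4)
print(hits.most_common())
```

In this toy case one shared k-mer is enough for GenusB to collect a hit, which is the kind of spurious match that a score based on how many k-mers agree is meant to down-weight.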

Additionally, this model is specifically designed for predicting viral sequences. Applying it to non-viral sequences may result in incorrect taxonomic assignments. To provide further clarity, we’ve included a new section in the README titled "Method Limitations and Interpretation" to elaborate on these points.

In future updates, we will add the full taxonomic lineage (e.g., order, family, genus, species) to the output file, and will provide arguments to choose cutoffs for both Entropy and Enrichment_Score.
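
Until such arguments exist, the results file could be filtered after the fact. The following is a minimal sketch, assuming the output is tab-separated with a header row containing columns named `Entropy` and `Enrichment_Score`. Please check the actual output format first. The `Id` and `Genus` columns, the sample rows, and the threshold values are hypothetical, and the direction of the cutoffs (lower entropy and higher enrichment kept) is also an assumption to verify against the README.

```python
import csv
import io


def filter_predictions(lines, max_entropy=None, min_enrichment=None,
                       entropy_col="Entropy",
                       enrichment_col="Enrichment_Score"):
    """Return the rows that pass the given cutoffs.

    `lines` is any iterable of text lines (an open file works).
    Column names are assumptions about the output format.
    """
    kept = []
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            entropy = float(row[entropy_col])
            enrichment = float(row[enrichment_col])
        except (KeyError, TypeError, ValueError):
            # Skip rows without usable numeric scores.
            continue
        if max_entropy is not None and entropy > max_entropy:
            continue
        if min_enrichment is not None and enrichment < min_enrichment:
            continue
        kept.append(row)
    return kept


# Made-up example rows, only to show the mechanics.
sample = io.StringIO(
    "Id\tGenus\tEntropy\tEnrichment_Score\n"
    "contig_1\tGenusA\t0.1\t0.9\n"
    "contig_2\tGenusB\t0.8\t0.9\n"
    "contig_3\tGenusC\t0.1\t0.05\n"
)
kept = filter_predictions(sample, max_entropy=0.5, min_enrichment=0.5)
for row in kept:
    print(row["Id"], row["Genus"])
```

To apply it to a real file, pass an open file handle instead of `sample`, for example `with open("Results.txt", newline="") as fh: kept = filter_predictions(fh, max_entropy=0.5)`.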

Please let us know if you have further questions!