rcedgar / palm_annot

Scripts, HMMs and search databases for identifying and classifying viral RdRp sequences
6 stars 2 forks source link

output report #1

Open 2110537 opened 6 months ago

2110537 commented 6 months ago

hi rcedgar,

Thanks for your powerful software.

I have a batch of aa seqs and RdRp already verfied. I try to use palm_annot and palmDB by blastp to annotation the seqs. I also used your palmscan, the output reports of palm_annot and palmscan are similar, and I want to know in what score level we can classify seqs to pssm_group1 in confidence, and when should I do futher noval virus check. I also need your suggestions about pssm_diff2 score as well.

Classification reported by plam_annot and palmscan are on famlily level. Maybe a higher taxonomy resolution can be got in palmdb?

Thanks!

rcedgar commented 6 months ago

I'm don't understand your question, but it sounds like you want to do taxonomic classification from RdRp. If so, then neither palm_annot nor palmscan are designed for this purpose. For ranks of family and below, you should make MSAs and trees. At order and above, it's somewhere between very difficult and impossible, especially for phylum which I believe is meaningless/uninformative. See this recent paper for some discussion of the issues: https://academic.oup.com/ve/article/9/2/vead063/7382111.

2110537 commented 6 months ago

yes. I want to do taxonomic classification from RdRp, to know which kind of RNA virus in my sample and to find out if there is any new viral candidates. Should I use the palmdb for this purpose? Thank you!

rcedgar commented 6 months ago

To classify RdRp as known / novel at species rank, you need a database of known RdRps, then you can cluster your RdRps with the known at 90% aa identity. Such clusters are sometimes called "vOTUs", these are generally considered reasonable approximation to species. If a vOTU has no known RdRps, then it is reasonable to say that the vOTU is novel. I would argue that the best way to do this is to use palmprints or palmcores, because these are globally alignable segments. If you use contigs or, say, local alignments found using blast, then you may have fragments which overlap partially or not at all, which will exaggerate the number of vOTUs compared to palmprints. However, some published studies cluster fragments (e.g. https://www.science.org/doi/abs/10.1126/science.abm5847).

Getting a database of known RdRps is hard because GenBank does not have consistent annotations (e.g. you cannot do a search for "Rdp-dependent RNA polymerase" and find all known viral RdRp), and GenBank only has a small fraction of RdRps from recent studies. The best source I know of is the latest palmdb, which is so far unpublished: https://drive5.com/palmdb/. This was built in April 2023, so does not include any studies published after that date, and may be incomplete due to datasets we were unaware of.

For genus and family rank, you need to place an RdRp in a phylogenetic tree. If an RdRp falls inside a subtree belonging to a single genus, then you can assign the RdRp to that genus; similarly for family. If an RdRp falls outside all known genera, then it is a candidate for being assigned a novel genus, but it is also possible to extend a known genus to include this species, so this is ambiguous. Same for higher ranks of course.

If the RdRp falls outside a known family, assignment to order is very hard, but perhaps possible if you can make a high-quality MSA and show that the tree is robust by ensemble bootstrap (https://www.nature.com/articles/s41467-022-34630-w). For class and phylum, assignment is somewhere between a very challenging research problem and impossible (for what it's worth, I believe class and phylum are undefined, impossible to classify in practice, and meaningless for all practical purposes).

2110537 commented 6 months ago

understand, thanks for explaining. now I try to trimm the palmcore in my seqs by palm_annot, i get 1369 palmcore seqs from 1401 seqs summaried as below. The RdRp of input 1401 seqs has identified but only 1369 palmcore seqs generated. Does this mean the excluded 32 seqs do not have a RdRp? Thanks a lot.

1401 Sequences 1369 RdRp score == 100 (97.7%) 0 RdRp score >= 90 (0.0%) 10 RdRp score >= 50 (0.7%) 0 RdRp score > 0 (0.0%) 22 RdRp score = 0 (1.6%)

1401 of 1401 (100%) palm sequences reported

rcedgar commented 6 months ago

32 seqs are predicted not have a RdRp, of course no algorithm is perfect. Note they are all (1401 / 1401) predicted to have a palm domain. The hardest step in classification is to discriminate close cellular homologs, in particular reverse transcriptases (RTs) from RdRp. To check the prediction, you could take the 32 sequences and try BLAST, PFAM etc.

2110537 commented 6 months ago

ok! Thank you so much.