rcedgar / palmscan

GNU Affero General Public License v3.0
how to use palmscan #2

Open 2110537 opened 12 months ago

2110537 commented 12 months ago

Hello! I am a very new user of palmscan. I downloaded the source files and used g++ to compile the palmscan2_main.cpp file, but I got the following error messages:

/usr/bin/ld: /tmp/ccwgGtDf.o: warning: relocation against `optset_cluster_ppm' in read-only section `.text'
/usr/bin/ld: /tmp/ccwgGtDf.o: in function `DiePtr(char const*, unsigned int)':
palmscan2_main.cpp:(.text+0x4b): undefined reference to `Log(char const*, ...)'

Could you please give me some information on how to run it? Any response is welcome!

Thanks

rcedgar commented 12 months ago

You should build using the Makefile; it will build a binary named palmscan2 in the directory ../bin. If you run the binary with no options, it will print a brief usage message. Note that the usage is incomplete; for example, it does not explain how to run old-style PSSMs. If you let me know more about what you want to do with palmscan2, I can provide more details.

2110537 commented 12 months ago

Thanks for your kind reply! I have run palmscan successfully following your instructions. After an HMMscan search against the RdRp_HMM database, I want to validate the RdRp-like candidates with palmscan by detecting the A, B, and C motifs of RNA viruses. I would appreciate any advice you can give. Thanks a lot!

rcedgar commented 12 months ago

If you run palmscan2 with no arguments, it will print a brief usage message explaining how to run a PSSM search. Distinguishing viral RdRp from close homologs is very tricky, because non-viral homologs can be very similar, especially Group II introns. I suggest you look at the methods in this paper, especially the supplementary notes: https://doi.org/10.1093/ve/vead063. The PSSM-based algorithm in palmscan is good for finding motifs but not so good at classifying viral RdRp vs. other genes, because the false-positive rate (FP) is quite high; you need an HMM E-value or some other check that the sequence is viral RdRp. Also, note that HMMs can find local alignments which lack some or all of the palmprint; these will be missed by the PSSM-based methods in palmscan. The HMMs in palm_annot are similar to the RdRp-scan HMMs, with the additional feature that match states for motifs A, B and C are annotated, so you get the advantages of HMMs (more sensitive, fewer FPs) combined with the ability to trim hits to a globally-alignable segment for clustering into OTUs.
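The advice above (use the PSSM hit to locate the motifs, but require independent HMM evidence before calling a sequence viral RdRp) can be sketched in Python as follows. This is a minimal illustration and not part of palmscan: the function name `confirmed_hits`, the `hmm_evalues` mapping (query label to best E-value from a separate HMM search), and both cutoffs are assumptions made for the example. It also assumes the `-tsv` report is tab-separated with a header row containing `Label` and `Score`.

```python
import csv
import io


def confirmed_hits(palmscan_tsv, hmm_evalues, min_score=20.0, max_evalue=1e-5):
    """Return labels that have a palmscan hit and independent HMM support.

    palmscan_tsv: text of a palmscan2 -tsv report (assumed tab-separated,
                  with a header row that includes Label and Score).
    hmm_evalues:  hypothetical dict mapping query label -> best HMM E-value,
                  produced by a separate HMM search (not by palmscan).
    Both cutoffs are illustrative assumptions, not recommended values.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(palmscan_tsv), delimiter="\t"):
        label = row["Label"]
        evalue = hmm_evalues.get(label)
        if evalue is None:
            continue  # no HMM evidence at all: do not call it viral RdRp
        if float(row["Score"]) >= min_score and evalue <= max_evalue:
            kept.append(label)
    return kept


# Toy input with made-up values; only the columns this sketch reads.
report = "Label\tScore\tABC\nseq1\t82.6\tABC\nseq2\t61.9\tABC\nseq3\t12.0\tABC\n"
evalues = {"seq1": 1e-30, "seq3": 1e-40}  # seq2 has no HMM hit
print(confirmed_hits(report, evalues))
```

Here `seq2` is dropped because it has no HMM support and `seq3` because its PSSM score is below the cutoff, so a sequence is kept only when both lines of evidence agree.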

mihinduk commented 11 months ago

Hi, I have run palmscan using this command:

palmscan2 -search_pssms darkmatterProteins.fasta -tsv darkmatterProteins2_hits.tsv

Could you please help me understand how to interpret the output? I understand that high-quality hits have a score >= 20, but I am unclear about the other columns:

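| Label | Score | Group | Group2 | Diff2 | ABC | QL | Lo | Hi | PPL | Suff | PosA | SeqA | PosB | SeqB | PosC | SeqC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| contig_113079_1 | 82.6 | Tymovirales | Kitrino | 26.7 | ABC | 220 | 6 | 100 | 95 | 120 | 6 | TESDYEAFDASQ | 62 | SGEASTFLFNTMAN | 93 | FAGDDMCA |
| contig_219610_1 | 61.9 | Picornavirales | Duplorna | 5.7 | ABC | 343 | 165 | 267 | 103 | 76 | 165 | FAFDYTGYDASL | 223 | SGCSGTSIFNSMIN | 260 | AYGDDVIA |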

  1. What is the difference between Group and Group2?
  2. Is there a list of all groups somewhere?
  3. Is there an explanation of how to interpret the Diff2, QL, Lo, Hi, PPL and Suff columns?

Thank you very much for your help,
Kathie Mihindukulasuriya

rcedgar commented 11 months ago

Group is the PSSM group with the highest score. Group2 is the PSSM group with the second-highest score. Diff2 is the difference in score between Group (top hit) and Group2 (second hit). If Diff2 is large, this suggests that the query has the same taxonomy as the top PSSM group. If Diff2 is close to zero, this suggests that the taxonomy is ambiguous; the query might actually be closer to Group2 in a tree. Of course, these are rough heuristics only, but they can be handy when reading large reports. The other columns are:

  * QL = query length
  * Lo = palmprint start
  * Hi = palmprint end
  * PPL = palmprint length
  * Suff = number of query letters after the palmprint
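As a rough illustration of these columns, the Python sketch below parses a report and derives a few checks. It assumes the `-tsv` output is tab-separated with the header row posted earlier in this thread; the helper name `summarize` and the `min_diff2` cutoff are hypothetical. The relations `PPL = Hi - Lo + 1` and `Suff = QL - Hi` are not stated by the author; they are inferred from the two example rows in this thread, so treat them as an observation rather than a specification.

```python
import csv
import io

# The two example rows posted in this thread, re-typed as tab-separated text.
REPORT = "\n".join("\t".join(line.split()) for line in [
    "Label Score Group Group2 Diff2 ABC QL Lo Hi PPL Suff "
    "PosA SeqA PosB SeqB PosC SeqC",
    "contig_113079_1 82.6 Tymovirales Kitrino 26.7 ABC 220 6 100 95 120 "
    "6 TESDYEAFDASQ 62 SGEASTFLFNTMAN 93 FAGDDMCA",
    "contig_219610_1 61.9 Picornavirales Duplorna 5.7 ABC 343 165 267 103 76 "
    "165 FAFDYTGYDASL 223 SGCSGTSIFNSMIN 260 AYGDDVIA",
])


def summarize(report_text, min_diff2=10.0):
    """Summarize each hit; min_diff2 is an arbitrary illustrative cutoff."""
    out = []
    for row in csv.DictReader(io.StringIO(report_text), delimiter="\t"):
        ql, lo, hi = int(row["QL"]), int(row["Lo"]), int(row["Hi"])
        out.append({
            "label": row["Label"],
            "top_group": row["Group"],
            # Small Diff2: top and second PSSM groups score similarly.
            "ambiguous": float(row["Diff2"]) < min_diff2,
            # Inferred relations (assumption): coordinates look 1-based, inclusive.
            "ppl_consistent": int(row["PPL"]) == hi - lo + 1,
            "suff_consistent": int(row["Suff"]) == ql - hi,
        })
    return out


for item in summarize(REPORT):
    print(item)
```

With the hypothetical cutoff of 10, the first hit (Diff2 26.7) is treated as a confident Tymovirales assignment, while the second (Diff2 5.7) is flagged as ambiguous between Picornavirales and Duplorna.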

To get a list of groups: grep group palmprint_pssms.cpp

mihinduk commented 11 months ago

Thank you so much!

rcedgar commented 11 months ago

you're welcome -- my bad for producing such sparse documentation :-)

2110537 commented 11 months ago

Hi rcedgar,

Thank you for your kind reply. It is a great help to me.