padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes
MIT License

Use the GA score when available #41

Open jpjarnoux opened 2 months ago

jpjarnoux commented 2 months ago

Hi, I'm comparing results with DefenseFinder, and for some systems (for example, SoFic) you are not using the GA score. If I understand correctly, this HMM and system definition come from the DefenseFinder models. In this case, my predictions differ from DefenseFinder's, and I see the same thing for other HMMs and systems taken from DefenseFinder. So it could be good to launch hmmsearch twice and concatenate the results. Or maybe I am missing something and the results are expected to be different. Thanks

leightonpayne commented 1 month ago

Hey Jérôme,

Apologies for the delayed reply. For now, this is the expected behaviour. We are having ongoing discussions about converting DefenseFinder gathering thresholds (GAs) to E-values (for compatibility with PADLOC), or incorporating GA into PADLOC's system identification, but I'm not promising it will be implemented any time soon.

If you wanted a more direct comparison with results from DefenseFinder you could do some post hoc filtering of the .domtblout files generated by PADLOC (e.g. to drop any hits with bitscore below the GA scores that were pre-assigned in the HMM file) then delete the _padloc.csv files and re-run PADLOC. It will redo the system identification step by reading in the edited .domtblout files that already exist.
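
A minimal Python sketch of that post hoc filtering, for illustration only. It is not part of PADLOC. It assumes the `.domtblout` files follow the standard HMMER layout for hmmsearch (HMM name in the 4th column, full-sequence bit score in the 8th, per-domain bit score in the 14th), and that the HMM file carries `NAME` and `GA` lines in HMMER3 format. The function names `parse_ga_thresholds` and `filter_domtblout` are hypothetical, and HMMs without a GA line are left unfiltered.

```python
from typing import Dict, Iterable, List, Tuple


def parse_ga_thresholds(hmm_lines: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Map each HMM NAME to its (sequence GA, domain GA) bit score pair."""
    thresholds: Dict[str, Tuple[float, float]] = {}
    name = None
    for line in hmm_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME" and len(fields) > 1:
            name = fields[1]
        elif fields[0] == "GA" and name is not None and len(fields) >= 3:
            # Some HMM files write the GA line with a trailing semicolon.
            seq_ga = float(fields[1].rstrip(";"))
            dom_ga = float(fields[2].rstrip(";"))
            thresholds[name] = (seq_ga, dom_ga)
        elif fields[0] == "//":
            # End of one HMM record.
            name = None
    return thresholds


def filter_domtblout(
    lines: Iterable[str], thresholds: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Drop hits scoring below the GA thresholds of their HMM."""
    kept: List[str] = []
    for line in lines:
        # Keep comment and blank lines so the file layout is preserved.
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        fields = line.split()
        hmm_name = fields[3]           # query name (assumed to be the HMM)
        seq_score = float(fields[7])   # full-sequence bit score
        dom_score = float(fields[13])  # per-domain bit score
        ga = thresholds.get(hmm_name)
        # Assumption: HMMs with no GA line are passed through unchanged.
        if ga is None or (seq_score >= ga[0] and dom_score >= ga[1]):
            kept.append(line)
    return kept
```

To use it, read the HMM file and each `.domtblout` file as lists of lines, write the filtered lines back over the `.domtblout` file, then delete the `_padloc.csv` files and re-run PADLOC as described above. Requiring both the sequence and the domain score to reach GA mirrors what hmmsearch's `--cut_ga` option does; relax it to one score if that suits the comparison better.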

That being said, for some of these systems there are not enough experimentally verified examples to form a useful 'ground truth' for determining appropriate scoring cutoffs, so use your best judgement when determining what looks like a legit hit or not!

Cheers

jpjarnoux commented 1 month ago

Hi, thanks for your reply. Indeed, I now run the annotation separately with the DefenseFinder and PADLOC HMMs and join the results before the system identification step. I agree that there is no 'ground truth'. I hope discussion and more experimental verification will bring us better results, so that comparisons can question the method rather than the HMMs. Thanks again