pagnani / ArDCA.jl

Autoregressive networks for protein
MIT License

add pseudocount 1/M/10 to p0 to avoid sequence with probability 0 #32

Closed PierreBarrat closed 7 months ago

PierreBarrat commented 7 months ago

Sometimes I want to evaluate the ArDCA likelihood of a sequence that was not seen in training. Since ArNet.p0 is currently not regularized, such a sequence can be assigned a probability of exactly 0. In some applications this causes problems: for instance, in ancestral sequence reconstruction, if the sequence at one leaf is assigned likelihood 0, every reconstruction above it must also have likelihood 0, which causes algorithms to fail.

This PR just adds a pseudocount of 1/M/10, i.e. 1/(10M), to p0, where M is the number of sequences in the training alignment. Do you think this makes sense?
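
To make the proposal concrete, here is a minimal sketch in plain Python, not the package's Julia code. It assumes that p0 is the empirical distribution of the first site, that sequences are unweighted, and that the pseudocount is applied by mixing with the uniform distribution over the q states. The smoothing rule, the toy data, and the helper names `first_site_frequencies` and `log_prob` are all hypothetical, chosen only for illustration; the actual change in this PR may differ.

```python
import math
from collections import Counter


def first_site_frequencies(first_column, q, pseudocount=0.0):
    """Frequencies of the q states at the first site, optionally smoothed.

    Assumed smoothing rule (not necessarily the one used in ArDCA.jl):
    p0 <- (1 - pseudocount) * p0 + pseudocount / q
    """
    m = len(first_column)
    counts = Counter(first_column)
    freqs = [counts.get(a, 0) / m for a in range(q)]
    return [(1 - pseudocount) * f + pseudocount / q for f in freqs]


def log_prob(p):
    """Log-probability that returns -inf instead of raising for p == 0."""
    return math.log(p) if p > 0 else float("-inf")


# Toy first column of a training alignment: state 3 is never observed.
first_column = [0, 0, 1, 2, 1, 0, 2, 2]
q = 4
M = len(first_column)

p0_raw = first_site_frequencies(first_column, q)
p0_smooth = first_site_frequencies(first_column, q, pseudocount=1 / M / 10)

# Without a pseudocount, a sequence starting with state 3 has log-likelihood -inf.
print(log_prob(p0_raw[3]))
# With the pseudocount, the same state gets a small but nonzero probability.
print(log_prob(p0_smooth[3]))
```

Under this mixing rule the smoothed p0 still sums to 1, and with pseudocount 1/(10M) the observed states lose only a fraction 1/(10M) of their mass, so the likelihood of training-like sequences is barely affected.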

PierreBarrat commented 7 months ago

Added a pc field to ArVar, with the possibility of passing it as a keyword argument to the main ardca method. Since ArVar is what is passed to computep0, users can set the pseudocount this way. The only problem is that the tests no longer pass; I am trying to understand why.

PierreBarrat commented 7 months ago