pagnani / ArDCA.jl

Autoregressive networks for protein
MIT License

Likelihood calculation #20

Closed MaksimovDenisMIPT closed 2 years ago

MaksimovDenisMIPT commented 2 years ago

Thank you for the great tool!

I would like to calculate the likelihood of a new sequence (which may not be present in the initial alignment) using a trained ArDCA model (in Julia). How can this be done correctly?

pagnani commented 2 years ago

Indeed, there was no method to do so. I tagged a new version (v0.5.0) that introduces the loglikelihood method:

julia> pl = loglikelihood(vecseq,arnet);

This computes the log-likelihood of a sequence encoded as a vector of integers with the usual encoding:

  A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
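
As an illustration of the encoding table above, here is a minimal sketch of a helper that turns an amino-acid string into the corresponding integer vector. The names `AA_ALPHABET`, `AA_CODE` and `encode_sequence` are hypothetical and not part of ArDCA.jl; the table does not say how gaps or non-standard residues are coded, so this sketch simply throws a `KeyError` on any character outside the 20 letters listed.

```julia
# Hypothetical helper, not part of ArDCA.jl: maps each amino-acid letter
# to the integer code given in the table above (A => 1, ..., Y => 20).
const AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
const AA_CODE = Dict(c => i for (i, c) in enumerate(AA_ALPHABET))

# Encode a sequence string as a Vector{Int}.
# Characters outside the table (gaps, X, B, ...) raise a KeyError,
# because the table above does not define a code for them.
function encode_sequence(seq::AbstractString)
    return [AA_CODE[c] for c in uppercase(seq)]
end

vecseq = encode_sequence("ACDY")   # four residues -> four integer codes
```

The resulting `vecseq` is the kind of integer vector that the `loglikelihood(vecseq, arnet)` call above expects, assuming its length matches the alignment the model was trained on.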

If you have many sequences, you can put them in an N x M matrix, where N is the sequence length and M is the number of sequences, and use the same syntax:

julia> vecpl = loglikelihood(matrseq,arnet);

vecpl is a vector of length M.
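
For reference, a minimal sketch of how such an N x M matrix could be assembled from individual integer-encoded sequences of equal length. The sequences here are arbitrary placeholder values, and `arnet` in the commented-out call stands for a trained model that is assumed to exist already.

```julia
# Three placeholder sequences of length N = 4, already integer-encoded.
seqs = [[1, 2, 3, 20], [5, 6, 7, 8], [9, 10, 11, 12]]

# Stack them as columns: the result is N x M (here 4 x 3),
# i.e. one column per sequence, as described above.
matrseq = reduce(hcat, seqs)

# With a trained model `arnet` available (assumed, not constructed here):
# vecpl = loglikelihood(matrseq, arnet)   # vector of length M
```

Note the orientation: sequences run along columns, not rows, so a matrix read in row-per-sequence form would need to be transposed (for example with `permutedims`) first.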

If you have a sequence as a String, you can pass it as well:

julia> pl = loglikelihood(strseq,arnet);

To update the package, type ] to enter the package manager and run update.

Let me know if you have problems. Be aware that computing the loglikelihood of sequences that are too different from the data on which the network was trained often produces -Inf values. That is fairly normal for this type of network.
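
Since -Inf values can appear in the output, one possible way to handle them downstream is to separate finite from non-finite entries before computing any summary. This is only a sketch: the numbers in `vecpl` are made-up placeholders standing in for the result of `loglikelihood(matrseq, arnet)`, not real model output.

```julia
# Placeholder values for illustration only; in practice this would be
# vecpl = loglikelihood(matrseq, arnet)
vecpl = [-123.4, -Inf, -98.7]

# Mark which sequences received a finite log-likelihood.
finite_mask = isfinite.(vecpl)

# Number of sequences scored as -Inf (or otherwise non-finite).
n_nonfinite = count(!, finite_mask)

# Average over the finite entries only, so one -Inf does not dominate.
mean_finite = sum(vecpl[finite_mask]) / count(finite_mask)
```

Whether to drop, flag, or report the non-finite entries depends on the application; this sketch only shows how to keep them from propagating into an average.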

pagnani commented 2 years ago

Closing for now ... Feel free to reopen in case ....