mschauer / CausalInference.jl

Causal inference, graphical models and structure learning in Julia
https://mschauer.github.io/CausalInference.jl/latest/

Add exact score based algorithm for causal structure learning #128

Closed mwien closed 8 months ago

mwien commented 8 months ago

Implemented an exact score based causal-structure/Bayesian-network learning algorithm. For any decomposable scoring function it computes the CPDAG with the optimal score.

The implementation is based on https://arxiv.org/pdf/1206.6875.pdf. The main workload of the algorithm is computing the local score for every variable and every possible parent set (amounting to $n \cdot 2^{n-1}$ evaluations of the scoring function). The algorithm likely scales to 20-25 variables (beyond that, memory becomes an issue; we could address this by storing some of the preprocessing to disk), but I haven't done thorough testing yet. This covers many practical settings.
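The enumeration behind that count can be sketched as follows (a Python illustration of the combinatorics, not the package's Julia code; all names here are mine):

```python
from itertools import combinations

def candidate_parent_sets(n):
    """Yield (vertex, parent_set) for every vertex and every subset of the
    remaining n - 1 vertices; an exact algorithm scores each such pair once."""
    for v in range(n):
        others = [u for u in range(n) if u != v]
        for size in range(len(others) + 1):
            for parents in combinations(others, size):
                yield v, frozenset(parents)

# Each vertex has 2^(n-1) candidate parent sets, so n * 2^(n-1) score calls total.
n = 4
evals = sum(1 for _ in candidate_parent_sets(n))
assert evals == n * 2 ** (n - 1)  # 4 * 8 = 32 evaluations for n = 4
```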

There are many other exact score based algorithms as well. As far as I'm aware, all of them need $n \cdot 2^{n-1}$ score evaluations in the general case, so for our purposes implementing one algorithm should suffice. (The algorithms have their pros and cons mostly when the maximum number of parents $k$ is fixed beforehand, reducing the number of score evaluations to $n \cdot 2^{k}$; I find this rather artificial, but we could add it easily.)

Currently I have only used the algorithm with an oracle score; we still need to integrate it with the BIC score from GES.

mschauer commented 8 months ago

Yeah that’s nice. Only 150 lines of code is also nice. The implementation by @tomisilander is also on GitHub https://github.com/tomisilander/bene

mwien commented 8 months ago

Seems like the algorithm runs in principle: I added a modified version of the previously failing test we used for GES (using the Gaussian score with the true covariance matrix), and it runs through. It also seems to work in an oracle setting.

But I haven't done super extensive testing. It might be interesting to compare the performance to PC/GES on sample data; maybe I'll do this in the next weeks if I find the time.

Also, one minor thing: in contrast to the GES algorithm, I pass a local_score(parent, vertex) function to the algorithm instead of the score struct. But we could also do it like in GES...
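To illustrate why either interface works (a hypothetical Python sketch; the class and function names are mine, not the package's API): adapting a score object to the plain callable is a one-liner, so the choice is mostly a matter of convention.

```python
class GaussianScoreStub:
    """Stand-in for a GES-style score struct (illustrative only)."""
    def __init__(self, penalty=1.0):
        self.penalty = penalty

    def local(self, parents, vertex):
        # Placeholder scoring rule so the sketch is runnable.
        return -self.penalty * len(parents)

def as_local_score(score):
    """Adapt a score object to a plain local_score(parents, vertex) callable."""
    return lambda parents, vertex: score.local(parents, vertex)

local_score = as_local_score(GaussianScoreStub(penalty=2.0))
assert local_score({0, 1}, 2) == -4.0
```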

mschauer commented 8 months ago

The test error looks unrelated…

mwien commented 8 months ago

Yeah, sorry. I broke some stuff by accident, but the tests run through now... I hope.

The tests suggest that the algorithm finds the true CPDAG when using a naive oracle score (the size of the symmetric difference between a candidate parent set and the true parents, which recovers the true DAG, from which we then compute the CPDAG) and when using the true covariance matrix in combination with the Gaussian score. So I'm reasonably optimistic that the implementation is correct.
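That naive oracle score can be written down in a few lines (a Python sketch under my own naming, not the test code in the PR):

```python
def oracle_local_score(parents, vertex, true_parents):
    """Naive oracle score: negative size of the symmetric difference with the
    true parent set, maximized (at 0) exactly by the true parents."""
    return -len(set(parents) ^ true_parents[vertex])

# Hypothetical true DAG on vertices 1..3: 1 -> 3 <- 2
true_parents = {1: set(), 2: set(), 3: {1, 2}}
assert oracle_local_score({1, 2}, 3, true_parents) == 0  # true parent set
assert oracle_local_score({1}, 3, true_parents) == -1    # one parent missing
assert oracle_local_score(set(), 1, true_parents) == 0   # parentless vertex
```

Maximizing this score per vertex recovers the true DAG, so the optimizer's output can then be compared against the CPDAG of that DAG.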

Maybe just some sanity check on real data before merge...

It also has an option to run in parallel, and that appears to work nicely as well 🥳

I tried it with up to 23 variables and that runs fine in a few minutes on my not-so-new machine, which I think is awesome because it covers many practical use cases. The main problem for larger graphs is going to be memory: the algorithm stores a few tables of size $2^n$ and $n \cdot 2^{n-1}$ containing 64-bit floating point numbers, which easily reaches double-digit GBs for n = 25 or larger. Silander's implementation writes the tables to disk to scale to n = 29 or so, but I think we are fine with just strongly discouraging use for n = 25 or larger...
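A back-of-the-envelope version of that memory estimate (only the two table shapes mentioned above; this ignores everything else the implementation allocates, and the `tables` multiplier is my own knob for "a few" copies):

```python
def table_gib(n, tables=1):
    """Rough size in GiB of `tables` copies of Float64 tables of length
    2^n and n * 2^(n-1); a crude lower bound on the algorithm's memory."""
    return tables * 8 * (2 ** n + n * 2 ** (n - 1)) / 2 ** 30

# One pair of tables is already several GiB at n = 25; a few of them plus
# bookkeeping quickly push past ten, matching the warning above.
assert 3.3 < table_gib(25) < 3.5
assert table_gib(25, tables=3) > 10
```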

tomisilander commented 8 months ago

Hi guys, a small comment if I may. Saving some tables to disk is not just for scaling up; it is also useful for being able to quickly score any structure later (even all of them, for a small number of variables). Often the most probable structure (in the case of BD scores or fNML) is not very probable at all, so it is useful to look at the other probable models as well. Or, with large sample sizes, getting the second most probable structure may give a strong lower bound on the most probable structure. Anyway, good job guys!

mwien commented 8 months ago

> Hi guys, a small comment if you may. Saving some tables to the disk is not just for scaling up, but also useful to be able to score any, even all for small number of variables, structures quickly later. Often times the most probable (in case of BD-scores of fNML) structure is not very probable at all, so it is useful to look a the other probable models as well. Or, with large sample sizes, getting the second most probable structure may allow to have a strong lower bound the most probable structure. Anyway, good job guys!

Thanks a lot! I really enjoyed implementing the algorithm. Not often that you can just start coding up the pseudocode in the paper and it works immediately :)

That's true, maybe I'll add writing to disk as an option in the future...