google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

Persisting checkpoints of CFR solvers #149

Closed inejc closed 4 years ago

inejc commented 4 years ago

I can't seem to find any support for persisting the state of the C++ CFR solver implementations (apologies if it exists and this issue is redundant). I did find pickling support for game and state objects though.

I believe it would be useful to be able to specify where and how often checkpoints (from which self-play can continue at some later time) are saved to disk. I can think of two reasons:

  • this would make training more robust (e.g. when training takes a long time)
  • it would allow the use of interruptible cloud VMs (reducing the cost of training significantly)

I haven't spent much time on this but maybe there are other solvers where this would be useful?

elkhrt commented 4 years ago

Yes, I agree this could be generally useful to deal with preemption. I don't entirely understand the parallelism you're proposing in your second point though.

Contributions welcome!

inejc commented 4 years ago

Just saying that this feature would enable training on virtual machines that might be terminated at any time but are offered at much lower prices (e.g. GCP preemptible instances or AWS spot instances).

In any case, I will look into this and use this thread for potential questions.

inejc commented 4 years ago

A couple of follow-up questions:

  1. I see there is also no support for the serialization of Policy for later inference. Is persistence in C++ handled somehow differently, e.g. with external dependencies?

  2. Why do comments above the AveragePolicy methods for CFR solvers say that the returned policy instance should only be used during the lifetime of the solver object? As far as I can see a CFRAveragePolicy object only contains info_states_ which is passed by value and uniform_policy_ which is passed as a shared pointer, so both instances should exist after the solver object is destroyed?

  3. ExternalSamplingMCCFRSolver and OutcomeSamplingMCCFRSolver keep track of an explicit, tabular uniform_policy_ for the default policy, which seems like a lot of overhead for large games and even makes the implementation infeasible for games with sampled stochastic chance nodes. It would be nice if CFRAveragePolicy's default_policy_ were a shared_ptr<Policy> and the above CFR solvers had an additional constructor where an arbitrary Policy uniform_policy_ could be passed in (e.g. a UniformPolicy). Does that make sense or am I missing something?

  4. Are there any other obvious things I should look into when designing and planning a large game to be used with ExternalSamplingMCCFRSolver?

If the answer to 1. is no, I think the first thing for me to explore should be the serialization of Policy. This also seems like the first step towards the ability to save checkpoints, as some of the CFR solvers contain explicit default policy objects.

lanctot commented 4 years ago

A couple of follow-up questions:

  1. I see there is also no support for the serialization of Policy for later inference. Is persistence in C++ handled somehow differently, e.g. with external dependencies?

Right. We haven't used this for research that much yet since we had internal implementations. And when we did, we did not really need checkpointing.

Should be really easy to add this, though. The Python-to-C++ tabular policy conversion is already doing the necessary conversion; we just need to write the functions to convert to/from strings.
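For example, a rough sketch of what those conversion functions could look like (untested; the table type, names, and line format are just illustrative, not existing OpenSpiel API, and a real version would need a delimiter or escaping scheme that cannot collide with info-state strings):

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Illustrative table type: info-state string -> (action, probability) pairs.
using PolicyTable = std::map<std::string, std::vector<std::pair<int, double>>>;

// One line per info state: "<infostate>;<action>=<prob>,<action>=<prob>,..."
std::string PolicyToString(const PolicyTable& table) {
  std::ostringstream out;
  for (const auto& [info_state, action_probs] : table) {
    out << info_state << ";";
    for (size_t i = 0; i < action_probs.size(); ++i) {
      if (i > 0) out << ",";
      out << action_probs[i].first << "=" << action_probs[i].second;
    }
    out << "\n";
  }
  return out.str();
}

PolicyTable PolicyFromString(const std::string& serialized) {
  PolicyTable table;
  std::istringstream in(serialized);
  std::string line;
  while (std::getline(in, line)) {
    const size_t sep = line.rfind(';');  // Split off the action list.
    if (sep == std::string::npos) continue;
    std::istringstream pairs(line.substr(sep + 1));
    std::string entry;
    std::vector<std::pair<int, double>> action_probs;
    while (std::getline(pairs, entry, ',')) {
      const size_t eq = entry.find('=');
      action_probs.emplace_back(std::stoi(entry.substr(0, eq)),
                                std::stod(entry.substr(eq + 1)));
    }
    table[line.substr(0, sep)] = std::move(action_probs);
  }
  return table;
}
```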

  2. Why do comments above the AveragePolicy methods for CFR solvers say that the returned policy instance should only be used during the lifetime of the solver object? As far as I can see a CFRAveragePolicy object only contains info_states_ which is passed by value and uniform_policy_ which is passed as a shared pointer, so both instances should exist after the solver object is destroyed?

CFRAveragePolicy constructor takes in a const ref and assigns it to a const ref. Why do you think info_states_ is passed by value?

  3. ExternalSamplingMCCFRSolver and OutcomeSamplingMCCFRSolver keep track of an explicit, tabular uniform_policy_ for the default policy, which seems like a lot of overhead for large games and even makes the implementation infeasible for games with sampled stochastic chance nodes. It would be nice if CFRAveragePolicy's default_policy_ were a shared_ptr<Policy> and the above CFR solvers had an additional constructor where an arbitrary Policy uniform_policy_ could be passed in (e.g. a UniformPolicy). Does that make sense or am I missing something?

I think it was an easy fix for the case where states were not sampled, so you need to assume something there. Taking in a Policy instead doesn't fix it, I think, because TabularExploitability currently calls the policy with an infostate string (not a state), so there is no way to get legal actions. But I agree there is probably a better way; maybe just a flag in TabularExploitability that lets it assume a uniform policy over legal actions when the lookup fails. That way we don't have to store anything extra.
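Roughly, the fallback could look like this (an untested sketch with made-up names and a plain hash map standing in for the tabular policy; in reality the legal actions would have to come from wherever the caller has access to the state):

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using ActionsAndProbs = std::vector<std::pair<int, double>>;

// Look up the policy for an info state; if it is missing and the caller set
// the flag, fall back to a uniform distribution over the legal actions so
// nothing extra has to be stored for unvisited info states.
ActionsAndProbs GetStatePolicyOrUniform(
    const std::unordered_map<std::string, ActionsAndProbs>& table,
    const std::string& info_state,
    const std::vector<int>& legal_actions,
    bool assume_uniform_when_missing) {
  auto it = table.find(info_state);
  if (it != table.end()) return it->second;
  ActionsAndProbs fallback;
  if (assume_uniform_when_missing && !legal_actions.empty()) {
    const double p = 1.0 / legal_actions.size();
    for (int action : legal_actions) fallback.emplace_back(action, p);
  }
  return fallback;  // Empty if the flag is off; the caller decides what to do.
}
```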

  4. Are there any other obvious things I should look into when designing and planning a large game to be used with ExternalSamplingMCCFRSolver?

Not sure what you mean... can you be more specific? What is the use case?

lanctot commented 4 years ago

Follow-up on your third point. We should probably just change the query of the policy in TabularExploitability to be state-based. That way we can make the change you suggested. I vaguely remember that being awkward or impossible at the time but do not remember why... possibly to do with performance, but I would be happy to fix this, especially if it just means an extra flag gets passed to TabularBestResponse.

inejc commented 4 years ago

@lanctot I think a policy is already queried with the state object within tabular exploitability? I might be looking in the wrong place though, i.e. https://github.com/deepmind/open_spiel/blob/84883f1ccc9038309313d0c561ad87b17e35c1d8/open_spiel/algorithms/best_response.cc#L82 In any case, I opened https://github.com/deepmind/open_spiel/pull/154 with what I had in mind; let's continue this conversation there.

CFRAveragePolicy constructor takes in a const ref and assigns it to a const ref. Why do you think info_states_ is passed by value?

My bad, I was looking at the wrong piece of code.

Not sure what you mean... can you be more specific? What is the use case?

It's not a particularly good question... I guess I'm trying to figure out what the potential memory/computational bottlenecks are when planning to run an MCCFR solver on a game that is larger than Bridge. The issue addressed in the opened PR is one such instance, for example.

lanctot commented 4 years ago

@lanctot I think a policy is already queried with the state object within tabular exploitability?

Yes, you are right. But there is a call to ExpectedReturns which currently assumes the policy implements GetStatePolicy(const std::string& infostate_str). And when I change the default policies to be UniformPolicy (in policy.h) the tests fail in the call to this function. It should be easy to fix: we can just allow the best response code to call expected returns with the state-based queries (which btw it already supports); we just have it set to the other one currently for performance reasons.

Not sure what you mean... can you be more specific? What is the use case?

It's not a particularly good question... I guess I'm trying to figure out what the potential memory/computational bottlenecks are when planning to run an MCCFR solver on a game that is larger than Bridge. The issue addressed in the opened PR is one such instance, for example.

Great, your PR should get you out of having to store that policy, which was the only inefficiency at the moment. It should run on large games now without a problem, but you won't be able to call the best response (which I guess you won't need). However, keep in mind that every iteration of external sampling visits around sqrt(|Z|) terminal histories (for a two-player balanced game), so if your game is very large, it might not even be able to complete multiple iterations (and will add a lot of the game to memory).

Outcome sampling, on the other hand, only does one trajectory from start to finish, so only adds O(|trajectory|) data to the hash table each iteration.

Two papers worth considering:

If your game is large enough, then you might want to do some reinforcement learning with function approximation rather than use tabular methods (or maybe you're looking into combining these?) I can point you to some relevant papers on the subject!

lanctot commented 4 years ago

You can also do game-specific abstractions (grouping information states to shrink the state space), which have worked quite well with sampling / tabular methods in the past for Poker AI.

inejc commented 4 years ago

First of all, @lanctot, thank you for all the suggestions. I'm interested in combining CFR with RL so I would appreciate the relevant literature regarding this as well.

As far as the original issue is concerned, here are some comments/questions:

  1. I plan to add (de)serialization to CFRSolverBase, OutcomeSamplingMCCFRSolver and ExternalSamplingMCCFRSolver
  2. we need to be able to serialize CFRInfoStateValuesTable and Policy
  3. I think we can skip serialization of std::mt19937 within the two MCCFR algorithms
  4. we need to be able to serialize AverageType for external sampling MCCFR
  5. due to TabularBestResponse I want to skip adding serialization support to CFRBRSolver in the first iteration
  6. given that game and state serialization is string-based rather than binary, we'd ideally use the same approach (rough sketch below)

Please let me know if you have any comments regarding the points above.
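Regarding points 2 and 6, here is a rough sketch of the kind of text-based checkpoint writing I have in mind. The value-table layout, names, and line format are placeholders, not the actual CFRInfoStateValuesTable; the two details I care about are printing doubles with max_digits10 so they round-trip exactly, and writing to a temporary file followed by a rename so a preempted VM never leaves a half-written checkpoint:

```cpp
#include <filesystem>
#include <fstream>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the per-info-state values a CFR solver accumulates
// (e.g. cumulative regrets and cumulative policy), keyed by info-state string.
using ValuesTable = std::unordered_map<std::string, std::vector<double>>;

// Write one line per info state, then rename the temporary file over the
// target so an interrupted write never corrupts the last good checkpoint.
void SaveCheckpoint(const ValuesTable& table, const std::string& path) {
  const std::string tmp_path = path + ".tmp";
  {
    std::ofstream out(tmp_path);
    // Print doubles with max_digits10 so they can be read back exactly.
    out.precision(std::numeric_limits<double>::max_digits10);
    for (const auto& [info_state, values] : table) {
      out << info_state << ";";
      for (size_t i = 0; i < values.size(); ++i) {
        if (i > 0) out << ",";
        out << values[i];
      }
      out << "\n";
    }
  }  // ofstream flushed and closed here, before the rename.
  std::filesystem::rename(tmp_path, path);
}
```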

lanctot commented 4 years ago

Hi @inejc , sorry for the late response. This all seems good to me.

3. I think we can skip serialization of `std::mt19937` within the two MCCFR algorithms

I think it should be easy to "serialize" the RNG by just saving its current state and reloading it? Seems like this could be one line, see here: https://stackoverflow.com/questions/27727012/c-stdmt19937-and-rng-state-save-load-portability/27728971#27728971 . I agree it's maybe not the highest priority, so we can add it after your contribution if it turns out to be harder than it looks.
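Roughly, since std::mt19937 defines stream insertion and extraction for its state, it would be something like this (untested sketch; the helper names are made up):

```cpp
#include <random>
#include <sstream>
#include <string>

// std::mt19937 defines operator<< / operator>> for its internal state, so
// checkpointing the RNG is just streaming it to and from a string.
std::string SerializeRng(const std::mt19937& rng) {
  std::ostringstream out;
  out << rng;
  return out.str();
}

std::mt19937 DeserializeRng(const std::string& serialized) {
  std::mt19937 rng;
  std::istringstream in(serialized);
  in >> rng;
  return rng;
}
```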

5. due to `TabularBestResponse` I want to skip adding serialization support to `CFRBRSolver` in the first iteration

Sorry, I don't understand. Can you elaborate?

6. given that game and state serialization is string-based rather than binary, we'd ideally use the same approach

I agree. I like having readable text-based serialization as the default. If people want to compress the text, or write custom serializers, they're welcome to.

I'm interested in combining CFR with RL so I would appreciate the relevant literature regarding this as well.

You've definitely come to the right place. :) Here is the list of papers I recommend:

I may have missed some and I'm sure others will be sure to let me know which ones ;-)

KK666-AI commented 4 years ago

Dear @lanctot, I am the author of Double Neural CFR. I am using open_spiel. I think open_spiel is a good environment and it will benefit both the game theory and reinforcement learning communities a lot.

lanctot commented 4 years ago

Dear @lanctot , I am the author of Double Neural CFR.

I would love to see an implementation of DNCFR in OpenSpiel... it's ok if it is based on PyTorch. :)

lanctot commented 4 years ago

Follow-up on your third point. We should probably just change the query of the policy in TabularExploitability to be state-based. That way we can make the change you suggested. I vaguely remember that being awkward or impossible at the time but do not remember why... possibly to do with performance, but I would be happy to fix this, especially if it just means an extra flag gets passed to TabularBestResponse.

Quick follow-up, I've done this now in https://github.com/deepmind/open_spiel/commit/c2be1951587b2879b04ef986706652275256081a and changed the implementations of the Monte Carlo CFR so that they don't have to store an explicit uniform policy.

inejc commented 4 years ago

Thanks for the follow-up. I shall finally find the time to tackle the issue in the coming weeks. Regarding my point 5 above, I just wanted to check how much additional effort would go into serializing TabularBestResponse, and to say that I might do that in a separate PR.

And thank you very much for all the paper recommendations!

lanctot commented 4 years ago

Regarding my point 5 above, I just wanted to check how much additional effort would go into serializing TabularBestResponse, and to say that I might do that in a separate PR.

Sorry, I still don't understand the original point you made (point 5 in your March 10th response). Can you explain what you mean? What requires special consideration in CFR-BR? Do you mean TabularPolicy instead of TabularBestResponse? If so, we can very quickly write a serializer + deserializer for TabularPolicy, and I think it'd be a good thing to have anyway.

lanctot commented 4 years ago

Oh, I just noticed this: https://github.com/deepmind/open_spiel/blob/c2be1951587b2879b04ef986706652275256081a/open_spiel/algorithms/cfr_br.h#L43 I guess that can be changed now to just UniformPolicy... I'm not sure why that's a tabular policy; I may have incorrectly assumed our best response algorithm required the policies to be tabular.

Do you mean the best_response_computers_ on the next line? Those should not hold any state and should be able to be re-created fresh when loading from checkpoint.

inejc commented 4 years ago

@lanctot I had best_response_computers_ in mind since I assumed there is some state kept there. Anyway, this simplifies things then. Regarding TabularPolicy uniform_policy_, since the MCCFR solver instances keep a Policy, I think I will need to add serializers for Policy in any case.

I will also try changing TabularPolicy uniform_policy_ to UniformPolicy uniform_policy_ in CFRBRSolver.

lanctot commented 4 years ago

Closing this as we are actively integrating https://github.com/deepmind/open_spiel/pull/222