Sampling Set as in ApproxMC

sirwhinesalot commented 2 years ago

Would it be possible to add sampling set support to CMSGen like in ApproxMC? In the problems I typically work with only a subset of variables is relevant while the rest are just auxiliary.

msoos commented 1 year ago

Hi,

First of all, sorry for the incredibly late response -- I haven't seen this issue and so I didn't even know it was there! Ayyy. Sorry. Anyway... it's not so easy. I guess it can be done, but I am not sure how good it would be ? I mean, I can hack it, but the quality of the samples may be bad. Likely it'd still be miles and miles above all the other almost-uniform samplers. But I am always afraid about quality, it's kinda hard to measure. I'd need to measure it with ScalBarbarik (https://github.com/meelgroup/scalbarbarik/), which is our tool, so that's not impossible, but some extra work, and AFAIK it doesn't yet support projection :sob:

Anyway. What do you need it for? Maybe if you explain, I can perhaps think about it? The other option is to make sure that your auxiliary variables are FULLY determined by your non-auxiliary variables. Then, they don't matter, so the samples would look the same either way. In other words, if the set of solutions is EXACTLY the same when projected over the non-auxiliary variables as not projected over them, then you can use CMSGen as-is and ignore the values over the auxiliary variables.

I hope the above helped. Let me know a bit more detail, and I'll maybe work on this. I'll keep this open in the meanwhile. I am super-sorry for the late answer. I usually respond within a day or so... OOps. Cheers,

Mate

sirwhinesalot commented 1 year ago

Hi Mate,

No worries about the late response, it's nothing critical I was just playing around with CMSGen. I work on hardware synthesis problems (which boil down to representing the presence of connections and components as boolean variables) and I was investigating different technologies and approaches to find all possible solutions (or at least a nice large uniform sample).

The thing is, the actual constraints for the problem use a lot of pseudo-boolean/cardinality/symmetry-breaking constraints, which require tons of auxiliary variables in SAT. If you're curious right now the winning solver/tech is Gecode (CP) which finds all solutions to the problem faster than ApproxMC can estimate them :)

msoos commented 1 year ago

Nice! Thanks for all the details, sounds interesting :) Can you send me some of these problems? Maybe I can check what's up with ApproxMC? Also, AppMC has been SIGNIFICANTLY improved the past 8 months. It should be about 100x faster on average. Maybe you can check it once more? Either way, I'd be curious of the instances! I'd love to improve AppMC on them :)

Mate

On Thu, Nov 24, 2022, 10:49 César Santos @.***> wrote:

Hi Mate,

No worries about the late response, it's nothing critical I was just playing around with CMSGen. I work on hardware synthesis problems (which boil down to representing the presence of connections and components as boolean variables) and I was investigating different technologies and approaches to find all possible solutions (or at least a nice large uniform sample).

The thing is, the actual constraints for the problem use a lot of pseudo-boolean/cardinality/symmetry-breaking constraints, which require tons of auxiliary variables in SAT. If you're curious right now the winning solver/tech is Gecode (CP) which finds all solutions to the problem faster than ApproxMC can estimate them :)

— Reply to this email directly, view it on GitHub https://github.com/meelgroup/cmsgen/issues/1#issuecomment-1326205838, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKF4OLQ3NZAJKFL3A3VVJDWJ423ZANCNFSM5M3S2DOA . You are receiving this because you commented.Message ID: @.***>

sirwhinesalot commented 1 year ago

Can't provide anything at the moment (can't send work examples), but I'll send some personal examples your way when I can.

sirwhinesalot commented 1 year ago

Hi again Mate,

I'm not sure what the best way to send you the problem is, but the nqueens problem is very representative of the issue. With cpmpy:

def test_queens_gecode():
    n = 13
    s = SolverLookup.get("minizinc:gecode")
    queens = boolvar((n, n), "q")
    # exactly 1 queen per row
    for i in range(0, n):
        s += sum(queens[i, :]) == 1
    # exactly 1 queen per column
    for i in range(0, n):
        s += sum(queens[:, i]) == 1
    # max 1 queen per \ diagonal
    for i in range(-(n - 1), (n - 1) + 1):
        s += sum(queens.diagonal(i)) <= 1
    # max 1 queen per / diagonal
    mirrored = np.flipud(queens)
    for i in range(-(n - 1), (n - 1) + 1):
        s += sum(mirrored.diagonal(i)) <= 1
    assert(s.solveAll() == 73712)

Takes 30 seconds on my M1 pro mac running on battery to compute all solutions (not just count them).

Using pysat to model the problem + approxmc to do approximate model counting:

def test_nqueens_approxmc():
    n = 13
    queens = np.arange(1, n * n + 1).reshape((n, n))
    cnf = pysat.formula.CNF()
    vpool = pysat.formula.IDPool(n * n + 1)
    card = pysat.card.CardEnc
    for i in range(0, n):
        cnf.clauses.extend(card.equals(queens[i, :].tolist(), 1, vpool=vpool))
    # exactly 1 queen per column
    for i in range(0, n):
        cnf.clauses.extend(card.equals(queens[:, i].tolist(), 1, vpool=vpool))
    # max 1 queen per \ diagonal
    for i in range(-(n - 1), (n - 1) + 1):
        cnf.clauses.extend(card.atmost(queens.diagonal(i).tolist(), 1, vpool=vpool))
    # max 1 queen per / diagonal
    mirrored = np.flipud(queens)
    for i in range(-(n - 1), (n - 1) + 1):
        cnf.clauses.extend(card.atmost(mirrored.diagonal(i).tolist(), 1, vpool=vpool))
    approxmc = pyapproxmc.Counter()
    for c in cnf.clauses:
        approxmc.add_clause(c)
    c, h = approxmc.count(queens.flatten().tolist())
    count = c * 2 ** h
    assert(math.isclose(count, 73712, rel_tol=200))

This takes over 6 and a half minutes to compute an approximate solution.

Perhaps the issue is only for "small" N? At which solution count should I expect ApproxMC to beat a CP solver? (it takes minisat 3 minutes to compute all solutions to this problem, for comparison)

msoos commented 1 year ago

Oh, yeah. Highly symmetric problems are gonna give you a lot of trouble. What you can do, though, is figure out on which axis the system is symmetric, and encode only one of the K versions. Then you can multiply the solution by K. Think of it this way: you have the Rubik's cube, and you want to count the number of different configurations. Well, you can count all configurations when green is a the top, and then multiply the count you get by 6 -- since obviously you could put the green on any of the 6 sides. This sounds trivial/simple, but of course you can do more. Once you fixed the green on the top, you can also fix one of the colors on the side -- this gives a factor of 4. So now you have to multiply by 64. These things add up very quickly (since it's multiplication), and they will radically improve your performance. In this case, it will be much more than 64=24x faster. I'm expecting it to be ~100x faster.

If you dig yourself a bit into symmetries, the above can also be automatically found for you by tools. Ther is one that I maintain (but which I didn't write in large part): https://github.com/meelgroup/breakid You run it on a CNF and it will spit out the symmetry groups. In fact, that repository has a few N-queen coloring problems, see examples/coloring/cnfs/queen*.

Let me know if the above helped,

Mate

sirwhinesalot commented 1 year ago

That sounds like a great idea actually. I didn't think of it. But at least for nqueens (and similar), a lot of symmetries can be removed by lex-leader constraints. I left them out of the example as they would make it a lot bigger. That said, this means that for pure counting, rather than adding such constraints, I can just lock-in certain choices and remember to multiply the solution count as you describe.

I'll mess around with breakid once I find the time, many thanks!

msoos commented 1 year ago

Yep, exactly! This can make (and DOES make) a HUGE difference. Like, things that would take years would finish in seconds. You'll see.

Good luck, and I'm glad I could help,

Mate

meelgroup / cmsgen

Sampling Set as in ApproxMC #1