mikepound / opencubes

A community improved version of the polycubes project!
MIT License

Could we replace speed with a distributed system? #1

Open matthunz opened 1 year ago

matthunz commented 1 year ago

Instead of trying to compact the memory layout of the cubes, would it be possible to use a shared database to store the hashes of each shape? This almost feels like a blockchain-style problem where everyone tries to guess the next hash; in this case it's just the next shape.
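
As a rough sketch of that shared-database idea (sqlite3 here is only a stand-in for a real shared store, and the table layout is made up for illustration): every worker reduces a shape to an orientation-independent hash and tries to insert it; if the insert changes nothing, the shape was already known.

```python
import sqlite3

# Sketch only: sqlite3 stands in for whatever shared server-side store
# would actually be used; the schema is illustrative, not project code.
conn = sqlite3.connect("shapes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS shapes (n INTEGER, hash BLOB, PRIMARY KEY (n, hash))"
)

def try_register(n: int, canonical_hash: bytes) -> bool:
    """Record a shape hash; return True only if it was not already present."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO shapes (n, hash) VALUES (?, ?)",
        (n, canonical_hash),
    )
    conn.commit()
    return conn.total_changes > before
```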

We could even make this a big distributed Numberphile viewer calculation!

bertie2 commented 1 year ago

Distribution is almost certainly the name of the game if we want to get much beyond n = 16.

Just to confirm: when you say "database", do you mean off-the-shelf database software or something custom?

Off-the-shelf database software will have a hard time due to the truly MASSIVE read load involved in finding a new cube. If someone wants to implement it and give it a benchmark I would be very interested to see the results; I might well be wrong, but I suspect at least a partial large cache on the client end will be mandatory for good performance.
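
A minimal sketch of that "large cache on the client end" idea, with `remote_register` standing in (here faked by a local set) for whatever call the shared database would actually expose: the client keeps a set of hashes it has already produced, so only hashes that are new to this client ever cause a database round trip.

```python
# Sketch only: remote_register is a placeholder for the real server call
# that records a hash and reports whether it was globally new.  A local
# set fakes the server here so the example runs on its own.
_fake_server = set()

def remote_register(canonical_hash):
    """Pretend server call: record the hash, report whether it was new."""
    if canonical_hash in _fake_server:
        return False
    _fake_server.add(canonical_hash)
    return True

local_seen = set()

def is_new_shape(canonical_hash):
    """Check the local cache first; only hashes new to this client hit the server."""
    if canonical_hash in local_seen:
        return False
    local_seen.add(canonical_hash)
    return remote_register(canonical_hash)
```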

Something custom which hands out chunks of work, similar to SETI@home, is however what my current work is leading towards.

matthunz commented 1 year ago

That's a great point. I was thinking of just an SQL DB, but having a local cache would save a lot of time and bandwidth.

I wonder how that could work, though. Handing out chunks seems hard because there's no clear way I can see to shard the data. If a peer generated a new shape, I feel like our only option would be to read from the entire set of previous shapes.

EDIT: Actually, we would only have to check against shapes with the same number of cubes; maybe we could send a subset of the data containing only shapes of that size.
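
A small sketch of that point (placeholder structure, not project code): if the known shapes are stored in per-size buckets, a worker generating candidates with n cubes only ever needs the bucket for n, which is exactly the subset you would ship to it.

```python
from collections import defaultdict

# Sketch only: canonical hashes of known shapes, bucketed by cube count.
hashes_by_size = defaultdict(set)

def is_known(n, canonical_hash):
    """A candidate with n cubes can only collide with other n-cube shapes."""
    return canonical_hash in hashes_by_size[n]

def subset_for_worker(n):
    """The only data a worker producing n-cube candidates needs to see."""
    return hashes_by_size[n]
```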

bertie2 commented 1 year ago

For context, when I tried to do n=14 it did this: ahhhh-my-ram. And that was just trying to store the n=14 cubes.

Yes, we should partition data based on cube size, but we will also need to either massively improve the density with which we can store cubes, or further partition it across clients so that the data is manageable. If you have a canonical hash for a given cube, you can actually just have the clients work from partial data sets: they will create a bunch of duplicate cubes, but they weed out 90% of the cubes and you merge the last 10% on the server.
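
A hedged sketch of that "partial data sets, merge on the server" flow (nothing here is existing project code; `client_results` is just a list of per-client hash sets): because every shape is reduced to a canonical hash before it is reported, the duplicates that different clients inevitably rediscover collapse in a plain set union on the server.

```python
def merge_on_server(client_results):
    """Union the per-client sets of canonical hashes.

    Clients working from partial data will each report some shapes the
    others also found; identical canonical hashes collapse here, so the
    server keeps only one copy of every shape.
    """
    unique = set()
    for client_set in client_results:
        unique |= client_set
    return unique

# Example: two clients rediscover the same hash b"abc"; the union keeps one.
assert merge_on_server([{b"abc", b"def"}, {b"abc", b"xyz"}]) == {b"abc", b"def", b"xyz"}
```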

evanrelf commented 1 year ago

Might a bloom filter or cuckoo filter be useful for this?
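
For what a first-pass filter could look like: below is a minimal, standard-library-only Bloom filter sketch (not code from this repo). It can say "definitely never seen" with no false negatives, so only the small fraction of hashes it flags as "maybe seen" would need the full duplicate check; the sizing defaults here are arbitrary and would have to be tuned to the expected cube counts.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 27, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from one blake2b digest
        # using the usual double-hashing trick.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        # False means definitely new; True means "maybe seen", so do the full check.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))
```

False positives only cost an extra exact check, so the filter could be sized fairly aggressively.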

bertie2 commented 1 year ago

If we can come up with a faster way to rotate the cubes, or a fast orientation-independent hash function, then yes. Either would make an excellent first pass to narrow down the search space for the actual full checks.

Otherwise, for now, most of the time is spent rotating the cube all over the place to check for duplicates in different orientations, not in the actual set lookups to check for duplicates themselves.

Sorry, in fact I feel I should clarify: most time right now is spent on a combination of several steps to manipulate the cubes (rotating, cropping, hashing). It's still not just the set lookups which are the problem, but it's more complex than my initial comment made out.
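
For reference, this is roughly what that hot path looks like, and how an orientation-independent key falls out of it, in a hedged numpy sketch (not the repo's exact code): enumerate all 24 axis-aligned rotations of the occupancy array and take the lexicographically smallest (shape, bytes) pair as the canonical key, so two cubes are duplicates exactly when their keys match. The 24 rotations are what makes this loop expensive.

```python
import numpy as np

def rotations24(a):
    """Yield all 24 axis-aligned rotations of a 3D occupancy array."""
    # Six ways to point axis 0, then four spins about that axis.
    bases = [
        (a, (1, 2)),
        (np.rot90(a, 2, (0, 2)), (1, 2)),
        (np.rot90(a, 1, (0, 2)), (0, 1)),
        (np.rot90(a, 3, (0, 2)), (0, 1)),
        (np.rot90(a, 1, (0, 1)), (0, 2)),
        (np.rot90(a, 3, (0, 1)), (0, 2)),
    ]
    for base, spin_axes in bases:
        for k in range(4):
            yield np.rot90(base, k, spin_axes)

def canonical_key(a):
    """Orientation-independent key: smallest (shape, raw bytes) over all rotations.

    Assumes the array has already been cropped to its bounding box.
    """
    return min((r.shape, r.tobytes()) for r in rotations24(a))
```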

Finally, georgedorn at https://github.com/mikepound/cubes/pull/5 came up with a great way to reduce these steps by allocating bigger sets, but my current implementation is memory- and memory-bandwidth-limited at n=13+, and so I have not currently included those changes in my testing.

mikepound commented 1 year ago

If there were a standard storage implementation for cubes, and then a way to order these, could we have all the cubes at n-1 sorted and indexed? Then a distributed node could calculate, e.g., all the new cubes from indexes 0-10000, i.e. all the unique shapes created by adding one cube to each of the polycubes at n-1 from indexes 0-10000. Then you would have a series of sets from all nodes, and @evanrelf's suggestion of a bloom or cuckoo filter could then be used to find the unique final set from these?
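
That scheme could look roughly like the sketch below (hedged: `expand_by_one` and `canonical_key` are placeholder callables standing in for whatever the standard storage implementation would provide, so they are passed in as arguments): each node takes a contiguous index range of the sorted n-1 list, generates every shape reachable by adding one cube, dedupes locally by canonical key, and ships back a set; the coordinator then only has to merge the per-node sets, possibly with a Bloom or cuckoo filter as a prefilter before the exact comparison.

```python
# Sketch only: expand_by_one and canonical_key are placeholders for the
# (not yet agreed) standard cube representation and hashing scheme.
def node_job(sorted_prev, start, stop, expand_by_one, canonical_key):
    """Compute the candidate n-cubes reachable from indexes [start, stop)."""
    found = set()
    for shape in sorted_prev[start:stop]:
        for candidate in expand_by_one(shape):
            found.add(canonical_key(candidate))
    return found

def coordinate(sorted_prev, chunk_size, expand_by_one, canonical_key):
    """Run the chunks (serially here; on distributed nodes in reality) and merge."""
    unique = set()
    for start in range(0, len(sorted_prev), chunk_size):
        unique |= node_job(sorted_prev, start, start + chunk_size,
                           expand_by_one, canonical_key)
    return unique
```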