Initialize and support very large systems

jglaser commented 4 years ago

Description

We would like to enable simulations of very large systems (>4B particles), as well as their initialization. In a two-stage approach, we should address breaking changes for the user in 3.0, and implement the changes that actually support these systems in subsequent releases.

Proposed solution

Here's a potentially incomplete list of issues that we need to tackle:

Breaking changes (tentatively for 3.0):

- particle tags need to be uint64_t (local indices per rank are kept as uint32_t)
- the snapshot should store a lookup table (tag_map) from particle tags into snapshot indices, using a std::map<>, which on the Python side should be exported with a dict-like interface using pybind11's STL bindings. While access in snapshot order would still be possible, converting the snapshot positions, e.g., into tag order would work like this:
  ```
  pos_sorted = snap.pos[np.array([tag_map[tag] for tag in sorted(tag_map)])]
  ```
  This also solves the problem of finding particles in snapshots for simulations in which particles are removed.
- bond, angle, dihedral, ... tags should be uint64_t as well
- the GSD file spec should be updated (2.0) to support the new wide tags, with backwards compatibility to read in GSD 1.x files (file sizes may be slightly larger)
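To make the tag_map idea concrete, here is a minimal pure-Python sketch. It assumes tag_map behaves like a dict from particle tag to snapshot index, as proposed above. The names `snap_pos` and `positions_in_tag_order` are hypothetical stand-ins rather than HOOMD API, and plain lists take the place of NumPy arrays.

```python
# Hypothetical snapshot of a system in which the particle with tag 1 was removed.
# snap_pos[i] is the position stored at snapshot index i.
snap_pos = [(0.5, 0.0, 0.0), (0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]

# tag_map: particle tag -> snapshot index (assumed dict-like, as proposed).
tag_map = {3: 0, 0: 1, 2: 2}


def positions_in_tag_order(positions, tag_map):
    """Return the positions reordered by ascending particle tag."""
    return [positions[tag_map[tag]] for tag in sorted(tag_map)]


pos_sorted = positions_in_tag_order(snap_pos, tag_map)

# Finding a single particle by tag; a removed tag is simply absent from the map.
index_of_tag_2 = tag_map[2]
tag_1_present = 1 in tag_map
```

Sorting happens on the keys (tags), not on the values, because the values are just the snapshot indices 0..N-1.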

Full support for large systems (later releases):

- internally, the tag lookup in CUDA kernels should be implemented using a hash table. Currently it is implemented as a 1:1 hash (rtag), which is cheap to construct but consumes O(N) memory. We expect that the hash table lookup would be no more difficult than a binary search in the kernel, but construction of the lookup data structure may be slower (though it would happen infrequently).
- All other classes which break O(N/P) memory scaling should be treated similarly (neighbor list exclusions, ...)
- partial snapshots for initialization: every rank would obtain its local snapshot, which is specific to a domain decomposition
- currently only the snapshot on rank zero is non-empty. We could leave this as the default, but return partial snapshots with take_snapshot() only when the local=True option is supplied. We could even implement and document the option in 3.0 and make it produce an error until fully implemented.
- the domain decomposition as well as the global box should be stored in the snapshot(s?), to ensure that fast loads are possible when the system is reinitialized with the same grid. If the number of ranks and/or the domain decomposition layout are different, a scatter/gather approach should still enable initialization from partial snapshots, at the cost of a redistribution of particles
- parallel file I/O: we should look into what is necessary to enable parallel I/O in GSD (if there are any breaking changes required, we should identify those and prioritize them)
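As a rough illustration of the hash-table tag lookup mentioned in the first item, the following Python sketch builds an open-addressing table (linear probing) from the tags owned by one rank. It is not HOOMD's implementation: the function names, the multiplicative hash constant, and the load factor are all assumptions chosen for the example. The point is only that the table size scales with the number of local particles rather than with the global N.

```python
EMPTY = -1  # sentinel marking an unused slot (valid tags are non-negative)


def hash_tag(tag):
    """Mix a 64-bit tag with a multiplicative hash (constant is illustrative)."""
    return ((tag * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) >> 32


def build_tag_table(local_tags, load_factor=0.5):
    """Build a table mapping global tag -> local index for unique local tags."""
    capacity = 1
    while capacity * load_factor < max(len(local_tags), 1):
        capacity *= 2  # power-of-two capacity so a mask replaces the modulo
    keys = [EMPTY] * capacity
    values = [EMPTY] * capacity
    mask = capacity - 1
    for local_index, tag in enumerate(local_tags):
        slot = hash_tag(tag) & mask
        while keys[slot] != EMPTY:  # linear probing on collision
            slot = (slot + 1) & mask
        keys[slot] = tag
        values[slot] = local_index
    return keys, values


def lookup(keys, values, tag):
    """Return the local index of tag, or EMPTY if this rank does not own it."""
    mask = len(keys) - 1
    slot = hash_tag(tag) & mask
    while keys[slot] != EMPTY:
        if keys[slot] == tag:
            return values[slot]
        slot = (slot + 1) & mask
    return EMPTY


# Three local particles whose global tags do not fit a dense O(N) rtag array.
local_tags = [5_000_000_000, 7, 123_456_789_012]
keys, values = build_tag_table(local_tags)
```

Because the load factor stays at or below one half, every probe sequence reaches an empty slot, so lookups of absent tags terminate.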

Additional context

@joaander, @jglaser and @InnocentBug discussed this in the context of enabling very large scale simulations of polymeric systems, but these changes will of course be completely general.

Developer

Yes, will contribute. I welcome feedback and additional considerations I forgot to include.

mphoward commented 4 years ago

Sounds interesting, especially the partial snapshots. I'm wondering if there are also potentially subtle issues to watch out for if there are mixed uses of signed and unsigned ints related to the particle tags. Hopefully any cases of this are mostly done on the local indexes rather than tags, though, so that these can be ignored.
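A small illustration of the kind of signed/unsigned pitfall meant here, using Python's ctypes integer types (which perform no overflow checking) to mimic C conversions. The tag values are made up for the example.

```python
import ctypes

tag = 3_000_000_000  # a valid unsigned 32-bit tag above 2**31 - 1

as_uint32 = ctypes.c_uint32(tag).value  # value is preserved
as_int32 = ctypes.c_int32(tag).value    # wraps around to a negative number

wide_tag = 5_000_000_000  # needs more than 32 bits
truncated = ctypes.c_uint32(wide_tag).value  # high bits are silently dropped
```

Both failure modes are silent, which is why any place that stores a tag in a signed or 32-bit variable would need to be audited.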

joaander commented 4 years ago

Random seeds that use particle tags will need to be updated as well. Philox4x32 takes in 24 bytes of seeds. DPD is already using 24 and we also need to go to 64-bit timestep counters. I haven't looked at other cases where we seed using particle tags, but DPD is probably the worst case as it needs 2 tags.

One solution (for both timestep and tags) would be to store the values in 64-bit quantities but only allow values up to a certain maximum. Say we only allowed up to 40 bits (1 trillion particles or time steps). We could use fewer bits to identify the unique RNGs in HOOMD (16 should be enough) and mix the high bits of the tags into this seed. We could also do the same with the internal counter, but I'd be concerned about cases where the internal counter is used to generate a large stream. We certainly can't limit it to only 16 bits, but maybe 24 (16 million) is enough? We could also limit the user seed input to less than 32 bits to make room for the additional bits from the timestep. Would 16 million user seeds be sufficient?
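The bit budget above can be sketched as follows. This is only an illustration under assumed layouts: the thread does not fix where each field sits within a word, and `split_40` and `mix_id_word` are hypothetical helper names.

```python
MAX_BITS = 40
MAX_VALUE = (1 << MAX_BITS) - 1  # 1_099_511_627_775, roughly 1.1 trillion


def split_40(value):
    """Split a value limited to 40 bits into (low 32 bits, high 8 bits)."""
    if not 0 <= value <= MAX_VALUE:
        raise ValueError("value does not fit in 40 bits")
    return value & 0xFFFFFFFF, value >> 32


def mix_id_word(rng_id, tag_a, tag_b):
    """Pack a 16-bit RNG id and the high 8 bits of two tags into 32 bits.

    Assumed layout: [rng_id:16][high bits of tag_a:8][high bits of tag_b:8].
    """
    if not 0 <= rng_id < (1 << 16):
        raise ValueError("rng_id does not fit in 16 bits")
    _, high_a = split_40(tag_a)
    _, high_b = split_40(tag_b)
    return (rng_id << 16) | (high_a << 8) | high_b


low, high = split_40(5_000_000_000)
id_word = mix_id_word(0x0102, 5_000_000_000, 7)
```

With this layout a two-tag case such as DPD spends two full 32-bit words on the low tag bits and shares a third word between the RNG identifier and the high tag bits.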

@jglaser suggested feeding the output of one RNG into another to combine more seeds. I hesitate to go this route without extensive testing as it might create subtle correlations.

mphoward commented 4 years ago

Very good point. I prefer the bit mixing scheme to chaining up RNGs, as the behavior of the bit mixing is probably much easier to debug and define than it would be to try to find correlations between RNGs.

I think it would be OK to limit (1) particle tags (no researchers in the world are going to simulate 1 trillion particles anytime soon) and (2) HOOMD's internal identifiers (we have control over this, and we are not going to need 10^5 of them). I would be extremely hesitant to restrict the counter. The user seed could probably be made less than 32 bits, since most users probably choose from their favorite numbers and are not running millions of copies of the same simulation (at that point, it would be better to choose a new starting configuration seeded from system entropy than to just keep changing the seed). We would need to document this carefully, though, in case people are using the system time as a seed.

asmunder commented 4 years ago

Not sure if you want to include it on this issue, but a related problem we have encountered is that the maximum number of time steps you can use in a hoomd.run() command is 2^31 - 1 (maximum 32 bit integer) which is "only" 2.1B. We have been able to work around it for now just by running a loop where the inner command is hoomd.run(2e9) and the number of loop iterations is whatever we require, so somehow the machinery supports running for longer. Thus it feels like a fix should be simple?
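The workaround described above amounts to splitting the total step count into chunks that each fit in a signed 32-bit integer. A minimal sketch, where `run` is a stand-in for hoomd.run and the helper names are hypothetical:

```python
MAX_STEPS_PER_CALL = 2**31 - 1  # largest value of a signed 32-bit integer


def step_chunks(total_steps, chunk=2_000_000_000):
    """Yield step counts, each at most `chunk`, that sum to total_steps."""
    if not 0 < chunk <= MAX_STEPS_PER_CALL:
        raise ValueError("chunk must fit in a signed 32-bit integer")
    remaining = total_steps
    while remaining > 0:
        n = min(chunk, remaining)
        yield n
        remaining -= n


def run_long(run, total_steps):
    """Call run(n) repeatedly until total_steps steps have been requested."""
    for n in step_chunks(total_steps):
        run(n)


# Record the calls instead of running a simulation.
calls = []
run_long(calls.append, 5_000_000_000)
```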

InnocentBug commented 4 years ago

> Not sure if you want to include it on this issue, but a related problem we have encountered is that the maximum number of time steps you can use in a hoomd.run() command is 2^31 - 1 (maximum 32 bit integer) which is "only" 2.1B. We have been able to work around it for now just by running a loop where the inner command is hoomd.run(2e9) and the number of loop iterations is whatever we require, so somehow the machinery supports running for longer. Thus it feels like a fix should be simple?

Hey @asmunder, check out #229. As far as I know, this is already going to be covered with the release of v3.0.

joaander commented 4 years ago

@jglaser came up with at least one use-case where it is helpful to set the internal counter variable directly, in order to backtrack in a tree without needing a stack. This works as long as one is careful to only use distributions that sample single values from the generator, and it should be allowed in the API, though it should not be the default case. Combined with the need to more easily seed the generator with different types of bit-mixed seeds and counters, I propose the following API:

The RandomGenerator constructor takes Seed [2-byte] and Counter [4-byte] counter arguments.
Seed takes in an 8-bit RNGIdentifier, 16-bit user seed, and 40-bit time step.
Different Counters will be provided:
- The general counter will take 3 32-bit values.
- Another will take 2 40-bit particle tags and a 16-bit counter value.
- Specialized Counter subclasses can be written for specific cases (e.g. several 8/16-bit values mixed with some 32-bit values in some HPMC use-cases, or the full 4-byte counter initialization use-case I mentioned above)

Separating the Counter bit mixing code from the location where the independent counter values are assigned will make the code cleaner, easier to understand, and promote code re-use across all the places in HOOMD that use RandomGenerator.
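A rough Python sketch of how such Seed and Counter types might pack their fields. The field widths come from the list above (8 + 16 + 40 = 64 bits for the seed, and 3 x 32 = 2 x 40 + 16 = 96 bits of user-supplied counter values). The bit positions, the class name `TagPairCounter`, and the `_check` helper are assumptions for illustration, not the actual HOOMD API.

```python
def _check(value, bits):
    """Raise ValueError unless value fits in an unsigned field of `bits` bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} bits")


class Seed:
    """64-bit key: [rng_id:8][user_seed:16][timestep:40] (assumed order)."""

    def __init__(self, rng_id, user_seed, timestep):
        _check(rng_id, 8)
        _check(user_seed, 16)
        _check(timestep, 40)
        self.key = (rng_id << 56) | (user_seed << 40) | timestep


class Counter:
    """General counter holding three 32-bit values."""

    def __init__(self, a=0, b=0, c=0):
        for value in (a, b, c):
            _check(value, 32)
        self.words = (a, b, c)


class TagPairCounter(Counter):
    """Two 40-bit particle tags plus a 16-bit counter value in three words."""

    def __init__(self, tag_i, tag_j, count=0):
        _check(tag_i, 40)
        _check(tag_j, 40)
        _check(count, 16)
        # The low 32 bits of each tag fill one word; the high 8 bits of both
        # tags share the third word with the 16-bit counter value.
        shared = ((tag_i >> 32) << 24) | ((tag_j >> 32) << 16) | count
        super().__init__(tag_i & 0xFFFFFFFF, tag_j & 0xFFFFFFFF, shared)


seed = Seed(rng_id=0xAB, user_seed=0x1234, timestep=5)
pair = TagPairCounter(5_000_000_000, 7, count=3)
```

Keeping the packing inside the Counter subclasses means call sites only pass tags and counter values, and the range checks live in one place.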

mphoward commented 4 years ago

This sounds like a really nice, clear API. One question of clarification:

> The RandomGenerator constructor takes Seed [2-byte] and Counter [4-byte] counter arguments.

What do you mean by 2-byte and 4-byte here? It seems like you need way more bytes to represent each of these (and you have up to 24 bytes of input for philox), but I'm probably just being slow this morning.

joaander commented 4 years ago

I said bytes when I meant to say 32-bit words in many places above. Sorry, was just writing down the ideas I had in a brain dump.

Correction:

The RandomGenerator constructor takes Seed [8-byte] and Counter [16-byte] counter arguments.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 2 years ago

This issue has been automatically closed because it has not had recent activity.

glotzerlab / hoomd-blue