braceal opened this issue 4 months ago
If we add configurable parameters that specify the number of training workers and the number of inference workers, we can scale up more efficiently by running multiple training jobs concurrently while always serving inference from the latest set of weights.
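A minimal sketch of what such a config might look like. The class and field names (`WorkerConfig`, `num_training_workers`, `num_inference_workers`) are hypothetical, not from the project:

```python
from dataclasses import dataclass


# Hypothetical config sketch: field names are illustrative only.
@dataclass
class WorkerConfig:
    num_training_workers: int = 1   # concurrent training jobs
    num_inference_workers: int = 1  # workers serving the latest weights

    def validate(self) -> None:
        # Both pools need at least one worker to make progress.
        if self.num_training_workers < 1 or self.num_inference_workers < 1:
            raise ValueError("worker counts must be >= 1")


cfg = WorkerConfig(num_training_workers=4, num_inference_workers=2)
cfg.validate()
```

Exposing these as parameters rather than hard-coding them would let the same workflow run on a single node or scale out across a cluster.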
When simulations are very fast (or run for short time scales), they can generate data faster than training and inference can consume it. In that case, it may be useful to have an option that dedicates a GPU to inference so that we are always acting on the latest data.
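One way to express the dedicated-GPU option, assuming a CUDA setup where processes are isolated via `CUDA_VISIBLE_DEVICES`. The `dedicated_inference_gpu` flag and `inference_env` helper are hypothetical names for illustration:

```python
import os
from dataclasses import dataclass
from typing import Optional


# Hypothetical flag: a GPU index reserved exclusively for inference.
@dataclass
class InferenceConfig:
    dedicated_inference_gpu: Optional[int] = None


def inference_env(cfg: InferenceConfig) -> dict:
    """Build the environment for the inference process.

    If a dedicated GPU is configured, pin the process to it via
    CUDA_VISIBLE_DEVICES so it never contends with training jobs.
    """
    env = dict(os.environ)
    if cfg.dedicated_inference_gpu is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(cfg.dedicated_inference_gpu)
    return env


env = inference_env(InferenceConfig(dedicated_inference_gpu=1))
```

With the flag unset, inference would share GPUs with training as it does now, so the default behavior is unchanged.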