mlbench / mlbench-old

!!! DEPRECATED !!! distributed machine learning benchmark - a public benchmark of distributed ML solvers and frameworks
Apache License 2.0

Replace SQLite with Postgres #25

Closed Panaetius closed 6 years ago

Panaetius commented 6 years ago

Currently, the master stores values in an SQLite DB, which has problems with concurrent writes.

This should be replaced with a postgres instance.
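A minimal sketch of what the swap could look like, assuming the master's API is a Django app (the database name, user, and `postgres` host below are placeholders that would come from the deployment config, not the actual values):

```python
# settings.py (sketch) -- replace the default SQLite backend with Postgres.
# All connection values are placeholders read from the environment.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", "mlbench"),
        "USER": os.environ.get("POSTGRES_USER", "mlbench"),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "postgres"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
    }
}
```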

martinjaggi commented 6 years ago

we might need to store models as training progresses (checkpointing), in addition to a bunch of metadata (timestamp etc). each model can be on the order of 1 GB, so i wonder if a database is the right choice here. for the linear stuff, models will always just be a dense vector. for neural nets, we could think about onnx in the longer term maybe, but it's not necessary at first.

martinjaggi commented 6 years ago

in any case i was thinking concurrent writes should be impossible, since only the master will ever write these statistics, and communication rounds are strictly synchronous, right?

Panaetius commented 6 years ago

The way it is right now, the master node hosts the API & Dashboard. The API exposes metrics and metadata and is used by the Dashboard to fetch the data it displays, but it can also be used independently of the dashboard.

Currently, the things saved are:

E.g. CPU + memory monitoring: [screenshot]

These metrics are collected by independent background jobs that regularly save them to the DB.
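For illustration, such a background collector could look roughly like this; `psutil` does the sampling and `save_metric` is a hypothetical helper that writes one row to the metrics table:

```python
import time
import psutil

def collect_node_metrics(save_metric, interval=10):
    """Periodically sample CPU and memory usage and persist it.

    `save_metric(name, value, timestamp)` is a hypothetical helper that
    writes a single metric row to the database.
    """
    while True:
        now = time.time()
        save_metric("cpu_percent", psutil.cpu_percent(interval=None), now)
        save_metric("memory_percent", psutil.virtual_memory().percent, now)
        time.sleep(interval)
```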

Workers can also send arbitrary scalars to the API (e.g. accuracy), which could be visualized in the future as well. These can be concurrent writes, for instance if pod1 and pod2 send their local accuracy at the same time.
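As a sketch, a worker pushing a scalar could be a plain HTTP POST; the `/api/metrics/` path and the payload fields are assumptions for illustration, not the actual API:

```python
import time
import requests

def post_scalar(master_url, run_id, pod_name, name, value):
    """Send one scalar metric (e.g. local accuracy) to the master API.

    The endpoint path and field names are illustrative assumptions. Two
    pods calling this at the same time is exactly the concurrent-write
    case that SQLite struggles with.
    """
    requests.post(
        f"{master_url}/api/metrics/",
        json={
            "run_id": run_id,
            "pod": pod_name,
            "name": name,
            "value": value,
            "timestamp": time.time(),
        },
        timeout=5,
    )
```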

Generally, in an all-reduce setting only the main pod will send metrics to the API, but those writes are still concurrent with the different background jobs.

Additionally, each Run Configuration will be saved to the DB with all its metadata and linked to the collected metrics, so it's easier to keep track of experiments.
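A rough idea of the relational layout this implies, written as Django models (model and field names are made up for illustration, not the actual mlbench schema):

```python
# models.py (sketch) -- one row per experiment run, many metric rows per run.
from django.db import models

class Run(models.Model):
    name = models.CharField(max_length=256)
    created_at = models.DateTimeField(auto_now_add=True)
    # Arbitrary run configuration (image, number of workers, ...).
    # JSONField needs Django 3.1+ or the Postgres-specific JSONField.
    config = models.JSONField(default=dict)

class Metric(models.Model):
    run = models.ForeignKey(Run, related_name="metrics", on_delete=models.CASCADE)
    pod = models.CharField(max_length=256)
    name = models.CharField(max_length=256)
    value = models.FloatField()
    timestamp = models.DateTimeField()
```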

This doesn't really concern storing checkpoints/large tensors for now. The DB might simply contain a reference to some other storage location in the future. A REST API is also not really feasible for transmitting large files/BLOBs.

I.e. a separate solution is needed for that, but a DB is still necessary for the other, simpler data.
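A sketch of that split: the large tensor goes to shared storage and only a path string goes into the DB. The directory layout and the use of `torch.save` here are illustrative assumptions, not a decided design:

```python
import os
import torch  # a dense linear model could equally be saved with numpy

def save_checkpoint(run_id, step, model_state, checkpoint_dir="/checkpoints"):
    """Write the (potentially ~1 GB) model state to shared storage and
    return the path to record in the database instead of the blob itself.
    Paths and naming are illustrative only.
    """
    path = os.path.join(checkpoint_dir, f"run-{run_id}-step-{step}.pt")
    torch.save(model_state, path)
    return path  # store this string in a DB column, not the tensor
```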