nicholas-leonard / equanimity

Experimental research for distributed conditional computation

Choose Database Backend #1

Closed — nicholas-leonard closed this issue 11 years ago

nicholas-leonard commented 11 years ago

Discussion

Hyper-optimizing models requires training many configurations of an ML model. The best models for a given experiment should be persisted, which means they need to be stored on a persistent medium such as a hard disk drive. Backup facilities are a plus, since it is not always possible to reproduce the results of a training run for a given hyper-parameter configuration. Training may also occur on different servers that do not share the same (network) file system. Saving models on different file systems makes administering and analysing them a difficult task. This problem can be alleviated by centralizing the persisted (saved) data.

Relational database management systems (RDBMS) like PostgreSQL and MySQL provide a platform that can be used to centralize, analyse and optimize data transactions and persistence. The main downsides are that they require knowledge of SQL and database administration (DBA), as well as an accessible host to serve them on (a server). Accessibility is particularly difficult from academic high-performance computing (HPC) clusters like those provided by Compute Canada and Calcul Quebec: if it is allowed at all, given the inherent security risks of such a policy, it can only be done through SSH tunnels or VPN equivalents.

Nevertheless, the benefits of such a scheme are important. The SQL query language, and its PostgreSQL/MySQL extensions, are very powerful. They allow one to perform backups; to issue SQL queries to the server from multiple threads or processes; to optimize data retrieval, maintenance and storage using indexes, MVCC, compression schemes (TOAST) and auto-vacuum facilities, among others; and to build complex set-manipulation and statistical-estimation queries for analysing the data.
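
As a rough illustration of the kind of centralized schema and analysis query this would enable, here is a minimal sketch using LuaSQL's PostgreSQL driver. The table and column names (hyperparam, training_run, valid_error, etc.) and the connection credentials are hypothetical placeholders, not part of any actual backend design.

```lua
-- Sketch only: LuaSQL is an assumed dependency; credentials, table and
-- column names are placeholders.
local luasql = require "luasql.postgres"

local env = assert(luasql.postgres())
local con = assert(env:connect("equanimity", "user", "password",
                               "localhost", 5432))
con:setautocommit(true)

-- One row per hyper-parameter configuration, one row per training run.
assert(con:execute[[
  CREATE TABLE IF NOT EXISTS hyperparam (
    hyperparam_id SERIAL PRIMARY KEY,
    learning_rate DOUBLE PRECISION,
    num_hidden    INTEGER
  )]])
assert(con:execute[[
  CREATE TABLE IF NOT EXISTS training_run (
    run_id        SERIAL PRIMARY KEY,
    hyperparam_id INTEGER REFERENCES hyperparam,
    valid_error   DOUBLE PRECISION,
    finished_at   TIMESTAMP DEFAULT now()
  )]])

-- The kind of set-manipulation query SQL makes cheap: mean validation
-- error per configuration, best configurations first.
local cur = assert(con:execute[[
  SELECT h.hyperparam_id, h.learning_rate, h.num_hidden,
         avg(r.valid_error) AS mean_valid_error
  FROM hyperparam AS h
  JOIN training_run AS r USING (hyperparam_id)
  GROUP BY h.hyperparam_id, h.learning_rate, h.num_hidden
  ORDER BY mean_valid_error ASC
  LIMIT 10]])

local row = cur:fetch({}, "a")
while row do
  print(row.hyperparam_id, row.mean_valid_error)
  row = cur:fetch(row, "a")
end

cur:close(); con:close(); env:close()
```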

To Do

- Inquire into the accessibility of a remote database from the Compute Canada GPU clusters.
- Consider using a distributed file system (no DB, just files).
- Design the backend.

nicholas-leonard commented 11 years ago

Raul was able to determine that a remote database can easily be made accessible from the Compute Canada clusters using an SSH port-forwarding scheme:

ssh -v -f -o ServerAliveInterval=240 -N -L 5432:localhost:5432 opter.iro.umontreal.ca

This means that using PostgreSQL won't be a problem.
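
For reference, once that tunnel is up the remote server on opter.iro.umontreal.ca answers on localhost:5432 from the cluster node, so a client only ever connects locally. A minimal connection sketch, assuming LuaSQL and placeholder database name and credentials:

```lua
-- Sketch: connect through the forwarded port. LuaSQL is an assumed
-- dependency; database name and credentials are placeholders.
local luasql = require "luasql.postgres"

local env = assert(luasql.postgres())
-- The SSH tunnel makes the remote PostgreSQL server reachable on
-- localhost:5432, so the client never sees the real host.
local con = assert(env:connect("equanimity", "user", "password",
                               "localhost", 5432))

local cur = assert(con:execute("SELECT version()"))
print(cur:fetch())  -- prints the remote server's version string

cur:close(); con:close(); env:close()
```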

nicholas-leonard commented 11 years ago

http://lua-users.org/lists/lua-l/2011-01/msg01002.html

nicholas-leonard commented 11 years ago

PostgreSQL it is! For now... The other option, although more time-consuming, would be to provide these services through a Web Service API. But Postgres is easier for now.
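
To make the trade-off concrete: with the direct PostgreSQL route a training script only needs a driver and an INSERT, whereas the Web Service route would additionally require designing and hosting an HTTP API in front of the store. The sketch below assumes LuaSQL and the same illustrative (hypothetical) schema as in the first comment.

```lua
-- Sketch: persisting a configuration and one training result by talking
-- to Postgres directly. LuaSQL is assumed; names are placeholders.
local luasql = require "luasql.postgres"

local env = assert(luasql.postgres())
local con = assert(env:connect("equanimity", "user", "password",
                               "localhost", 5432))
con:setautocommit(true)

-- Register a hyper-parameter configuration and get its id back.
local cur = assert(con:execute(
  "INSERT INTO hyperparam (learning_rate, num_hidden) " ..
  "VALUES (0.01, 200) RETURNING hyperparam_id"))
local hyperparam_id = cur:fetch()
cur:close()

-- Record the outcome of one training run for that configuration.
assert(con:execute(string.format(
  "INSERT INTO training_run (hyperparam_id, valid_error) VALUES (%s, %f)",
  hyperparam_id, 0.137)))

con:close(); env:close()
```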