nusdbsystem / singa-auto

A platform to automate the training (including hyper-parameter tuning) and inference of machine learning models
Apache License 2.0

Storage #13

Open nudles opened 4 years ago

nudles commented 4 years ago

We have the following types of data to store:

  1. dataset
  2. model (file or folder)
  3. model checkpoint
  4. query
  5. user
  6. job
  7. logs

We need to decide on a storage system for each type of data, e.g., a database (PostgreSQL), a key-value store (Redis), NFS, a distributed file system, or cloud storage (a candidate mapping is sketched below).
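
For discussion only, a minimal sketch of one possible data-type-to-backend mapping; every choice below is an assumption to debate, not a decision:

```python
# Hypothetical mapping of each data type to a candidate backend.
# None of these choices are final -- this is just to frame the discussion.
STORAGE_BACKENDS = {
    "dataset":          "s3",          # large binary blobs; object/file storage
    "model":            "s3",          # file or folder, same reasoning as datasets
    "model_checkpoint": "s3",          # large and written periodically
    "query":            "postgresql",  # structured records, frequently filtered
    "user":             "postgresql",  # small relational records
    "job":              "postgresql",  # job metadata; status is queried often
    "logs":             "nfs",         # append-only text; a DB is also possible
}

def backend_for(data_type: str) -> str:
    """Return the candidate backend for a given data type."""
    return STORAGE_BACKENDS[data_type]
```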

NLGithubWP commented 4 years ago

One suggestion: large data (datasets, model checkpoints) may not be suitable for storing in the database, since the admin services issue many database queries, and if the data is big those queries will be very slow. We could instead use a distributed file system (e.g., Hadoop HDFS) or cloud storage (e.g., S3).
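
As an illustration of the object-storage route, a minimal sketch using boto3; the bucket name and file paths are hypothetical placeholders:

```python
# Sketch: upload a large dataset archive to object storage instead of the DB.
import boto3

s3 = boto3.client("s3")

# upload_file streams the file in chunks (multipart upload under the hood),
# so the whole archive never has to fit in memory.
s3.upload_file(
    Filename="datasets/cifar10.zip",   # local path (hypothetical)
    Bucket="singa-auto-datasets",      # bucket name (hypothetical)
    Key="cifar10.zip",
)

# The database row for the dataset then only stores the S3 key,
# keeping admin-service queries fast.
```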

nudles commented 4 years ago

When the uploaded data file (zip) is large, there can be an out-of-memory issue while decompressing it after the upload.
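
One way to avoid that is to extract member-by-member with streaming copies, so decompression never holds a whole file in memory. A sketch with Python's standard zipfile module (paths are hypothetical, and a real implementation should also sanitize member names against zip-slip):

```python
# Sketch: stream-extract a large zip so no file is loaded fully into memory.
import shutil
import zipfile
from pathlib import Path

def extract_streaming(zip_path: str, dest_dir: str, chunk_size: int = 1 << 20) -> None:
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = dest / info.filename  # NOTE: no zip-slip guard in this sketch
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            # zf.open() returns a file-like object; copyfileobj copies it
            # in fixed-size chunks instead of reading everything at once.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)

extract_streaming("uploads/dataset.zip", "data/extracted")
```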

nudles commented 4 years ago

Are there any distributed file systems optimized for machine learning tasks? Can we deploy S3 in our own cluster?

NLGithubWP commented 4 years ago

I think we can.
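
Strictly speaking, S3 itself is an AWS-hosted service, but S3-compatible object stores can run in a private cluster, and clients only need to point at a custom endpoint. A sketch with boto3; the endpoint URL, credentials, and bucket name are hypothetical placeholders:

```python
# Sketch: talk to a self-hosted, S3-compatible object store via boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://object-store.cluster.local:9000",  # in-cluster endpoint (hypothetical)
    aws_access_key_id="ACCESS_KEY",        # placeholder credential
    aws_secret_access_key="SECRET_KEY",    # placeholder credential
)

# The same S3 API calls then work against the self-hosted store.
s3.create_bucket(Bucket="singa-auto-checkpoints")
s3.upload_file("checkpoints/epoch_10.ckpt", "singa-auto-checkpoints", "epoch_10.ckpt")
```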

chrishkchris commented 4 years ago

@naili-xing FYI, you may also be interested in HopsFS as a distributed file system; it can be considered an improved design of HDFS. Hopsworks uses it as the distributed file system for its machine learning platform.

Github: https://github.com/hopshadoop/hops

Paper: https://www.usenix.org/system/files/conference/fast17/fast17-niazi.pdf

Their ML platform Hopsworks: https://www.logicalclocks.com/

chrishkchris commented 4 years ago

In addition to the above comment, you may be interested in how Hopsworks utilizes HopsFS for its metadata store. I suggest studying this code as a reference: https://github.com/logicalclocks/hopsworks