nudles opened this issue 4 years ago
One suggestion: for large data (datasets, model checkpoints), storing the content directly in the database may not be suitable. The admin services issue many database queries, and if the stored data is big those queries become very slow. We could instead use a distributed file system (e.g., Hadoop HDFS) or cloud storage (e.g., S3).
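As a rough illustration of the "metadata in the DB, bulk data outside" split, here is a minimal sketch. It uses SQLite and a local directory purely as stand-ins for the real metadata database and the distributed file system / object store; `BLOB_DIR`, `register_artifact`, and the table layout are illustrative, not part of the current design:

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path

# Stand-ins: in the real system BLOB_DIR would be an NFS/HDFS mount or an
# object-store prefix, and the SQLite file would be the PostgreSQL metadata DB.
BLOB_DIR = Path("./blobs")
db = sqlite3.connect("metadata.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS artifact "
    "(id INTEGER PRIMARY KEY, name TEXT, uri TEXT, sha256 TEXT, size_bytes INTEGER)"
)

def _sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash in chunks so large files are never fully loaded into memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def register_artifact(name: str, src: Path) -> int:
    """Copy the large file into blob storage; store only its URI and checksum in the DB."""
    BLOB_DIR.mkdir(parents=True, exist_ok=True)
    digest = _sha256(src)
    dst = BLOB_DIR / f"{digest}_{src.name}"
    shutil.copyfile(src, dst)
    cur = db.execute(
        "INSERT INTO artifact (name, uri, sha256, size_bytes) VALUES (?, ?, ?, ?)",
        (name, str(dst), digest, src.stat().st_size),
    )
    db.commit()
    return cur.lastrowid
```

This keeps database rows small (a path/URI plus a checksum), so admin-service queries stay fast even when the underlying artifacts are gigabytes in size.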
When the uploaded data file (zip) is large, decompressing it after upload could cause an out-of-memory issue.
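One way to reduce that risk is to stream each archive member to disk in fixed-size chunks instead of materializing whole entries in memory. A minimal sketch using only the Python standard library (the function name and chunk size are just for illustration):

```python
import shutil
import zipfile
from pathlib import Path

def extract_zip_streaming(zip_path: str, dest_dir: str, chunk_size: int = 1 << 20) -> None:
    """Extract each member by streaming it to disk in fixed-size chunks,
    so memory usage stays around chunk_size regardless of archive size."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = dest / info.filename
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            # zf.open() returns a file-like stream that decompresses the
            # member on the fly while it is copied.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, length=chunk_size)
```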
Are there any distributed file systems optimized for machine learning workloads? Can we deploy S3 in our own cluster?
I think we can.
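In practice that would mean running an S3-compatible object store inside the cluster and pointing the S3 client at its endpoint. A minimal sketch with boto3, assuming a hypothetical in-cluster endpoint, bucket name, object keys, and credentials:

```python
import boto3  # assumes the boto3 package is installed

# Hypothetical in-cluster endpoint and credentials for an S3-compatible
# object store; replace with the actual deployment's values.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3.cluster.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="model-checkpoints")            # hypothetical bucket
s3.upload_file("checkpoint.tar.gz", "model-checkpoints",
               "jobs/example/checkpoint.tar.gz")         # hypothetical keys
s3.download_file("model-checkpoints", "jobs/example/checkpoint.tar.gz",
                 "/tmp/checkpoint.tar.gz")
```

Because only the `endpoint_url` differs, the same client code would also work against the public AWS S3 service if we later move off-cluster.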
@naili-xing FYI, you may also be interested in HopsFS as a distributed file system; it can be considered an improved design of HDFS. Hopsworks uses it as the distributed file system for their machine learning platform.
Github: https://github.com/hopshadoop/hops
Paper: https://www.usenix.org/system/files/conference/fast17/fast17-niazi.pdf
Their ML platform Hopsworks: https://www.logicalclocks.com/
In addition to the above comment, you may be interested in how Hopsworks uses HopsFS for its metadata store; I suggest studying this code as a reference: https://github.com/logicalclocks/hopsworks
We have the following types of data to store
We need to decide on a storage system for each type of data, e.g., a relational database (PostgreSQL), a key-value store (Redis), NFS, a distributed file system, or cloud storage.
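Purely as a starting point for the discussion, here is an illustrative mapping from the data types mentioned in this thread to candidate backends; the names and assignments below are assumptions, not decisions:

```python
# Illustrative only: a possible assignment of data types mentioned in this
# thread to candidate backends. Nothing here is decided.
STORAGE_BACKENDS = {
    "dataset":          "distributed file system / S3-compatible object store",
    "model_checkpoint": "distributed file system / S3-compatible object store",
    "metadata":         "relational database (PostgreSQL)",
    "cache":            "key-value store (Redis)",  # assumption: Redis used for caching
}
```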