stas00 / ml-engineering

Machine Learning Engineering Open Book
https://stasosphere.com/machine-learning/
Creative Commons Attribution Share Alike 4.0 International

Conflicting opinions about streaming data from cloud storage? #30

Closed: hacobe closed this issue 6 months ago

hacobe commented 6 months ago

(1) and (2) seem to express different opinions:

1) In the "3 Machine Learning IO needs" section, one of the bullet points under "Incoming suggestions from Ross Wightman to integrate" is "Note that once your datasets are optimally friendly for a large, distributed network filesystem, they can usually just be streamed from bucket storage in cloud systems that have that option. So better to move them off the network filesystem in that case."

2) The section "Local storage beats cloud storage" starts with "While cloud storage is cheaper the whole idea of fetching and processing your training data stream dynamically at training time is very problematic with a huge number of issues around it...It’s so much better to have enough disk space locally for data loading."

What am I missing?

stas00 commented 6 months ago

Thank you very much for pointing out the incongruity, @hacobe - I have fixed it here.

But basically these are 2 different opinions by 2 different people. I moved Ross' suggestions to incoming so that I could integrate them properly later. I shouldn't have dumped them into the main text as is.

Bottom line: I have yet to find a good streaming solution, and that's my experience. Ross seems to have had a working streaming solution, but we have been doing very different things, so possibly both views are valid.

hacobe commented 6 months ago

Thanks!