I'd like to suggest a strategic feature to H2O users, the H2O staff, and Arno. Big-data streaming for deep learning is going to be a great upgrade. However, putting streaming capability into deep learning only is not really enough. That's necessary but not sufficient, IMHO.
Streaming needs to be more general than deep learning alone. Linear regression and other linear models are still very important with big data. Streaming is going to be a key enabler for simply being able to complete a training session once the number of examples gets big.
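To make concrete what I mean by streaming, here is a minimal sketch of out-of-core training for a linear model using the Python stack I fall back on today; the file name, chunk size, and the "response" column are placeholders, not anything H2O provides yet.

```python
# Minimal sketch of streaming (out-of-core) training of a linear model.
# The file name, chunk size, and "response" column are placeholders.
import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(penalty="l2")  # a plain linear model fitted incrementally by SGD

# Read the big CSV in chunks so only one chunk is ever resident in RAM.
for chunk in pd.read_csv("weblogs.csv", chunksize=1_000_000):
    X = chunk.drop(columns=["response"]).to_numpy()
    y = chunk["response"].to_numpy()
    model.partial_fit(X, y)  # incremental update; no need to preload all rows
```

That is all streaming means here: the learner sees the data one slice at a time, and memory stays flat no matter how many rows arrive.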
But what data scientist really wants deep learning to be literally the only tool in the toolbox that can handle a big-data project with H2O? I need flexibility in my choice of data science algorithm. Different data characteristics tend to work better with different learning algorithms, and a huge number of examples alone should not be the sole criterion that limits me to the deep learning algorithm on a project.
A truly big number of examples in a dataset generally reduces the variance of the fitted model automatically, which is great news. A linear GLM will generalize very nicely at such a large number of examples, perhaps even without regularization. Weblogs are the kind of dataset I have in mind: typically just dozens of columns, but hundreds of millions of rows, or a billion, are very commonly and easily encountered.
Let me share an experience. Several years ago at a bank, I was the consulting lead engineer on the parallelism re-engineering of a Fortune 500 web statistics server system. The web logs were too big, and there were too many of them, for the original sequential code to keep up with the 24-hour processing cycle. We kept getting more clients who wanted to use our webstats system, so the slowness was only going to get worse if something wasn't fixed. We (my client) were victims of our own success.
Web logs like these are a real-world big-data case for business enterprises of all kinds. It's mundane, not as cool as computer vision, but it's very real.
I am less than happy if you tell me that H2O demands, and will always demand, strictly a neural network model for all my big weblogs. I really should be able to use GLRM and GLM on my big data with a streaming-input paradigm, not the preload-all-data paradigm.
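For contrast, this is roughly how a GLM fit looks today under the preload-all-data paradigm, using the H2O Python client; the path and the "response" column name are placeholders.

```python
# Roughly how a GLM fit looks today: the whole frame is imported into
# cluster memory before training starts. Path and column names are placeholders.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
frame = h2o.import_file("weblogs.csv")   # entire dataset must fit in cluster memory

predictors = [c for c in frame.columns if c != "response"]
glm = H2OGeneralizedLinearEstimator(family="gaussian", lambda_search=True)
glm.train(x=predictors, y="response", training_frame=frame)
```

Everything after the import is exactly what I want to keep; it's the requirement that the whole frame fit in memory up front that doesn't scale with my weblogs.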
I literally can't keep adding computers and RAM as fast as the weblogs keep growing. Probably nobody can, not even Amazon; well, maybe Amazon and Google can, but I sure can't afford that. I recently got a big Amazon EC2 bill, and after it I had to abandon EC2 and buy my own hardware again. It's cheaper for me to buy two-year-old hardware with lots of RAM and Xeons than to pay Amazon $200/month/host for EC2.
And so now, on different projects (the webstats project at the bank was completed long ago), I have fixed (non-EC2) headroom again with my hardware -- I need streaming input! If necessary -- and I don't really want to, because H2O works great when it works -- I will go back to writing my own streaming-input code with sparse matrices and read_csv over hand-partitioned pieces of my original big dataset, using python / sklearn / pandas / multiprocessing. The limited breadth of H2O's feature set is one reason I still also use the traditional Python stack: for the variety of clustering algorithms and distance metrics (Jaccard, Hamming, cosine similarity), for other learning algorithms, and for approximate algorithms like LSHForest (whose implementation needs improving). I commonly split up big data files with bash shell code just to get from pandas the level of file-reading parallelism that h2o.importFile() delivers so well, so easily, and better than anything else. I wish I could just keep using H2O all the time, even if only for CSV file reading!
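Here is a sketch of that manual workaround: read hand-partitioned CSV pieces in parallel, one worker per file, then concatenate. The file pattern is a placeholder, and the partitions are assumed to come from a bash split of the original file.

```python
# Manual workaround: parse hand-partitioned CSV pieces in parallel,
# one worker process per file, then concatenate the results.
import glob
from multiprocessing import Pool

import pandas as pd

def read_part(path):
    # Each worker parses one partition of the original big file.
    return pd.read_csv(path)

if __name__ == "__main__":
    parts = sorted(glob.glob("weblogs_part_*.csv"))  # pieces made with bash `split`
    with Pool() as pool:
        frames = pool.map(read_part, parts)
    df = pd.concat(frames, ignore_index=True)
```

This works, but it is exactly the kind of plumbing I would rather have H2O do for me.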
I have written too much computer-sciencey code, and wasted too much time there, just to get the needed parallelism and big-data capability on hardware that always runs out eventually.
I would rather be writing data-sciencey code only. It would be MUCH better for me as a data scientist if H2O's staff did the computer-science coding for me, especially the parallelism and the big-data handling.
Streaming input is exactly that kind of computer-science code, and it is important for conquering big-data ingestion of all kinds, not just photographs fed to a deep neural network.
Also, approximate algorithms are important for big data. LSH is one such algorithm family, and it matters because distance computation is at the heart of almost all clustering algorithms. LSHForest in sklearn is a rough and slow implementation, but it shows the way forward: instead of all pairs, we compute only the good, near pairs, and we no longer need to populate an entire distance matrix. That is a major speedup in finding close row vectors for clustering purposes. But I had to write all my own parallelism to keep the other cores busy computing that sparse, near-pairs (not all-pairs) distance matrix of near neighbors. H2O's MapReduce infrastructure may be the ideal platform for sparse, approximate, near-pairs (not all-pairs) row-vector distance matrix generation.
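A minimal sketch of the near-pairs idea, using scikit-learn's exact k-NN as a stand-in for an LSH index (the data here is synthetic; on truly big data the exact search would be replaced by an approximate one):

```python
# Build a sparse distance matrix holding only each row's nearest neighbors,
# instead of the dense all-pairs matrix. Synthetic data for illustration;
# an LSH-based index would replace the exact k-NN search at real scale.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((50_000, 30))            # e.g. 50k weblog-like row vectors, 30 columns

nn = NearestNeighbors(n_neighbors=10, metric="cosine", n_jobs=-1).fit(X)
sparse_dist = nn.kneighbors_graph(X, mode="distance")  # scipy CSR, ~10 entries per row
# sparse_dist never materializes the full 50k x 50k distance matrix.
```

Distributing exactly this kind of near-pairs computation is where H2O's MapReduce machinery could shine.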
Thanks for your consideration.
Geoffrey Anderson
Reference: https://groups.google.com/forum/#!topic/h2ostream/dO8Lzor2Kg0