Given the infamous Titanic dataset:
{code:python}h2o.import_file("smalldata/gbm_test/titanic.csv"){code}
produces:
{noformat}
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
1 1 Allen Miss. Elisabeth Walton female 29 0 0 24160 211.338 B5 S 2 nan St Louis MO
1 1 Allison Master. Hudson Trevor male 0.9167 1 2 113781 151.55 C22 C26 S 11 nan Montreal PQ / Chesterville ON
1 0 Allison Miss. Helen Loraine female 2 1 2 113781 151.55 C22 C26 S nan nan Montreal PQ / Chesterville ON
1 0 Allison Mr. Hudson Joshua Creighton male 30 1 2 113781 151.55 C22 C26 S nan 135 Montreal PQ / Chesterville ON
1 0 Allison Mrs. Hudson J C (Bessie Waldo Daniels) female 25 1 2 113781 151.55 C22 C26 S nan nan Montreal PQ / Chesterville ON
1 1 Anderson Mr. Harry male 48 0 0 19952 26.55 E12 S 3 nan New York NY
1 1 Andrews Miss. Kornelia Theodosia female 63 1 0 13502 77.9583 D7 S 10 nan Hudson NY
1 0 Andrews Mr. Thomas Jr male 39 0 0 112050 0 A36 S nan nan Belfast NI
1 1 Appleton Mrs. Edward Dale (Charlotte Lamson) female 53 2 0 11769 51.4792 C101 S nan nan Bayside Queens NY
1 0 Artagaveytia Mr. Ramon male 71 0 0 nan 49.5042 C nan 22 Montevideo Uruguay
{noformat}
Compare with pandas:
{code:python}pd.read_csv("smalldata/gbm_test/titanic.csv").iloc[0:10, :]{code}
{noformat}
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis MO
1 1 1 Allison Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal PQ / Chesterville ON
2 1 0 Allison Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal PQ / Chesterville ON
3 1 0 Allison Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal PQ / Chesterville ON
4 1 0 Allison Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal PQ / Chesterville ON
5 1 1 Anderson Mr. Harry male 48.0000 0 0 19952 26.5500 E12 S 3 NaN New York NY
6 1 1 Andrews Miss. Kornelia Theodosia female 63.0000 1 0 13502 77.9583 D7 S 10 NaN Hudson NY
7 1 0 Andrews Mr. Thomas Jr male 39.0000 0 0 112050 0.0000 A36 S NaN NaN Belfast NI
8 1 1 Appleton Mrs. Edward Dale (Charlotte Lamson) female 53.0000 2 0 11769 51.4792 C101 S D NaN Bayside Queens NY
9 1 0 Artagaveytia Mr. Ramon male 71.0000 0 0 PC 17609 49.5042 NaN C NaN 22.0 Montevideo Uruguay
{noformat}
The 2 parsers don’t detect the same column types:
h2o:
{noformat}{'pclass': 'int', 'survived': 'int', 'name': 'string', 'sex': 'enum', 'age': 'real', 'sibsp': 'int', 'parch': 'int', 'ticket': 'int', 'fare': 'real', 'cabin': 'enum', 'embarked': 'enum', 'boat': 'int', 'body': 'int', 'home.dest': 'enum'}{noformat}
pandas:
{noformat}
pclass int64
survived int64
name object
sex object
age float64
sibsp int64
parch int64
ticket object
fare float64
cabin object
embarked object
boat object
body float64
home.dest object
dtype: object
{noformat}
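The two type listings can also be compared mechanically. The sketch below is only an illustration: the dictionaries are copied by hand from the outputs above, and the helper name {{numeric_only_in_h2o}} and the two "numeric" sets are assumptions introduced for this example. On a live session the same dictionaries could presumably be taken from the frame's {{types}} property and from {{df.dtypes}}.

```python
# Column types copied from the H2O and pandas outputs above.
h2o_types = {
    'pclass': 'int', 'survived': 'int', 'name': 'string', 'sex': 'enum',
    'age': 'real', 'sibsp': 'int', 'parch': 'int', 'ticket': 'int',
    'fare': 'real', 'cabin': 'enum', 'embarked': 'enum', 'boat': 'int',
    'body': 'int', 'home.dest': 'enum',
}
pandas_dtypes = {
    'pclass': 'int64', 'survived': 'int64', 'name': 'object', 'sex': 'object',
    'age': 'float64', 'sibsp': 'int64', 'parch': 'int64', 'ticket': 'object',
    'fare': 'float64', 'cabin': 'object', 'embarked': 'object', 'boat': 'object',
    'body': 'float64', 'home.dest': 'object',
}

# Assumption for this sketch: which type names count as "numeric" on each side.
H2O_NUMERIC = {'int', 'real'}
PANDAS_NUMERIC = {'int64', 'float64'}


def numeric_only_in_h2o(h2o_types, pandas_dtypes):
    """Return columns that H2O parsed as numeric but pandas kept as non-numeric."""
    return [
        col for col, h2o_type in h2o_types.items()
        if h2o_type in H2O_NUMERIC and pandas_dtypes.get(col) not in PANDAS_NUMERIC
    ]


suspects = numeric_only_in_h2o(h2o_types, pandas_dtypes)
print(suspects)  # ['ticket', 'boat']
```

Columns such as {{body}} ({{int}} in H2O, {{float64}} in pandas) are not reported, because both parsers agree that they are numeric.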
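Note that two columns, {{ticket}} and {{boat}}, are correctly interpreted as {{object}} (i.e. the equivalent of {{enum}}) by pandas, but as numerical ({{int}}) by H2O-3. This leads to data loss, as the samples above show: with H2O, non-numerical values in these columns are converted to NaN. For example, the boat {{D}} (Appleton row) and the ticket {{PC 17609}} (Artagaveytia row) both appear as {{nan}} in the H2O output.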
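To make the effect concrete, here is a toy illustration of what happens when a column that contains a few text values is forced to a numeric type. It is not H2O's parsing logic, only a stand-in that assumes unparseable text becomes NaN; the helper names {{coerce_to_number}} and {{lost_values}} are made up for this sketch, and the column values are copied from the ten pandas rows above ({{None}} marks a value that is already missing in the CSV).

```python
import math


def coerce_to_number(value):
    """Toy stand-in for a numeric column parser: unparseable text becomes NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def lost_values(raw_column):
    """Return raw values that were present in the input but became NaN."""
    return [
        value for value in raw_column
        if value not in (None, "") and math.isnan(coerce_to_number(value))
    ]


# 'ticket' and 'boat' for the ten sample rows, as read by pandas.
ticket = ["24160", "113781", "113781", "113781", "113781",
          "19952", "13502", "112050", "11769", "PC 17609"]
boat = ["2", "11", None, None, None, "3", "10", None, "D", None]

print(lost_values(ticket))  # ['PC 17609']
print(lost_values(boat))    # ['D']
```

Even in this ten-row sample, each of the two columns loses one real value, which matches the {{nan}} entries in the H2O output.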