h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

improve column types detection in CSV parser #7687

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Given the infamous titanic dataset:

{code:python}h2o.import_file("smalldata/gbm_test/titanic.csv"){code}

produces:

{noformat}
pclass  survived  name                                             sex     age        sibsp    parch    ticket  fare      cabin    embarked      boat    body  home.dest
   1           1  Allen  Miss. Elisabeth Walton                    female  29             0        0     24160  211.338   B5       S                2     nan  St Louis  MO
   1           1  Allison  Master. Hudson Trevor                   male     0.9167        1        2    113781  151.55    C22 C26  S               11     nan  Montreal  PQ / Chesterville  ON
   1           0  Allison  Miss. Helen Loraine                     female   2             1        2    113781  151.55    C22 C26  S              nan     nan  Montreal  PQ / Chesterville  ON
   1           0  Allison  Mr. Hudson Joshua Creighton             male    30             1        2    113781  151.55    C22 C26  S              nan     135  Montreal  PQ / Chesterville  ON
   1           0  Allison  Mrs. Hudson J C (Bessie Waldo Daniels)  female  25             1        2    113781  151.55    C22 C26  S              nan     nan  Montreal  PQ / Chesterville  ON
   1           1  Anderson  Mr. Harry                              male    48             0        0     19952   26.55    E12      S                3     nan  New York  NY
   1           1  Andrews  Miss. Kornelia Theodosia                female  63             1        0     13502   77.9583  D7       S               10     nan  Hudson  NY
   1           0  Andrews  Mr. Thomas Jr                           male    39             0        0    112050    0       A36      S              nan     nan  Belfast  NI
   1           1  Appleton  Mrs. Edward Dale (Charlotte Lamson)    female  53             2        0     11769   51.4792  C101     S              nan     nan  Bayside  Queens  NY
   1           0  Artagaveytia  Mr. Ramon                          male    71             0        0       nan   49.5042           C              nan      22  Montevideo  Uruguay{noformat}

Compare with pandas:

{code:python}import pandas as pd
pd.read_csv("smalldata/gbm_test/titanic.csv").iloc[0:10, :]{code}

{noformat}
   pclass  survived  name                                            sex     age      sibsp  parch  ticket    fare      cabin    embarked  boat  body   home.dest
0  1       1         Allen Miss. Elisabeth Walton                    female  29.0000  0      0      24160     211.3375  B5       S         2     NaN    St Louis MO
1  1       1         Allison Master. Hudson Trevor                   male    0.9167   1      2      113781    151.5500  C22 C26  S         11    NaN    Montreal PQ / Chesterville ON
2  1       0         Allison Miss. Helen Loraine                     female  2.0000   1      2      113781    151.5500  C22 C26  S         NaN   NaN    Montreal PQ / Chesterville ON
3  1       0         Allison Mr. Hudson Joshua Creighton             male    30.0000  1      2      113781    151.5500  C22 C26  S         NaN   135.0  Montreal PQ / Chesterville ON
4  1       0         Allison Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000  1      2      113781    151.5500  C22 C26  S         NaN   NaN    Montreal PQ / Chesterville ON
5  1       1         Anderson Mr. Harry                              male    48.0000  0      0      19952     26.5500   E12      S         3     NaN    New York NY
6  1       1         Andrews Miss. Kornelia Theodosia                female  63.0000  1      0      13502     77.9583   D7       S         10    NaN    Hudson NY
7  1       0         Andrews Mr. Thomas Jr                           male    39.0000  0      0      112050    0.0000    A36      S         NaN   NaN    Belfast NI
8  1       1         Appleton Mrs. Edward Dale (Charlotte Lamson)    female  53.0000  2      0      11769     51.4792   C101     S         D     NaN    Bayside Queens NY
9  1       0         Artagaveytia Mr. Ramon                          male    71.0000  0      0      PC 17609  49.5042   NaN      C         NaN   22.0   Montevideo Uruguay
{noformat}

The two parsers do not detect the same column types:

h2o:

{noformat}{'pclass': 'int', 'survived': 'int', 'name': 'string', 'sex': 'enum', 'age': 'real', 'sibsp': 'int', 'parch': 'int', 'ticket': 'int', 'fare': 'real', 'cabin': 'enum', 'embarked': 'enum', 'boat': 'int', 'body': 'int', 'home.dest': 'enum'}{noformat}

pandas:

{noformat}pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object{noformat}
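
The two listings above can be compared mechanically to see which columns one parser treats as numeric and the other does not. The following is a minimal sketch, not part of either library: the helper name {{numeric_disagreements}} is hypothetical, and grouping {{int}}/{{real}} (H2O) and {{int64}}/{{float64}} (pandas) as "numeric" is an assumption made for illustration.

```python
# Column types copied from the two listings above.
h2o_types = {
    "pclass": "int", "survived": "int", "name": "string", "sex": "enum",
    "age": "real", "sibsp": "int", "parch": "int", "ticket": "int",
    "fare": "real", "cabin": "enum", "embarked": "enum", "boat": "int",
    "body": "int", "home.dest": "enum",
}
pandas_dtypes = {
    "pclass": "int64", "survived": "int64", "name": "object", "sex": "object",
    "age": "float64", "sibsp": "int64", "parch": "int64", "ticket": "object",
    "fare": "float64", "cabin": "object", "embarked": "object", "boat": "object",
    "body": "float64", "home.dest": "object",
}

# Assumption: these are the type names each parser uses for numeric columns.
H2O_NUMERIC = {"int", "real"}
PANDAS_NUMERIC = {"int64", "float64"}


def numeric_disagreements(h2o, pandas):
    """Return columns that exactly one of the two parsers treats as numeric."""
    return [
        col for col in h2o
        if (h2o[col] in H2O_NUMERIC) != (pandas[col] in PANDAS_NUMERIC)
    ]


mismatches = numeric_disagreements(h2o_types, pandas_dtypes)
print(mismatches)  # ['ticket', 'boat']
```

Differences such as {{int}} versus {{float64}} for {{body}} are ignored here on purpose, since both sides still treat the column as numeric.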

Note that two columns, {{ticket}} and {{boat}}, are correctly interpreted as {{object}} (i.e. {{enum}}) by pandas, but as numeric ({{int}}) by H2O-3. This leads to data loss, as the samples above show: with H2O, non-numeric values in these columns (for example ticket {{PC 17609}} and boat {{D}} in the last two rows) are converted to NaN.
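
One possible stricter rule is to treat a column as numeric only if every non-missing value parses as a number. The sketch below uses only the Python standard library and is not H2O's parser logic. The helper names, the set of missing-value tokens, and the small inline sample (values taken from the rows shown above) are all assumptions made for illustration.

```python
import csv
import io

# Assumption: tokens treated as missing values for this sketch.
MISSING = {"", "NA", "NaN", "nan"}


def is_number(token):
    """True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def lossy_numeric_columns(csv_text):
    """Return columns that mix numeric and non-numeric (non-missing) values.

    Forcing such a column to a numeric type would turn the non-numeric
    values into missing values.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: [False, False] for name in reader.fieldnames}  # [numeric, other]
    for row in reader:
        for name, token in row.items():
            token = (token or "").strip()
            if token in MISSING:
                continue
            seen[name][0 if is_number(token) else 1] = True
    return [name for name, (num, other) in seen.items() if num and other]


# Hypothetical excerpt: ticket, boat and body values from four of the rows above.
sample = """ticket,boat,body
24160,2,
113781,11,
11769,D,
PC 17609,,22
"""

# Columns that should not be parsed as numeric, mapped to a categorical type.
overrides = {name: "enum" for name in lossy_numeric_columns(sample)}
print(overrides)  # {'ticket': 'enum', 'boat': 'enum'}
```

As a workaround until detection improves, a dictionary like {{overrides}} could be passed to {{h2o.import_file}} through its {{col_types}} argument, for example {{h2o.import_file(path, col_types=overrides)}}. This has not been verified against this dataset here.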

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7961
Assignee: Michal Kurka
Reporter: Sebastien Poirier
State: Open
Fix Version: Backlog
Attachments: N/A
Development PRs: N/A