h2oai / h2o-3


H2O CSV parser fails handling commas in quoted values #7679

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

The H2O parser fails to load the dataset https://www.openml.org/d/42727 correctly. Two dataset samples are attached for testing:

[^colleges_sample.arff] [^colleges_sample.csv]

Parsing is correct for the leading columns, but at the {{region}} column near the end the parser treats the commas inside the quoted region value as column separators, which breaks all remaining columns:

{noformat}
In [30]: fr = h2o.import_file("/Users/seb/Downloads/colleges_sample.csv", quotechar="'")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%

In [31]: fr.head()
Out[31]:
UNITID school_name city state zip school_webpage latitude longitude admission_rate sat_verbal_midrange sat_math_midrange sat_writing_midrange act_combined_midrange act_english_midrange act_math_midrange act_writing_midrange sat_total_average undergrad_size percent_white percent_black percent_hispanic percent_asian percent_part_time average_cost_academic_year average_cost_programyear tuition(instate) tuition_(out_of_state) spend_per_student faculty_salary percent_part_time_faculty percent_pell_grant completion_rate predominant_degree highest_degree ownership region gender carnegie_basic_classification carnegie_undergraduate carnegie_size religious_affiliation percent_female agege24 faminc mean_earnings_6_years median_earnings_6_years mean_earnings_10_years median_earnings_10_years


100654 'Alabama A & M University' Normal AL 35762 www.aamu.edu/ 34.7834 -86.5685 0.8989 410 400 ? 17.0 17.0 17.0 ? 823.0 4051 0.0279 0.9501 0.0089 0.0022 0.0622 18888 ? 7182 12774 7459 7079 0.8856 0.7115 0.2914 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100663 'University of Alabama at Birmingham' Birmingham AL nan www.uab.edu 33.5022 -86.8092 0.8673 580 585 ? 25.0 26.0 23.0 ? 1146.0 11200 0.5987 0.259 0.0258 0.0518 0.2579 19990 ? 7206 16398 17208 10170 0.9106 0.3505 0.5377 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100690 'Amridge University' Montgomery AL nan www.amridgeuniversity.edu 32.3626 -86.174 nan nan nan ? ? ? ? ? ? 322 0.2919 0.4224 0.0093 0.0031 0.3727 12300 ? 6870 6870 5123 3849 0.6721 0.6839 0.6667 Bachelors Graduate 'Private nonprofit' 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100706 'University of Alabama in Huntsville' Huntsville AL 35899 www.uah.edu 34.7228 -86.6384 0.8062 575 580 ? 26.0 26.0 25.0 ? 1180.0 5525 0.7012 0.131 0.0338 0.0364 0.2395 20306 ? 9192 21506 9352 9341 0.6555 0.3281 0.4835 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100724 'Alabama State University' Montgomery AL nan www.alasu.edu/email/index.aspx 32.3643 -86.2957 0.5125 430 425 ? 17.0 17.0 17.0 ? 830.0 5354 0.0161 0.9285 0.0114 0.0015 0.0902 17400 ? 8720 15656 7393 6557 0.6641 0.8265 0.2517 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100751 'The University of Alabama' Tuscaloosa AL nan www.ua.edu/ 33.2144 -87.5458 0.5655 555 570 540.0 26.0 27.0 25.0 7.0 1171.0 28692 0.7865 0.114 0.0313 0.0112 0.0852 26717 ? 9450 23950 9817 9605 0.7109 0.2107 0.6665 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100760 'Central Alabama Community College' 'Alexander City' AL 35010 www.cacc.edu 32.9244 -85.9465 nan nan nan ? ? ? ? ? ? 1779 0.6785 0.2945 0.0118 0.0022 0.466 12103 ? 4200 7500 5935 5805 0.3871 0.6515 nan Associate Associate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100812 'Athens State University' Athens AL 35611 www.athens.edu 34.8056 -86.9651 nan nan nan ? ? ? ? ? ? 2999 0.7513 0.1064 0.0213 0.0047 0.5502 nan ? nan nan 6176 7672 0.4412 0.4107 nan Bachelors Bachelors Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100830 'Auburn University at Montgomery' Montgomery AL nan www.aum.edu 32.3699 -86.1774 0.8371 nan nan ? 21.0 21.0 20.0 ? 970.0 4322 0.5532 0.3031 0.0079 0.0245 0.3061 16556 ? 8750 24950 6817 7173 0.9262 0.4006 0.2705 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan
100858 'Auburn University' 'Auburn University' AL 36849 www.auburn.edu 32.6002 -85.4924 0.8274 570 595 565.0 27.0 28.0 26.0 7.0 1215.0 19761 0.8543 0.0714 0.0253 0.0213 0.0902 23788 ? 9852 26364 11324 9429 0.878 0.1687 0.6792 Bachelors Graduate Public 'Southeast (AL AR FL GA KY LA nan nan nan nan nan nan nan

[10 rows x 48 columns]
{noformat}
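The truncated `'Southeast (AL` value followed by shifted columns is what a row looks like when it is split on every comma regardless of quoting. A minimal sketch of that failure mode in plain Python follows; it is not H2O's parser code, and the sample row and its region value are made up for illustration rather than copied from the attachment.

```python
# Hypothetical row: three fields, the last one single-quoted with embedded commas.
row = "100654,'Alabama A & M University','Southeast (AL, AR, FL)'"

# Splitting on every comma ignores the quotes, so the quoted
# value is cut into several fragments.
naive_fields = row.split(",")

print(len(naive_fields))  # more fragments than the 3 real fields
print(naive_fields[2])    # the region value, cut off at its first comma
```

Every fragment after the first cut lands in the wrong column, which matches the shifted values and trailing `nan` columns in the output above.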

{{pandas}} also fails to load it by default, but loads it correctly with {{quotechar="'"}}:

{noformat}
In [27]: df = pd.read_csv("/Users/seb/Downloads/colleges_sample.csv", quotechar="'")

In [28]: df
Out[28]:
UNITID school_name ... mean_earnings_10_years median_earnings_10_years
0 100654 Alabama A & M University ... 35300.0 31400.0
1 100663 University of Alabama at Birmingham ... 46300.0 40300.0
2 100690 Amridge University ... 42100.0 38100.0
3 100706 University of Alabama in Huntsville ... 52700.0 46600.0
4 100724 Alabama State University ... 30700.0 27800.0
5 100751 The University of Alabama ... 49100.0 42400.0
6 100760 Central Alabama Community College ... 31400.0 27100.0
7 100812 Athens State University ... 41500.0 39700.0
8 100830 Auburn University at Montgomery ... 36700.0 34800.0
9 100858 Auburn University ... 52100.0 45400.0
10 100937 Birmingham Southern College ... 45300.0 41900.0
11 101028 Chattahoochee Valley Community College ... 31200.0 27700.0
12 101073 Concordia College Alabama ... 24400.0 21300.0
13 101116 South University-Montgomery ... 35500.0 29800.0
14 101143 Enterprise State Community College ... 30600.0 26300.0
15 101161 James H Faulkner State Community College ... 31700.0 28900.0
16 101189 Faulkner University ... 43500.0 37600.0
17 101240 Gadsden State Community College ... 29900.0 26900.0
18 101277 New Beginning College of Cosmetology ... ? ?
19 101286 George C Wallace State Community College-Dothan ... 31900.0 27100.0

[20 rows x 48 columns]
{noformat}
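For comparison, a quote-aware reader keeps embedded commas inside one field once it is told that the quote character is a single quote. The sketch below uses Python's standard `csv` module rather than pandas or H2O, so it only illustrates the expected behaviour; the sample data is hypothetical and much smaller than the attached file.

```python
import csv
import io

# Hypothetical two-line sample in the same style as the attachment:
# values are wrapped in single quotes and may contain commas.
sample = (
    "UNITID,school_name,region\n"
    "100654,'Alabama A & M University','Southeast (AL, AR, FL)'\n"
)

# The default quotechar is '"', so single quotes are ordinary characters
# and the embedded commas split the region value.
default_rows = list(csv.reader(io.StringIO(sample)))

# With quotechar="'" the quoted value stays in one field.
quoted_rows = list(csv.reader(io.StringIO(sample), quotechar="'"))

print(len(default_rows[1]))  # field count with the default quote character
print(len(quoted_rows[1]))   # field count with quotechar="'"
print(quoted_rows[1][2])     # the region value, commas intact
```

A regression test for the parser could compare the number of fields per row against the header in the same way.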

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: This also fails on https://www.openml.org/t/359993

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7996
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.32.1.1
Attachments: Available (Count: 2)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/5389
https://github.com/h2oai/h2o-3/pull/5388

Attachments From Jira

Attachment Name: colleges_sample.arff
Attached By: Sebastien Poirier
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7996/colleges_sample.arff

Attachment Name: colleges_sample.csv
Attached By: Sebastien Poirier
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7996/colleges_sample.csv