h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O flow creates extra row with missing values when uploading file (with column names) #15441

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

In the attachments:

  1. Image of the missing value (extra row created) when the CSV is uploaded with column names
  2. When the CSV (with the column names removed) is uploaded, no row with missing values is created

Does it try to create new column names with Enum-type variables? Note: "online" (the response variable has to be an enum for the ROC curve).

Info: No error in parse parseFiles source_frames: ["VolledigeDatasetemailfixmetnamen.csv"] destination_frame: "Key_Frame__VolledigeDatasetemailfixmetnamen.hex" parse_type: "CSV" separator: 44 number_columns: 829 single_quotes: false column_names: ["","price","user","views","online","category","email","price_type","ip","t_aanbieding","t_aanbod","t_aangeboden","t_aankoop","t_aantrekkelijke","t_account","t_active","t_adoptie","t_afrikaanse","t_airco","t_airsoft","t_alfa","t_alle","t_alles","t_aluminium","t_amateur","t_amerikaanse","t_andalusie","t_andere","t_antieke","t_antwerpen","t_appartement","t_apple","t_ardennen","t_astra","t_audi","t_auto","t_automaat","t_automatten","t_avantgarde","t_baby","t_belgie","t_belgische","t_bellen","t_benz","t_benzine","t_bepaalde","t_beschikbaar","t_beste","t_betrouwbare","t_bieden","t_biedt","t_bijles","t_bijverdienste","t_black","t_blackberry","t_blanca","t_blauw","t_blauwe","t_blue","t_bluetooth","t_boek","t_brand","t_break","t_brits","t_britse","t_bull","t_bulldog","t_business","t_cabrio","t_camera","t_canon","t_carnet","t_cdti","t_chalet","t_charles","t_chevrolet","t_chihuahua","t_citroen","t_classic","t_clim","t_clio","t_comfort","t_comfortline","t_complete","t_contact","t_cooper","t_corsa","t_cosmo","t_costa","t_coupe","t_crdi","t_cruise","t_cuir","t_dame","t_dames","t_date","t_daten","t_dating","t_design","t_details","t_deze","t_diesel","t_dikke","t_direct","t_diverse","t_door","t_doos","t_dringend","t_duitse","t_echt","t_echte","t_edition","t_eens","t_eetkamer","t_eigen","t_eigenaar","t_eiken","t_elegance","t_elektrische","t_engels","t_engelse","t_enjoy","t_ernstige","t_escort","t_euro","t_exclusive","t_export","t_extra","t_factory","t_fiat","t_fiesta","t_financiᅢテᅡᆱle","t_financiering","t_focus","t_ford","t_franse","t_full","t_gaan","t_galaxy","t_garantie","t_gashaard","t_geen","t_geil","t_geile","t_geld","t_gevraagd","t_gezocht","t_gezonde","t_goed","t_goede","t_golden","t_golf","t_goud","t_grand","t_gratis","t_grijs","t_grijze","t_groot","t_grote","t_haar","t_heeft","t_heel","t_herder","t_heren","t_hete","t_hier","t_hond","t_honda","t_houten","t_houtkachel","t_huis","t_hulp","t_huren","t_husky","t_huur","t_hyundai","t_inch","t_ipad","t_iphone","t_jaar","t_jack","t_jaguar","t_jong","t_jonge","t_jouw","t_kast","t_king","t_kinky","t_kitten","t_kittens","t_klaar","t_kleine","t_kleur","t_koop","t_kopen","t_kort","t_korthaar","t_korting","t_krediet","t_kwaliteit","t_labrador","t_lage","t_land","t_leder","t_lederen","t_lekker","t_lekkere","t_lening","t_leningen","t_leuk","t_leuke","t_leven","t_lieve","t_limousine","t_line","t_live","t_logo","t_lounge","t_luxe","t_maar","t_maat","t_macbook","t_main","t_maken","t_maltese","t_mannelijke","t_mannen","t_manueel","t_massage","t_massief","t_mazda","t_meer","t_megane","t_meid","t_meiden","t_meisje","t_mensen","t_mercedes","t_merk","t_merken","t_meubelen","t_mijn","t_mini","t_mitsubishi","t_model","t_mois","t_mooi","t_mooie","t_multijet","t_naar","t_navi","t_navigatie","t_nederland","t_neger","t_niet","t_nieuw","t_nieuwe","t_nikon","t_nissan","t_nodig","t_nokia","t_online","t_opel","t_open","t_opkoper","t_option","t_opzoek","t_origineel","t_originele","t_oude","t_oudere","t_over","t_pack","t_pano","t_papegaaien","t_particulieren","t_partner","t_passat","t_perfecte","t_pers","t_personen","t_perzische","t_peugeot","t_phone","t_picasso","t_plus","t_polo","t_pookhoes","t_porsche","t_prachtig","t_prachtige","t_premium","t_prijs","t_prive","t_pupjes","t_puppies","t_puppy","t_pups","t_quattro","t_radio","t_ragdoll","t_
regio","t_relatie","t_renault","t_rente","t_retriever","t_rijpe","t_romeo","t_rood","t_rover","t_runescape","t_salon","t_salontafel","t_samsung","t_sauna","t_schade","t_schattig","t_schattige","t_seat","t_seks","t_sexdate","t_sexdaten","t_sexdating","t_sexy","t_shirt","t_siberische","t_singles","t_skoda","t_slaapkamer","t_snel","t_snelle","t_sold","t_sony","t_spanje","t_spannend","t_sport","t_sprinter","t_staat","t_stamboom","t_start","t_stoelen","t_stop","t_studio","t_stuks","t_super","t_suzuki","t_tafel","t_tdci","t_tegen","t_telefoonsex","t_terrier","t_thuis","t_thuiswerk","t_ticket","t_tieners","t_titanium","t_toepassing","t_touring","t_toyota","t_trend","t_trendline","t_tuin","t_turbo","t_tussen","t_twee","t_type","t_unlocked","t_vakantie","t_vakantiehuis","t_vakantiehuisjes","t_vakantiewoning","t_vakantiewoningen","t_vanaf","t_vandaag","t_veel","t_velgen","t_velours","t_vendu","t_verdienen","t_verkocht","t_verkoop","t_versnellingen","t_verzenden","t_villa","t_volkswagen","t_volledig","t_volvo","t_voor","t_vrijstaande","t_vrouw","t_vrouwelijke","t_vrouwen","t_wagen","t_wagens","t_waterfilter","t_weken","t_welkom","t_werk","t_white","t_willen","t_wilt","t_witte","t_woning","t_worden","t_xenon","t_yamaha","t_yorkshire","t_zafira","t_zeer","t_zijn","t_zilver","t_zoek","t_zoeken","t_zoekt","t_zonder","t_zuid","t_zwart","t_zwarte","t_zwembad","d_aanbieding","d_aanbod","d_aangeboden","d_aankoop","d_aantrekkelijke","d_account","d_active","d_adoptie","d_afrikaanse","d_airco","d_airsoft","d_alfa","d_alle","d_alles","d_aluminium","d_amateur","d_amerikaanse","d_andalusie","d_andere","d_antieke","d_antwerpen","d_appartement","d_apple","d_ardennen","d_astra","d_audi","d_auto","d_automaat","d_automatten","d_avantgarde","d_baby","d_belgie","d_belgische","d_bellen","d_benz","d_benzine","d_bepaalde","d_beschikbaar","d_beste","d_betrouwbare","d_bieden","d_biedt","d_bijles","d_bijverdienste","d_black","d_blackberry","d_blanca","d_blauw","d_blauwe","d_blue","d_bluetooth","d_boek","d_brand","d_break","d_brits","d_britse","d_bull","d_bulldog","d_business","d_cabrio","d_camera","d_canon","d_carnet","d_cdti","d_chalet","d_charles","d_chevrolet","d_chihuahua","d_citroen","d_classic","d_clim","d_clio","d_comfort","d_comfortline","d_complete","d_contact","d_cooper","d_corsa","d_cosmo","d_costa","d_coupe","d_crdi","d_cruise","d_cuir","d_dame","d_dames","d_date","d_daten","d_dating","d_design","d_details","d_deze","d_diesel","d_dikke","d_direct","d_diverse","d_door","d_doos","d_dringend","d_duitse","d_echt","d_echte","d_edition","d_eens","d_eetkamer","d_eigen","d_eigenaar","d_eiken","d_elegance","d_elektrische","d_engels","d_engelse","d_enjoy","d_ernstige","d_escort","d_euro","d_exclusive","d_export","d_extra","d_factory","d_fiat","d_fiesta","d_financiᅢテᅡᆱle","d_financiering","d_focus","d_ford","d_franse","d_full","d_gaan","d_galaxy","d_garantie","d_gashaard","d_geen","d_geil","d_geile","d_geld","d_gevraagd","d_gezocht","d_gezonde","d_goed","d_goede","d_golden","d_golf","d_goud","d_grand","d_gratis","d_grijs","d_grijze","d_groot","d_grote","d_haar","d_heeft","d_heel","d_herder","d_heren","d_hete","d_hier","d_hond","d_honda","d_houten","d_houtkachel","d_huis","d_hulp","d_huren","d_husky","d_huur","d_hyundai","d_inch","d_ipad","d_iphone","d_jaar","d_jack","d_jaguar","d_jong","d_jonge","d_jouw","d_kast","d_king","d_kinky","d_kitten","d_kittens","d_klaar","d_kleine","d_kleur","d_koop","d_kopen","d_kort","d_korthaar","d_korting","d_krediet","d_kwaliteit","d_labrador","d_lage","d_land","d_leder","d_lederen","d_lekker","
d_lekkere","d_lening","d_leningen","d_leuk","d_leuke","d_leven","d_lieve","d_limousine","d_line","d_live","d_logo","d_lounge","d_luxe","d_maar","d_maat","d_macbook","d_main","d_maken","d_maltese","d_mannelijke","d_mannen","d_manueel","d_massage","d_massief","d_mazda","d_meer","d_megane","d_meid","d_meiden","d_meisje","d_mensen","d_mercedes","d_merk","d_merken","d_meubelen","d_mijn","d_mini","d_mitsubishi","d_model","d_mois","d_mooi","d_mooie","d_multijet","d_naar","d_navi","d_navigatie","d_nederland","d_neger","d_niet","d_nieuw","d_nieuwe","d_nikon","d_nissan","d_nodig","d_nokia","d_online","d_opel","d_open","d_opkoper","d_option","d_opzoek","d_origineel","d_originele","d_oude","d_oudere","d_over","d_pack","d_pano","d_papegaaien","d_particulieren","d_partner","d_passat","d_perfecte","d_pers","d_personen","d_perzische","d_peugeot","d_phone","d_picasso","d_plus","d_polo","d_pookhoes","d_porsche","d_prachtig","d_prachtige","d_premium","d_prijs","d_prive","d_pupjes","d_puppies","d_puppy","d_pups","d_quattro","d_radio","d_ragdoll","d_regio","d_relatie","d_renault","d_rente","d_retriever","d_rijpe","d_romeo","d_rood","d_rover","d_runescape","d_salon","d_salontafel","d_samsung","d_sauna","d_schade","d_schattig","d_schattige","d_seat","d_seks","d_sexdate","d_sexdaten","d_sexdating","d_sexy","d_shirt","d_siberische","d_singles","d_skoda","d_slaapkamer","d_snel","d_snelle","d_sold","d_sony","d_spanje","d_spannend","d_sport","d_sprinter","d_staat","d_stamboom","d_start","d_stoelen","d_stop","d_studio","d_stuks","d_super","d_suzuki","d_tafel","d_tdci","d_tegen","d_telefoonsex","d_terrier","d_thuis","d_thuiswerk","d_ticket","d_tieners","d_titanium","d_toepassing","d_touring","d_toyota","d_trend","d_trendline","d_tuin","d_turbo","d_tussen","d_twee","d_type","d_unlocked","d_vakantie","d_vakantiehuis","d_vakantiehuisjes","d_vakantiewoning","d_vakantiewoningen","d_vanaf","d_vandaag","d_veel","d_velgen","d_velours","d_vendu","d_verdienen","d_verkocht","d_verkoop","d_versnellingen","d_verzenden","d_villa","d_volkswagen","d_volledig","d_volvo","d_voor","d_vrijstaande","d_vrouw","d_vrouwelijke","d_vrouwen","d_wagen","d_wagens","d_waterfilter","d_weken","d_welkom","d_werk","d_white","d_willen","d_wilt","d_witte","d_woning","d_worden","d_xenon","d_yamaha","d_yorkshire","d_zafira","d_zeer","d_zijn","d_zilver","d_zoek","d_zoeken","d_zoekt","d_zonder","d_zuid","d_zwart","d_zwarte","d_zwembad"] column_types: 
["Numeric","Numeric","Numeric","Numeric","Enum","Numeric","Enum","Numeric","Enum","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Nu
meric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric
","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric"] delete_on_done: true check_header: 1 chunk_size: 4194304

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: The described behavior could not be reproduced. Two identical CSVs were used, one with the very first row containing the column names and one without it. There was no difference in the number of rows parsed when the only difference between the two CSVs was the header row being present or missing.

However, the behavior [~accountid:5a4e17bd065f952a4ad00a9a] describes can be reproduced in a slightly different way. Given a table with 3 rows, the first row containing the column names:

{code:java}
Name,Value
Johny,25
Paul,35
{code}

such a CSV is parsed correctly, resulting in 2 rows. However, when a new line with a space or any other non-printable character is added:

{code:java}
Name,Value
Johny,25
Paul,35
 <- a single space or more spaces here
{code}

The resulting frame contains a new row with empty values.

The same behavior is observed when the first row does not contain column names.

!Snímek obrazovky pořízený 2018-01-08 09-36-03.png|thumbnail!

The theory is that the "explanatory variable" for the empty row being parsed is not the presence of a row with column names, but rather the presence of an empty line with a non-printable character in it, e.g. a space. Most editors (LibreOffice Calc, Sublime, etc.) tend to remove spaces and similar characters on save for CSV files. A possible scenario would be:

Is the discovered behavior valid? [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6]
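
A minimal sketch of how this theory can be checked from the Python client (assuming only the standard h2o.init/h2o.import_file API; the file paths are made up for illustration):

{code:python}
import h2o

h2o.init()

# Two otherwise identical CSVs: one clean, one with a trailing line that
# contains only a single space character.
clean_path = "/tmp/clean.csv"        # hypothetical paths
trailing_path = "/tmp/trailing.csv"

with open(clean_path, "w") as f:
    f.write("Name,Value\nJohny,25\nPaul,35\n")

with open(trailing_path, "w") as f:
    f.write("Name,Value\nJohny,25\nPaul,35\n \n")  # note the lone space

clean = h2o.import_file(clean_path)
trailing = h2o.import_file(trailing_path)

# If the theory holds, the second frame reports one extra, all-missing row.
print(clean.nrow, trailing.nrow)
{code}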

exalate-issue-sync[bot] commented 1 year ago

Michal Malohlava commented: [~accountid:5a32df017dcf343865c26fa5] good catch! I believe it is a bug. I tried to modify your example and run a few additional experiments:

  1. Injected an empty row WITHOUT a space in the middle of the dataset (no empty space at the end):

{noformat}
Name,Value
Johny,25

Paul,35
{noformat}

The dataset was parsed into a frame with 2 rows, as EXPECTED.

  2. Injected a proper empty row (containing a delimiter) in the middle of the dataset:

{noformat}
Name,Value
Johny,25
,
Paul,35
{noformat}

This dataset was parsed into a frame with 3 rows, but this is EXPECTED as well.

  3. Finally, injected an empty row with a leading space in the middle of the dataset (no empty space at the end):

{noformat}
Name,Value
Johny,25
 <- a space character
Paul,35
{noformat}

The dataset was parsed into a frame with 3 rows. This is UNEXPECTED, but explainable: we treat the row as an incomplete row with a value in the first column (an empty value). I believe this is wrong behavior; we should ignore the row since it does not contain a delimiter and the value of the first column is "empty" (I do not remember any use case where people used whitespace characters to express a value).
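
The same experiments can be rebuilt as a small script (a sketch using the Python client; paths and file contents are illustrative):

{code:python}
import h2o

h2o.init()

# The three middle-of-dataset variants described above: a truly empty line,
# a delimiter-only line, and a line holding a single space.
variants = {
    "empty_line": "Name,Value\nJohny,25\n\nPaul,35\n",
    "delimiter_only": "Name,Value\nJohny,25\n,\nPaul,35\n",
    "single_space": "Name,Value\nJohny,25\n \nPaul,35\n",
}

for name, content in variants.items():
    path = "/tmp/%s.csv" % name            # hypothetical temp files
    with open(path, "w") as f:
        f.write(content)
    frame = h2o.import_file(path)
    # Reported above: 2, 3 and 3 rows respectively; only the last case
    # (the single space) is considered unexpected.
    print(name, frame.nrow)
{code}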

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: What I find weird is that the first column name is actually an empty string (see the parseFiles statement).

I think what the user is saying is that he first uploads the CSV with a header line and then without a header line. The upload with the header line produces one extra row in the Frame. This makes me think there is a problem with the header line itself and not with the rest of the file.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Hello [~accountid:5a4e17bd065f952a4ad00a9a], thank you for reporting the issue. Could you please provide your CSV file, as we're unable to reproduce it exactly?

Thank you, Pavel

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: I've retried reproducing with the exact number of columns [~accountid:5a4e17bd065f952a4ad00a9a] provided. I also checked whether the special characters in the column names could be the cause. To no avail.

[~accountid:5a4e17bd065f952a4ad00a9a], would it be possible to provide your CSV file, or at least part of it?

Thank you, Pavel

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: Excuse me, but I guess I'm having this issue.

I just started testing H2O Flow. I'm trying to correctly import a fake data frame with 5000 rows. Just to tell the story, the data looks like this:

!screenshot-1.png|thumbnail!

So, I tried to import it in Flow, which imports correctly except for the fact that it puts the header in as the first data row, even though I'm setting that it is the header. Here, I tried all 3 options, to no avail.

!screenshot-2.png|thumbnail!

!screenshot-3.png|thumbnail!

!screenshot-4.png|thumbnail!

I'm attaching the CSV file. [^dummydata.csv]

H2O version: h2o-3.16.0.3
Java version: "1.8.0_151"
OS: Linux 4.10.0-42-generic #46-Ubuntu x86_64 x86_64 x86_64 GNU/Linux

Thanks for listening! :)

Best,

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Hello [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869], thank you for your contribution. Your case is easily reproducible.

We're currently in the process of fixing this issue.

Best regards, Pavel

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: I've tried to reproduce the case described by [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869] with the dummydata.csv he provided. Ubuntu machine, the same settings. Data parses correctly. We'll need to look into this more deeply.

I used a recent master. The same result was achieved when I rolled back to jenkins-3.16.0.3.

!dummydata_parsed_master.png|thumbnail!

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: Please let me know if you need anything from my computer to help your debugging.

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: I have to say that my work evolved here. I began using real data and Flow parsed the file successfully.

I really don't know what could have happened there.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Thank you [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869]. We're still working on reproducing this issue.

Is your schema the same, including the ",loja_estado,...." start of the row with column names, with only the data changed? We'll see if this issue is somehow related to encoding.

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: I can say the schema changed a little, yeah. Dropped some columns, created another, and the data is real this time.

{quote}We'll see if this issue is somehow related to encoding.{quote}

The 'name' data column is one of those I dropped, and I had some dummy data with accented chars in it, so yeah, it could be that.

Thanks for the follow up!

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Thank you [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869]!

Which column in the dummydata.csv file you mentioned is 'name'? Even when translated, it seems to me there is no column like 'name'. Is this related to a dataset unknown to us? Could you provide a sample? At least a dummy column with the accented chars as an example.

Thank you, Pavel

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: Hey!

You're right, I'm sorry. Strangely, the error happened with the file I gave you.

Aside from that, I managed to create a fake data file with accented chars. I'm attaching it here. Columns suffixed _nome or _bairro are the ones you want.

I can try again with the former dummydata.csv here, if you need it. Just let me know.

[^dummydata_full.csv]

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: I tried the [^dummydata_full.csv] provided by Mr. [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869]. To no avail.

!Snímek obrazovky pořízený 2018-01-22 23-11-54.png|thumbnail!

Is this file the source of your problems, Mr. [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869]?

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: Actually, no. The problem was with [^dummydata.csv]. While using it to test how H2O works, this happened.

!screenshot-4.png|thumbnail!

I thought that it could be useful to report. Unfortunately, it didn't happen with real data.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869]

Two quick questions:

  1. Could you please share your H2O log with us?
  2. Can you reproduce it every time?

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: Not-so-quick answers. Sorry about that, @Pavel Pscheidl.

  1. Sure! So I suppose I have to turn on H2O, try to parse some file, get the logs and send them here, right?

  2. No, it depends on the dataset.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869] Thank you for your time, it is much appreciated. We're trying to reproduce it and so far there has been no success.

The whole log from H2O, from startup to the error, as you describe in point 1, would help us simulate your environment.

Thank you, Pavel

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: Right, I got the log here. I didn't use any extra parameters to get it, so let me know if you need any.

[^h2o.log]

Using the [^dummydata.csv] dataset, I did the following:

  1. started H2O and opened it in the browser
  2. clicked "ImportFiles" from the "Assistance" list
  3. wrote the path to the dummydata.csv file
  4. clicked on the file in the "Selected Files" list
  5. clicked "Import"; the "1/1 files imported" section appeared
  6. clicked "Parse these files"; the "Setup Parse" section appeared, with some info about the file filled in
  7. in "Setup Parse":
     * changed the "sucesso" column type to "Enum"
     * named column 1 "funcionario_id"
     * off-topic note: if I give column 1 a name and then change the "sucesso" column type, the column 1 name is cleared
  8. clicked the "Parse" button; the "Job for Parse" section appeared
  9. clicked the "View" button in the Job section; the "dummydata.hex" section appeared, showing 5001 Rows
  10. clicked the "View Data" button; I get !screenshot-4.png|thumbnail!

Hope that will help you!
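
For reference, a rough Python-client equivalent of the parse setup above (a sketch only; the path is illustrative, and col_types/set_name are the standard client calls, not the exact Flow internals):

{code:python}
import h2o

h2o.init()

# Roughly the same setup as the Flow steps: force "sucesso" to Enum and
# give the unnamed first column the name "funcionario_id".
frame = h2o.import_file(
    path="/path/to/dummydata.csv",   # hypothetical location of the attachment
    header=1,                        # treat the first line as the header
    col_types={"sucesso": "enum"},
)
frame.set_name(0, "funcionario_id")  # rename the first (unnamed) column

# A correct parse of the attached file should report 5000 data rows;
# the bug shows up as 5001 (the header leaking in as a data row).
print(frame.nrow)
{code}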

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: [~accountid:557058:1565ae11-487b-47a5-95b1-43c93e3a7869] Thank you. We were finally able to reproduce your issue.

We'll also take a look at your off-topic note about the column name being cleared (we were also able to reproduce this).

Best regards, Pavel

exalate-issue-sync[bot] commented 1 year ago

Paulo Cheadi Haddad Filho commented: 🎉🎉🎉

Great! I'm glad to help you!

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: The solution to this issue is an enhancement of header parsing for CSV files. There was a PR https://github.com/h2oai/h2o-3/pull/2021 which I've closed for not being a sufficient solution.

This specific issue is caused by H2O treating the first row as data after not all columns have been recognized. This happens when at least one of the columns is not matched against the column names parsed from the file, which typically happens when the user tries to rename at least one column!

The root of this issue is the method fileHasHeader() in the CsvParser class, which requires all column names to be the same, otherwise it reports that the file has no header. This could easily be changed to a different strategy, e.g. matching column count and column types, or requiring at least one column name in common (which won't work if all of them are renamed, and is thus wrong). However, such a change breaks some of the tests and causes other workarounds built around this method to stop working.

Due to this problem, there will be a new PR. I propose to
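
To make the proposed strategies concrete, here is a standalone sketch (plain Python, not the actual Java fileHasHeader() in CsvParser) of the relaxed header detection discussed above; as noted, the "at least one name in common" rule still fails when every column is renamed:

{code:python}
def looks_numeric(value):
    """Crude stand-in for the parser's type guess for a single field."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def file_has_header(first_row_fields, user_column_names=None):
    """Decide whether the first CSV row is a header.

    Instead of requiring every user-supplied name to match the file's first
    row exactly, accept the row as a header when the column count matches
    and at least one name agrees, so renaming a single column no longer
    demotes the header to a data row.
    """
    if user_column_names is None:
        # No user-supplied names: fall back to a type-based heuristic,
        # e.g. "a header row should not look numeric".
        return not any(looks_numeric(f) for f in first_row_fields)
    if len(first_row_fields) != len(user_column_names):
        return False
    # One exact match is enough; this still fails if ALL columns are renamed.
    return any(a == b for a, b in zip(first_row_fields, user_column_names))


# Example: renaming the first (empty) column to "funcionario_id" should not
# turn the header line into a data row.
print(file_has_header(["", "sucesso"], ["funcionario_id", "sucesso"]))  # True
{code}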

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5198
Assignee: New H2O Bugs
Reporter: Michael Vergauwen
State: Open
Fix Version: N/A
Attachments: Available (Count: 12)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/2021
https://github.com/h2oai/h2o-3/pull/2121

Attachments From Jira

Attachment Name: CSVwithcolumnnames.png | Attached By: Michael Vergauwen | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/CSVwithcolumnnames.png

Attachment Name: CSVwithoutcolumnnames.png | Attached By: Michael Vergauwen | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/CSVwithoutcolumnnames.png

Attachment Name: dummydata_full.csv | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/dummydata_full.csv

Attachment Name: dummydata_parsed_master.png | Attached By: Pavel Pscheidl | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/dummydata_parsed_master.png

Attachment Name: dummydata.csv | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/dummydata.csv

Attachment Name: h2o.log | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/h2o.log

Attachment Name: screenshot-1.png | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/screenshot-1.png

Attachment Name: screenshot-2.png | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/screenshot-2.png

Attachment Name: screenshot-3.png | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/screenshot-3.png

Attachment Name: screenshot-4.png | Attached By: Paulo Cheadi Haddad Filho | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/screenshot-4.png

Attachment Name: Snímek obrazovky pořízený 2018-01-08 09-36-03.png | Attached By: Pavel Pscheidl | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/Snímek obrazovky pořízený 2018-01-08 09-36-03.png

Attachment Name: Snímek obrazovky pořízený 2018-01-22 23-11-54.png | Attached By: Pavel Pscheidl | File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5198/Snímek obrazovky pořízený 2018-01-22 23-11-54.png