Fix Rdatasets#each with mixed data of numeric and string

red-data-tools / red-datasets

A RubyGem that provides common datasets

MIT License


Fix Rdatasets#each with mixed data of numeric and string #140

Closed heronshoes closed 1 year ago

heronshoes commented 2 years ago

This pull request addresses the second part of https://github.com/red-data-tools/red-datasets/issues/138 .

This code is based on #139 and is 1 commit ahead of it. Once #139 is merged, I will rebase this on master.

kou commented 2 years ago

Thanks!

But I want to avoid rechecking all values in each column, for performance reasons. It seems that the CSVs in Rdatasets are written by write.csv ( https://github.com/vincentarelbundock/Rdatasets/blob/master/scrape.R#L56 ), and write.csv surrounds character and factor values with double quotes:

https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/write.table

quote

a logical value (TRUE or FALSE) or a numeric vector. If TRUE, any character or factor columns will be surrounded by double quotes. If a numeric vector, its elements are taken as the indices of columns to quote. In both cases, row and column names are quoted if they are written. If FALSE, nothing is quoted.

Can we use this instead of rechecking all values?

We may need to improve ruby/csv because it doesn't provide "quoted" information to users...

kou commented 2 years ago

We may be able to pass "quoted" information to converters via CSV::FieldInfo: https://github.com/ruby/csv/blob/master/lib/csv/fields_converter.rb#L66

heronshoes commented 2 years ago

Thank you for your advice. I agree that this code is redundant. I will try to use FieldInfo.

I would appreciate your continued support.

heronshoes commented 2 years ago

This fixes 53 of the 55 errors. The remaining 2 datasets are [["drc", "germination"], ["validate", "nace_rev2"]].

"germination" contains "Inf" in a numeric column.

"","temp","species","start","end","germinated"
"1",10,"wheat",0,1,0
 :
"16",10,"wheat",17,18,0
"17",10,"wheat",18,Inf,2
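
A minimal sketch of how an unquoted "Inf" could be mapped to a Ruby value. The helper name parse_r_numeric is hypothetical, handling "-Inf" is an assumption, and this is not the code from this pull request:

```ruby
# Hypothetical helper (not from this pull request): convert one unquoted
# field into a Ruby value, treating "Inf" as positive infinity.
def parse_r_numeric(field)
  case field
  when "Inf"  then Float::INFINITY
  when "-Inf" then -Float::INFINITY # assumption: negative infinity is written this way
  else
    # Try base-10 integer first, then float; otherwise keep the string as-is.
    Integer(field, 10, exception: false) ||
      Float(field, exception: false) ||
      field
  end
end

p %w[18 Inf 2].map { |f| parse_r_numeric(f) }
# => [18, Infinity, 2]
```

Passing base 10 to Integer keeps a value such as "010" from being read as octal.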

heronshoes commented 2 years ago

This change covers both "germination" and "nace_rev2", and I added tests for them. The fix for "nace_rev2" is a temporary one. I think we need some refinement in the CSV library itself.

heronshoes commented 2 years ago

I want to hold this until the 'quoted?' information is available from CSV. For the record, these are the fixes still needed:

Use start_with?/end_with? instead of regexp.
Use begin/rescue for readability.
Stop using the :quote_char option once CSV is improved.
Use the :symbol_raw header_converter in CSV.

heronshoes commented 1 year ago

Sorry for holding this so long!

I incorporated CSV's 'quoted' feature and fixed the remaining issues.

Temporarily, I specify the HEAD version of CSV to use the new feature.

# in Gemfile
gem 'csv', github: 'ruby/csv'
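
For illustration, a minimal sketch of a converter that relies on the 'quoted' information. It assumes a csv version that provides CSV::FieldInfo#quoted?; the names QUOTE_AWARE_CONVERTER and parse_quote_aware are hypothetical, and this is not the code from this pull request:

```ruby
require "csv"

# Hypothetical converter: keep quoted fields as strings and try numeric
# conversion only on unquoted fields. A two-argument converter receives a
# CSV::FieldInfo as its second argument.
QUOTE_AWARE_CONVERTER = lambda do |field, info|
  # Assumption: info.quoted? is available in the installed csv version.
  next field if field.nil? || info.quoted?

  Integer(field, 10, exception: false) ||
    Float(field, exception: false) ||
    field
end

# Hypothetical helper: parse CSV text with a header row using the converter.
def parse_quote_aware(csv_text)
  CSV.parse(csv_text, headers: true, converters: [QUOTE_AWARE_CONVERTER])
end

csv_text = <<~TEXT
  "","temp","species","start","end","germinated"
  "1",10,"wheat",0,1,0
  "16",10,"wheat",17,18,0
TEXT

table = parse_quote_aware(csv_text)
p table[0].fields
```

With this approach the quoted row name "1" should stay a string while the unquoted 10 becomes an Integer, and no column needs to be rechecked after parsing.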

I would appreciate it if you could comment.

kou commented 1 year ago

Thanks!