heronshoes closed this 1 year ago
Thanks!
But I want to avoid rechecking all values in each column, for performance reasons. It seems that the CSVs in Rdatasets are written by write.csv:
https://github.com/vincentarelbundock/Rdatasets/blob/master/scrape.R#L56
and write.csv uses double quotes for factor values:
https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/write.table
quote
a logical value (TRUE or FALSE) or a numeric vector. If TRUE, any character or factor columns will be surrounded by double quotes. If a numeric vector, its elements are taken as the indices of columns to quote. In both cases, row and column names are quoted if they are written. If FALSE, nothing is quoted.
See also the raw CSV for attenu: https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/attenu.csv
Can we use this instead of rechecking all values?
We may need to improve ruby/csv because it doesn't provide "quoted" information to users...
We may be able to pass "quoted" information to converters via CSV::FieldInfo: https://github.com/ruby/csv/blob/master/lib/csv/fields_converter.rb#L66
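For reference, here is a minimal sketch of how a field converter receives a CSV::FieldInfo today. It relies only on the `index`, `line`, and `header` members; the `inspect_converter` name and the two-line sample are illustrative assumptions, not code from red-datasets.

```ruby
require "csv"

# Collects what the converter sees for each field.
seen = []

# A converter that takes two arguments is called with the field and a
# CSV::FieldInfo describing where the field came from.
inspect_converter = lambda do |field, field_info|
  seen << [field_info.index, field_info.header, field]
  field # return the field unchanged
end

# Illustrative sample in the write.csv style (character values are quoted).
data = <<~DATA
  "","temp","species"
  "1",10,"wheat"
DATA

CSV.parse(data, headers: true, converters: [inspect_converter])

seen.each { |index, header, field| puts "#{index} #{header.inspect} #{field.inspect}" }
```

Note that both `"1"` and `10` reach the converter as plain strings, so the converter cannot tell which one was quoted in the file. That is the information we would need CSV to pass along.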
Thank you for your advice. I agree with you that this code is redundant. I will try to use FieldInfo.
I would appreciate your continued support.
This will fix 53 of the 55 errors. The remaining 2 datasets are [["drc", "germination"], ["validate", "nace_rev2"]].
"germination" contains "Inf" in a numeric column.
"","temp","species","start","end","germinated"
"1",10,"wheat",0,1,0
:
"16",10,"wheat",17,18,0
"17",10,"wheat",18,Inf,2
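The "Inf" value needs special handling because Ruby's `Float()` does not accept it as a number. A minimal sketch of one way to handle it follows; `to_number` is a hypothetical helper, not the code in this pull request, and mapping "-Inf" as well is my assumption.

```ruby
# Hypothetical helper: convert an unquoted CSV value written by R to a number.
def to_number(text)
  case text
  when "Inf"  then Float::INFINITY
  when "-Inf" then -Float::INFINITY
  else
    # Base 10 avoids reading values such as "010" as octal.
    # Fall back to the original string when the value is not numeric.
    Integer(text, 10, exception: false) || Float(text, exception: false) || text
  end
end

p Float("Inf", exception: false) # Float() alone rejects "Inf"
p to_number("Inf")
p to_number("18")
```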
This change will cover both "germination" and "nace_rev2". I added tests for them. The fix for "nace_rev2" is a temporary one. I think we need some refinement in the CSV library itself.
I want to hold this until the 'quoted?' information is available from CSV. I have noted the fixes needed.
Sorry for holding this so long!
I incorporated CSV's new 'quoted?' feature and fixed the remaining issues.
Temporarily, I specify the HEAD version of CSV to use the new feature.
# in Gemfile
gem 'csv', github: 'ruby/csv'
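As an illustration of the idea (not the exact code in this pull request), a converter can skip quoted fields and only try numeric conversion on unquoted ones. This sketch assumes a csv version whose CSV::FieldInfo has `quoted?`; the `unquoted_numeric` name is hypothetical, and the `respond_to?` guard is only there so the sketch still runs on an older csv.

```ruby
require "csv"

# Hypothetical converter: keep quoted fields as strings, convert the rest.
unquoted_numeric = lambda do |field, field_info|
  # Assumption: quoted? is available; older csv versions treat every field as unquoted here.
  quoted = field_info.respond_to?(:quoted?) && field_info.quoted?
  next field if quoted

  Integer(field, 10, exception: false) || Float(field, exception: false) || field
end

# One row from the "germination" sample quoted above.
data = <<~DATA
  "","temp","species","start","end","germinated"
  "16",10,"wheat",17,18,0
DATA

table = CSV.parse(data, headers: true, converters: [unquoted_numeric])
fields = table[0].fields
p fields
```

With this approach the quoted row name and the species stay strings without rechecking the whole column.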
I would appreciate it if you could comment.
Thanks!
This pull request is a solution for the 2nd part of https://github.com/red-data-tools/red-datasets/issues/138 .
This code is based on #139 and is 1 commit ahead of it. Once #139 is merged, I will rebase this onto master.