red-data-tools / red-datasets

A RubyGem that provides common datasets
MIT License

Fix Rdatasets#each to change "NA" to nil #139

Closed heronshoes closed 2 years ago

heronshoes commented 2 years ago

This pull request fixes #138.

$ bundle exec benchmark-driver benchmark_rdataset.yml

(snipped)
         original     21.516 i/s -      65.000 times in 3.021053s (46.48ms/i)
         new code     28.640 i/s -      90.000 times in 3.142489s (34.92ms/i)

benchmark_rdataset.yml

  prelude: |
    require 'datasets'
    dataset = Datasets::Rdatasets.new('datasets', 'quakes')

  benchmark:
    'new code': 'dataset.to_a'
heronshoes commented 2 years ago

When I used CSV.table without options, the header names were automatically converted, like this:

{:ozone=>20, :solarr=>223, :wind=>11.5, :temp=>68, :month=>9, :day=>30},

They should be:

{:Ozone=>20, :"Solar.R"=>223, :Wind=>11.5, :Temp=>68, :Month=>9, :Day=>30}

So I added an option for the headers, and the names are symbolized later, when each row is passed to the block.
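To illustrate the difference, here is a minimal sketch using only the standard csv library. It is not the code in this pull request: the sample data is made up (the first row mirrors the example above, the second row is hypothetical), and the variable names are for illustration only. It assumes that CSV.table's defaults are equivalent to `headers: true, converters: :numeric, header_converters: :symbol`, as the csv documentation describes.

```ruby
require "csv"

# Made-up sample in the shape of the airquality data; the second row is hypothetical.
data = <<~DATA
  Ozone,Solar.R,Wind,Temp,Month,Day
  20,223,11.5,68,9,30
  NA,145,13.2,77,9,27
DATA

# Same options that CSV.table applies by default:
# the :symbol header converter downcases names and drops characters such as ".".
converted = CSV.parse(data,
                      headers: true,
                      converters: :numeric,
                      header_converters: :symbol).map(&:to_h)

# Keep the headers as they are, then symbolize the keys and
# replace "NA" with nil when building each record.
preserved = CSV.parse(data, headers: true, converters: :numeric).map do |row|
  row.to_h
     .transform_keys(&:to_sym)
     .transform_values { |value| value == "NA" ? nil : value }
end

p converted.first.keys
p preserved.first.keys
p preserved.last[:Ozone]
```

Symbolizing with `to_sym` after parsing keeps names such as `Solar.R` intact, at the cost of one extra pass over each row's keys.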

heronshoes commented 2 years ago

The CI test has been failing (for a month?) because the size of RdatasetsList in the cache is too small.

This reverts @kou's 'test rdatasets: update expected' (f42e9e8c1d73b6750e598b3fad2ec4ca1cc19ced).

heronshoes commented 2 years ago

After this fix is applied, 265 datasets in Rdatasets are convertible to Arrow tables.

https://gist.github.com/heronshoes/e6e4a9f093000f2f7435345874987848

heronshoes commented 2 years ago

The Ruby 2.6/Windows test failed because of an unexpected TCP error, so I pushed the same code again to restart the CI test.

kou commented 2 years ago

Thanks!