oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3.03k stars 185 forks source link

CSV.read 2x slower than MRI #1072

Closed thbar closed 4 years ago

thbar commented 6 years ago

I made a test to compare speed of reading a CSV file with MRI 2.5.0 and TruffleRuby 0.30.2. It can be found here.

Copying it to make sure the reference doesn't break later:

require_relative 'helper'
require 'csv'

class TestParser < Minitest::Test
  def measure(msg)
    t = Time.now
    return yield
  ensure
    puts "Measure #{msg} - #{(Time.now - t).round(2)} seconds"
  end

  def filepath
    'data/sample.csv'
  end

  def setup
    return if File.exist?(filepath)
    File.open(filepath, 'w') do |f|
      f << ['index', 'some_string'].to_csv
      500_000.times do |i|
        f << [i, "SomeString-#{i}"].to_csv
      end
    end
  end

  def test_works
    count = 0
    measure("csv-foreach") do
      CSV.foreach(filepath, headers: true) do |row|
        count += 1
      end
    end
    assert_equal 500_000, count
  end
end

I'm only counting the time to read, not the time to write.

I took care not to use exotic encodings (e.g. ISO-8859-1) for now.

Here is the outcome:

chrisseaton commented 6 years ago

Thanks for the report!

We've got some docs on reporting performance problems https://github.com/oracle/truffleruby/blob/master/doc/user/reporting-performance-problems.md.

nirvdrum commented 6 years ago

FYI, I've begun looking into this issue and have submitted a few fixes already. Unfortunately, they didn't land in time for the next release (1.0.0 RC2), so you'll either need to build from master or wait for 1.0.0 RC3. But, I wanted you to know I haven't forgotten about your report.

thbar commented 6 years ago

@nirvdrum thanks for the update! FYI, I've re-discussed that with @eregon at RubyKaigi.

In all cases no hurry on anything, I'm just trying out things, no immediate real production need :-)

eregon commented 6 years ago

I created a small standalone script to measure CSV parsing with CSV.foreach for a 10k-rows CSV file, as used in the latest kiba-ruby-benchmarks: https://gist.github.com/eregon/577d2716c517aa2ec3018aa359073115

It gets faster after warmup, but it's still slower than MRI currently.

gogainda commented 4 years ago

@eregon Checked latest dev build vs ruby 2.7. After couple of rounds of warmup, it finished with comparable to MRI performance.

truffleruby 20.2.0-dev-91e7ba04, like ruby 2.6.5, GraalVM CE Native [x86_64-linux]

24.871960829999807
28.58095245200002
23.962383978000616
9.586286597999788
6.702010275999783
6.586533140000029
6.447069820999786
6.553509518999817
6.646038615999714
6.462447326999609
ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux]

5.9263605000005555
5.8942821059999915
5.90000361500006
5.886332756999764
5.880790037000224
5.912058664999677
5.898451475000002
5.8683086130004085
5.882981536999978
5.879075885999555

Code I used:

require 'csv'
require 'benchmark'

dir = "data"
Dir.mkdir(dir) unless Dir.exist?(dir)

input = "#{dir}/extract.csv"

unless File.exist?(input)
  Dir.chdir(dir) do
    zip = "sirene_2017111_E_Q.zip"
    system "wget", "-O", zip, "http://files.data.gouv.fr/sirene/#{zip}"
    system "unzip", zip
    system "head -10000 sirc* > extract.csv"
  end
end

10.times do
  p Benchmark.realtime {
    CSV.foreach(input, col_sep: ';', encoding: "ISO-8859-1:UTF-8") { |row|
    }
  }
end
eregon commented 4 years ago

Thanks, I think this is close enough to MRI to not be "slower" anymore (at least not 2x). Maybe it can be further optimized but the performance of that benchmark for CSV seems reasonable now, so I'd like to close this issue.

thbar commented 4 years ago

@gogainda thanks for checking!

@eregon agreed, this looks perfectly acceptable for now. I will provide other ETL-based benchmarks in the future ;-)