Tracking progress of `censive` performance

shreeve commented 1 year ago

Per @kou - Moving this conversation over here from https://github.com/ruby/strscan/issues/53

I worked on several optimizations and found the main one that the CSV library uses, which was fairly easy to implement and a little faster too.

What is the optimization? String#split?

Yes, the code is below. The fast path is controlled by the @cheat instance variable and stays enabled until a row turns up that a plain split cannot handle, at which point it gets deactivated.

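  def next_row
    # Fast path: grab one line and split it on the separator
    if @cheat and line = scan_until(@eol)
      row = line.chomp!.split(@sep, -1)
      row.each do |col|
        next if (saw = col.count(@quote)).zero? # no quotes in this column
        # exactly two quotes wrapping the whole column: strip them and move on
        next if (saw == 2) && col.delete_prefix!(@quote) && col.delete_suffix!(@quote)
        @cheat = false # anything else needs the full tokenizer
        break
      end if line.include?(@quote)
      @cheat and return @strip ? row.each(&:strip!) : row
      unscan # rewind so the slow path re-reads this line
    end

    # Slow path: build the row token by token
    token = next_token or return
    row = []
    row.push(*token)
    row.push(*token) while token = next_token
    row
  end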
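To make the per-column check easier to try outside the parser, here is a standalone sketch of the same idea. The method name `simple_row` and its keyword arguments are hypothetical and are not part of censive; it returns nil where the method above would turn @cheat off and fall back to the tokenizer.

```ruby
# Hypothetical standalone version of the fast path, not censive's API.
# Returns the parsed row, or nil when a plain split is not safe.
def simple_row(line, sep: ",", quote: '"')
  row = line.chomp.split(sep, -1) # -1 keeps trailing empty columns
  return row unless line.include?(quote)

  row.each do |col|
    saw = col.count(quote)
    next if saw.zero?
    # Only accept a column that is wrapped in exactly one pair of quotes.
    return nil unless saw == 2 && col.start_with?(quote) && col.end_with?(quote)
    col.delete_prefix!(quote)
    col.delete_suffix!(quote)
  end
  row
end

p simple_row(%(a,"b",c\n))  # quotes wrap a whole column, so the split is safe
p simple_row(%("a,b",c\n))  # separator inside quotes, so this returns nil
```

The second call shows why the fallback exists: splitting on the separator cuts the quoted field in two, each half has a single quote, and the check rejects the row.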

There is another optimization, even larger, where you could do this for the entire file if it's determined that the file doesn't contain any quote characters at all. But this seems uncommon and is probably not worth it.
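For reference, a minimal sketch of what that whole-file shortcut could look like, assuming the separator, quote, and line ending are plain strings. The name `fast_parse` is hypothetical and is not in censive; it returns nil when the shortcut does not apply so a caller could fall back to the regular parser.

```ruby
# Hypothetical helper, not part of censive: parse the whole input with
# String#split when it contains no quote characters at all.
def fast_parse(data, sep: ",", quote: '"', eol: "\n")
  # Any quote character means a field may hide a separator or a line
  # ending, so the shortcut is unsafe; let the caller use the full parser.
  return nil if data.include?(quote)

  # The -1 limit keeps trailing empty columns instead of dropping them.
  data.each_line(eol, chomp: true).map { |line| line.split(sep, -1) }
end

rows = fast_parse("a,b,c\n1,,3\n4,5,\n")
p rows
```

The single `include?` scan over the whole input is the only extra cost, which is why this only pays off for files that really are quote-free.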

shreeve commented 1 year ago

@kou - The main thing now is that I can do comparisons directly with the main csv library and show that censive is significantly faster in all the cases that I've tested. It was slower in the case of the KEN_ALL.CSV data, but that has changed now, with these numbers:

censive/lib> time test-csv.rb
3bda79d8a1d5b062bae6bae9dee3db63 KEN_ALL.CSV (12337910 size)
test-csv.rb  0.98s user 0.06s system 79% cpu 1.312 total

censive/lib> time test-censive.rb
3bda79d8a1d5b062bae6bae9dee3db63 KEN_ALL.CSV (12337910 size)
test-censive.rb  0.81s user 0.04s system 87% cpu 0.978 total

With the KEN_ALL.CSV data, the wall-clock result for this run is:

1.312 seconds / 0.978 seconds = 1.34x speedup for censive
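The arithmetic can be checked with a few lines of Ruby. This only uses the numbers from the two `time` runs above; the variable names are my own, and the user-CPU ratio is added because the wall-clock total also includes process startup and file I/O.

```ruby
# Totals copied from the two `time` runs above.
csv_total     = 1.312 # seconds wall-clock, test-csv.rb
censive_total = 0.978 # seconds wall-clock, test-censive.rb
csv_user      = 0.98  # seconds user CPU, test-csv.rb
censive_user  = 0.81  # seconds user CPU, test-censive.rb

speedup      = csv_total / censive_total
saved        = (1 - censive_total / csv_total) * 100
user_speedup = csv_user / censive_user

puts format("%.2fx wall-clock speedup (%.1f%% less time)", speedup, saved)
puts format("%.2fx user-CPU speedup", user_speedup)
```

This is a single run on one machine, so the ratios are indicative rather than a stable measurement.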

shreeve commented 1 year ago

Another advantage is that the code is a fraction of the size. The censive library is not fully done yet, but it is very usable as-is and already has support for some nice features and API. There is still work to do to finish it all off though.

The main thing was @kou's suggestion on how to better use the strscan library.

shreeve commented 1 year ago

@kou - I'm also working on a much easier benchmarking tool called flay, which I plan to use instead of benchmark-driver. As that code comes along, I'd love to do several other benchmarks to compare with strscan and csv and several other libraries. This has been a lot of fun and I think it's nice to help speed up code that many people are using.

shreeve commented 1 year ago

The flay benchmarking tool has been moved to a new gem called winr.

shreeve commented 1 year ago

Here's the result of running winr to compare censive and csv:

[image: winr results]

And here is the winr config file for this:

{
  environments: [
    {
      begin: <<~"|".strip,
        require "digest/md5"

        path = ARGV[0] || "KEN_ALL.CSV"
        mode = path =~ /^ken/i ? "r:cp932" : "r"
        data = File.open(path, mode).read
      |
    }
  ],
  tasks: [
    {
      name: "csv",
      begin: "require 'csv'",
      script: <<~"|".strip,
        rows = CSV.parse(data)
        puts "%s %s (%d size)" % [Digest::MD5.hexdigest(rows.join), path, File.stat(path).size], ""
      |
    },
    {
      name: "censive",
      begin: "require 'censive'",
      script: <<~"|".strip,
        rows = Censive.parse(data)
        puts "%s %s (%d size)" % [Digest::MD5.hexdigest(rows.join), path, File.stat(path).size], ""
      |
    },
  ],
}
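Both tasks print an MD5 of `rows.join` so the two parsers' output can be compared at a glance. A small sketch of what that fingerprint does and does not catch; the `fingerprint` and `strict_fingerprint` helpers and the sample rows are made up for illustration and are not part of winr or censive.

```ruby
require "digest/md5"

# Same fingerprint the config prints: MD5 over all fields joined together.
def fingerprint(rows)
  Digest::MD5.hexdigest(rows.join)
end

# A stricter hypothetical variant that keeps row and field boundaries.
def strict_fingerprint(rows)
  Digest::MD5.hexdigest(rows.inspect)
end

a = [["x", "y"], ["1", nil]] # e.g. a parser that returns nil for empty fields
b = [["x", "y"], ["1", ""]]  # e.g. a parser that returns "" for empty fields
c = [["xy"], ["1"]]          # different shape, same characters

puts fingerprint(a) == fingerprint(b)               # nil and "" both join as nothing
puts fingerprint(a) == fingerprint(c)               # join discards the boundaries
puts strict_fingerprint(a) == strict_fingerprint(c) # inspect keeps them
```

So matching digests from the config are a quick sanity check that both parsers saw the same characters; the stricter variant would also flag differences in how fields were split or how empty fields are represented.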

shreeve commented 1 year ago

@kou - Including you here for reference. Now, I'll go back to strscan and will try to add the issue to allow scan_until(string).

kou commented 1 year ago

Thanks!
