ruby / strscan

Provides lexical scanning operations on a String.
BSD 2-Clause "Simplified" License

Adding two methods: `currchar` and `nextchar` #53

Closed shreeve closed 1 year ago

shreeve commented 1 year ago

The strscan library is great for high-performance work. To read "the next character", we can use the `getch` method, which returns the character at the current position and then advances the pointer to the next character. This is exactly what is needed in many cases. Sometimes, however, we just need to read the current character without advancing the pointer. We can't always use `peek(1)`, because it is not multibyte aware. What we need here is a "get the current character" method. There are also times when we already know the current character and want to advance to, and return, the next one. A method called `nextchar` would work well for that.
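To make the `peek(1)` limitation concrete, here is a small sketch (my own illustration, not code from this issue) of the behavior described above:

```ruby
require "strscan"

s = StringScanner.new("héllo")  # "é" is two bytes in UTF-8
s.getch      # => "h"    returns the current character, then advances the pointer
s.peek(1)    # => "\xC3" peek counts *bytes*, so it returns only half of "é"
s.getch      # => "é"    getch is multibyte aware, but it always advances
```

A `currchar` method would return the full "é" at the current position without moving the pointer.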

Here's a diagram:

[diagram: scanner pointer positions for currchar and nextchar]

Adding `currchar` and `nextchar` to strscan gives us two very helpful, highly performant methods. Using them, I've been able to increase the parsing speed of one of my CSV libraries by 60%.

shreeve commented 1 year ago

Here is some in-progress code for a CSV parser I'm working on:

#!/usr/bin/env ruby

# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
#   Date: Feb 4, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
# https://github.com/ruby/strscan/issues/53 for details
# https://github.com/ruby/strscan/pull/54 for code
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code base with streamlined logic
# 3. Support for most non-compliant CSV variations (eg - @relax, @excel)
#
# TODO:
# 1. Support IO streaming
# 2. Add option to strip whitespace
# 3. Support CSV headers in first row
# ============================================================================

require "bundler/setup"
require "strscan"

class Censive < StringScanner

  def self.writer(obj=nil, **opts, &code)
    case obj
    when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
    when IO,nil then new(out: obj, **opts, &code)
    else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
    end
  end

  def initialize(str=nil,
    drop:  false   , # drop trailing empty fields?
    eol:   "\n"    , # line endings for exports
    excel: false   , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
    mode:  :compact, # export mode: compact or full
    out:   nil     , # output stream, needs to respond to <<
    quote: '"'     , # quote character
    relax: false   , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
    sep:   ","     , # column separator character
    **opts           # grab bag
  )
    super(str || "")
    reset

    @drop   = drop
    @eol    = eol
    @excel  = excel
    @mode   = mode
    @out    = out || $stdout
    @quote  = quote
    @relax  = relax
    @sep    = sep

    @cr     = "\r"
    @lf     = "\n"
    @es     = ""
    @eq     = "="
    @esc    = (@quote * 2)
  end

  def reset(str=nil)
    self.string = str if str
    super()
    @char = curr_char
    @rows = nil
    @cols = @cells = 0
  end

  # ==[ Lexer ]==

  # pure ruby versions for debugging
  # def curr_char;             @char = string[pos]; end
  # def next_char; scan(/./m); @char = string[pos]; end

  def curr_char; @char = currchar; end
  def next_char; @char = nextchar; end

  def next_token
    if @excel && @char == @eq
      excel = true
      next_char
    end

    if @char == @quote # consume quoted cell
      token = ""
      while true
        next_char
        token << (scan_until(/(?=#{@quote})/o) or bomb "unclosed quote")
        token << @quote and next if next_char == @quote
        break if [@sep,@cr,@lf,@es,nil].include?(@char)
        @relax or bomb "invalid character after quote"
        token << @quote + scan_until(/(?=#{@quote})/o) + @quote
      end
      next_char if @char == @sep
      token
    elsif [@sep,@cr,@lf,@es,nil].include?(@char)
      case @char
      when @sep then next_char                     ; @es
      when @cr  then next_char == @lf and next_char; nil
      when @lf  then next_char                     ; nil
      else                                           nil
      end
    else # consume unquoted cell
      token = scan_until(/(?=#{@sep}|#{@cr}|#{@lf}|\z)/o) or bomb "unexpected character"
      token.prepend(@eq) if excel
      next_char if curr_char == @sep
      token
    end
  end

  def bomb(msg)
    abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
  end

  # ==[ Parser ]==

  def parse
    @rows = []
    while row = next_row
      @rows << row
      count = row.size
      @cols = count if count > @cols
      @cells += count
    end
    @rows
  end

  def next_row
    token = next_token or return
    row = [token]
    row << token while token = next_token
    row
  end

  # ==[ Helpers ]==

  # returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
  def grok(str)
    if idx = str.index(/(#{@quote})|#{@sep}|#{@cr}|#{@lf}/o) #!# FIXME: regex injection?
      $1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
    else
      0
    end
  end

  # output a row
  def <<(row)

    # drop trailing empty columns
    row.pop while row.last.empty? if @drop

    s,q = @sep, @quote
    out = case @mode
    when :compact
      case @excel ? 2 : grok(row.join)
      when 0
        row
      when 1
        row.map do |col|
          col.match?(/#{@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
        end
      else
        row.map do |col|
          @excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
          case grok(col)
          when 0 then col
          when 1 then "#{q}#{col}#{q}"
          else        "#{q}#{col.gsub(q, @esc)}#{q}"
          end
        end
      end
    when :full
      if @excel
        row.map do |col|
          col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
        end
      else
        row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
      end
    end.join(s)

    @out << out + @eol
  end

  def each
    @rows ||= parse
    @rows.each {|row| yield row }
  end

  def export(**opts)
    out = opts.empty? ? self : self.class.writer(**opts)
    each {|row| out << row }
  end

  def stats
    wide = string.size.to_s.size
    puts "%#{wide}d rows"    % @rows.size
    puts "%#{wide}d columns" % @cols
    puts "%#{wide}d cells"   % @cells
    puts "%#{wide}d bytes"   % string.size
  end
end

if __FILE__ == $0
  raw = DATA.read
  csv = Censive.new(raw, excel: true, relax: true)
  csv.export # (sep: ",", excel: true)
end

__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
shreeve commented 1 year ago

Here are some performance numbers comparing the baseline pure-Ruby versions with these methods implemented natively in strscan:

Baseline Ruby methods

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 21.08     5.37      2.04   226917     0.01     0.01  Censive#next_char
 13.52     6.67      1.31   188985     0.01     0.01  Censive#curr_char

Implemented in strscan

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  9.06     4.85      0.61   226917     0.00     0.00  Censive#next_char
  7.39     5.35      0.50   188985     0.00     0.00  Censive#curr_char

Speedup

nextchar speedup = 2.04 secs / 0.61 secs => 3.34x faster
currchar speedup = 1.31 secs / 0.50 secs => 2.62x faster
kou commented 1 year ago

Why do you want to compare the current character yourself instead of using the StringScanner API? For example, why do you use `if @char == @quote # consume quoted cell` instead of `if scan(@quote)`?

shreeve commented 1 year ago

@kou - There are many reasons. In the case you cited, the current character was previously read, so it's already available in a variable. While parsing the input, there are many cases where a character is checked and does not match, and we don't want to go back to the scanner and recheck it if we already have the value.

Another reason is that the scan method only takes a regex, and it also consumes its input. The way the scanner is used is highly dependent on the logic in the program. In the case of the censive library (the code I pasted above), we have to very carefully control when we invoke the scanner, when we can reuse a previous character, when we advance the pointer, and where we are in the string.

The goal of PR #54 discussed here is to simply add two new very helpful methods for using strscan.

Using curr_char simply returns the current character, which depends on what we've done so far and on method calls such as scan or scan_until, after which the final/current character might not be known. This can easily happen when doing something like this: s.scan_until(/(?=#{@quote})/o). In that case, I'm asking strscan to scan up to, but not including, the final character (here, a quote). There are many cases where this is extremely helpful, but there is currently no way to read that character with strscan; curr_char does it in a simple, performant way.

Using next_char is extremely useful when, for whatever reason, I already know the current character and I just need to advance one more character and return it, in one efficient call.

There is currently no simple, highly performant way to do this in strscan; PR #54 solves it in a simple way.
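The zero-width lookahead situation above can be seen in isolation. This sketch (my own, not part of the PR) shows why the scanner ends up sitting on a character it never returned:

```ruby
require "strscan"

s = StringScanner.new('abc"def')
s.scan_until(/(?=")/)  # => "abc"  — zero-width match: stops just before the quote
s.rest                 # => "\"def" — the quote itself is still unconsumed
s.string[s.pos]        # => "\""   — reading the current character by hand
```

The proposed `currchar` would replace that last manual `string[pos]` indexing (which is also not safe for multibyte positions, since `pos` is a byte offset).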

kou commented 1 year ago

I don't think you need to extract the current character and compare it outside of StringScanner. You can just use scan:

#!/usr/bin/env ruby

# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
#   Date: Feb 4, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
# https://github.com/ruby/strscan/issues/53 for details
# https://github.com/ruby/strscan/pull/54 for code
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code base with streamlined logic
# 3. Support for most non-compliant CSV variations (eg - @relax, @excel)
#
# TODO:
# 1. Support IO streaming
# 2. Add option to strip whitespace
# 3. Support CSV headers in first row
# ============================================================================

require "bundler/setup"
require "strscan"

class Censive < StringScanner

  def self.writer(obj=nil, **opts, &code)
    case obj
    when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
    when IO,nil then new(out: obj, **opts, &code)
    else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
    end
  end

  def initialize(str=nil,
    drop:  false   , # drop trailing empty fields?
    eol:   "\n"    , # line endings for exports
    excel: false   , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
    mode:  :compact, # export mode: compact or full
    out:   nil     , # output stream, needs to respond to <<
    quote: '"'     , # quote character
    relax: false   , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
    sep:   ","     , # column separator character
    **opts           # grab bag
  )
    super(str || "")
    reset

    @drop   = drop
    @eol    = eol
    @excel  = excel
    @mode   = mode
    @out    = out || $stdout
    @quote  = quote
    @relax  = relax
    @sep    = sep

    @cr     = "\r"
    @lf     = "\n"
    @es     = ""
    @eq     = "="
    @esc    = (@quote * 2)
  end

  def reset(str=nil)
    self.string = str if str
    super()
    @rows = nil
    @cols = @cells = 0
  end

  # ==[ Lexer ]==

  # pure ruby versions for debugging
  #def curr_char;             @char = string[pos]; end
  #def next_char; scan(/./m); @char = string[pos]; end

  def curr_char; @char = currchar; end
  def next_char; @char = nextchar; end

  def next_token
    return nil if eos?

    if @excel && scan(@eq)
      excel = true
    end

    if scan(@quote) # consume quoted cell
      token = (scan_until(/#{Regexp.escape(@quote)}/o) or bomb "unclosed quote")[0..-2]
      scan(@sep)
      token
    elsif scan(@sep)
      @es
    elsif scan(@cr)
      scan(@lf)
      nil
    elsif scan(@lf)
      nil
    else # consume unquoted cell
      token = scan_until(/(?=#{@sep}|#{@cr}|#{@lf}|\z)/o) or bomb "unexpected character"
      token.prepend(@eq) if excel
      scan(@sep)
      token
    end
  end

  def bomb(msg)
    abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
  end

  # ==[ Parser ]==

  def parse
    @rows = []
    while row = next_row
      @rows << row
      count = row.size
      @cols = count if count > @cols
      @cells += count
    end
    @rows
  end

  def next_row
    token = next_token or return
    row = [token]
    row << token while token = next_token
    row
  end

  # ==[ Helpers ]==

  # returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
  def grok(str)
    if idx = str.index(/(#{@quote})|#{@sep}|#{@cr}|#{@lf}/o) #!# FIXME: regex injection?
      $1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
    else
      0
    end
  end

  # output a row
  def <<(row)

    # drop trailing empty columns
    row.pop while row.last.empty? if @drop

    s,q = @sep, @quote
    out = case @mode
    when :compact
      case @excel ? 2 : grok(row.join)
      when 0
        row
      when 1
        row.map do |col|
          col.match?(/#{@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
        end
      else
        row.map do |col|
          @excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
          case grok(col)
          when 0 then col
          when 1 then "#{q}#{col}#{q}"
          else        "#{q}#{col.gsub(q, @esc)}#{q}"
          end
        end
      end
    when :full
      if @excel
        row.map do |col|
          col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
        end
      else
        row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
      end
    end.join(s)

    @out << out + @eol
  end

  def each
    @rows ||= parse
    @rows.each {|row| yield row }
  end

  def export(**opts)
    out = opts.empty? ? self : self.class.writer(**opts)
    each {|row| out << row }
  end

  def stats
    wide = string.size.to_s.size
    puts "%#{wide}d rows"    % @rows.size
    puts "%#{wide}d columns" % @cols
    puts "%#{wide}d cells"   % @cells
    puts "%#{wide}d bytes"   % string.size
  end
end

if __FILE__ == $0
  raw = DATA.read
  csv = Censive.new(raw, excel: true, relax: true)
  csv.export # (sep: ",", excel: true)
end

__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"

The goal of PR #54 discussed here is to simply add two new very helpful methods for using strscan.

In general, I don't want to add APIs that don't match recommended usage.

shreeve commented 1 year ago

Wow!!!

Your suggestions were very helpful! After a few tweaks, the code now runs significantly faster using just the standard scan method from strscan. I'm really impressed, and the code is actually simpler too. I've also added the code needed to handle common CSV variants. The result looks great, and it's much faster and cleaner.

I have a mid-size CSV file that I have been testing against the standard CSV library that comes with Ruby. In that rough test, the default CSV library processes that file in about 12.5 seconds, whereas the latest updates here allow me to process the same file in 4.67 seconds. Thank you so much for the idea on how to do this better. I'll close the pull request, since it wasn't needed.

Now that I see how helpful the raw scan method is, there is one potential PR that could be helpful. I didn't realize before that the scan method can accept a string argument (I thought it only took regex values). When you pass a string to scan, it uses an optimization that makes the processing very quick. However, the scan_until method does not allow the same optimization, since it only accepts regex values. We could potentially benefit from a scan_until method that also has an optimized path for processing strings. It would also bring scan_until more closely into alignment with the scan method (ie - why should scan accept a string or a regex, when scan_until can only accept a regex?).
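As a quick illustration of the difference (my own sketch; behavior as of the strscan shipped with Ruby 3.2, where scan_until still required a Regexp):

```ruby
require "strscan"

s = StringScanner.new("key=value;rest")
s.scan("key")      # => "key"    — scan accepts a literal String (fast, no regex engine)
s.scan("=")        # => "="
s.scan_until(/;/)  # => "value;" — scan_until needs a Regexp, even for a literal delimiter
s.rest             # => "rest"
```

The proposal is to let that third call be written as `s.scan_until(";")` and take the same fast literal path as scan.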

Again, thank you for your suggestions!

shreeve commented 1 year ago

Here's the updated code, with your suggestions. I still have a few things to tweak, but I'm almost there. :-)

#!/usr/bin/env ruby

# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
#   Date: Feb 5, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
#
# Thanks to Sutou Kouhei (kou) for his excellent advice on using scan
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code with streamlined and optimized logic
# 3. Support most non-compliant CSV variations (eg - @excel, @relax, @strip)
#
# TODO: Support IO streaming
# ============================================================================

require "strscan"

class Censive < StringScanner

  def self.writer(obj=nil, **opts, &code)
    case obj
    when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
    when IO,nil then new(out: obj, **opts, &code)
    else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
    end
  end

  def initialize(str=nil,
    drop:   false   , # drop trailing empty fields?
    excel:  false   , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
    mode:   :compact, # export mode: compact or full
    out:    $stdout , # output stream, needs to respond to <<
    quote:  '"'     , # quote character
    relax:  false   , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
    rowsep: "\n"    , # row separator for export
    sep:    ","     , # column separator character
    strip:  false   , # strip fields when reading
    **opts            # grab bag
  )
    super(str || "")
    reset

    # options
    @drop    = drop
    @excel   = excel
    @mode    = mode
    @out     = out
    @quote   = quote
    @relax   = relax
    @rowsep  = rowsep
    @sep     = sep
    @strip   = strip

    # determined
    @cr  = "\r"
    @lf  = "\n"
    @es  = ""
    @eq  = "="
    @esc = (@quote * 2)
    @eol = /#{@cr}#{@lf}?|#{@lf}|\z/o             # end of line
    @eoc = /(?=#{"\\" + @sep}|#{@cr}|#{@lf}|\z)/o # end of cell
  end

  def reset(str=nil)
    self.string = str if str
    super()
    @rows = nil
    @cols = @cells = 0
  end

  # ==[ Lexer ]==

  def next_token
    excel = true if @excel && scan(@eq)

    if scan(@quote) # consume quoted cell
      token = ""
      while true
        token << (scan_until(/#{@quote}/o) or bomb "unclosed quote")[0..-2]
        token << @quote and next if scan(@quote)
        break if scan(@eoc)
        @relax or bomb "invalid character after quote"
        token << @quote + (scan_until(/#{@quote}/o) or bomb "bad inline quote")
      end
    elsif scan(@sep) then return @es
    elsif scan(@eol) then return nil
    else # consume unquoted cell
      token = scan_until(@eoc) or bomb "unexpected character"
      token.prepend(@eq) if excel
    end
    scan(@sep)
    @strip ? token.strip : token
  end

  def bomb(msg)
    abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
  end

  # ==[ Parser ]==

  def parse
    @rows = []
    while row = next_row
      @rows << row
      count = row.size
      @cols = count if count > @cols
      @cells += count
    end
    @rows
  end

  def next_row
    token = next_token or return
    row = [token]
    row << token while token = next_token
    row
  end

  # ==[ Helpers ]==

  # returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
  def grok(str)
    if idx = str.index(/(#{@quote})|#{"\\"+@sep}|#{@cr}|#{@lf}/o)
      $1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
    else
      0
    end
  end

  # output a row
  def <<(row)

    # drop trailing empty columns
    row.pop while row.last.empty? if @drop

    s,q = @sep, @quote
    out = case @mode
    when :compact
      case @excel ? 2 : grok(row.join)
      when 0
        row
      when 1
        row.map do |col|
          col.match?(/#{"\\"+@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
        end
      else
        row.map do |col|
          @excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
          case grok(col)
          when 0 then col
          when 1 then "#{q}#{col}#{q}"
          else        "#{q}#{col.gsub(q, @esc)}#{q}"
          end
        end
      end
    when :full
      if @excel
        row.map do |col|
          col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
        end
      else
        row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
      end
    end.join(s)

    @out << out + @rowsep
  end

  def each
    @rows ||= parse
    @rows.each {|row| yield row }
  end

  def export(**opts)
    out = opts.empty? ? self : self.class.writer(**opts)
    each {|row| out << row }
  end

  def stats
    wide = string.size.to_s.size
    puts "%#{wide}d rows"    % @rows.size
    puts "%#{wide}d columns" % @cols
    puts "%#{wide}d cells"   % @cells
    puts "%#{wide}d bytes"   % string.size
  end
end

if __FILE__ == $0
  raw = DATA.read
  csv = Censive.new(raw, excel: true, relax: true)
  csv.export(excel: true)
end

__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
kou commented 1 year ago

That's good to hear.

I have a mid-size CSV file that I have been testing against the standard CSV library that comes with Ruby. In that rough test, the default CSV library processes that file in about 12.5 seconds, whereas the latest updates here allow me to process the same file in 4.67 seconds.

Could you share the benchmark code? I want to improve https://github.com/ruby/csv performance as a maintainer.

However, the scan_until method does not allow the same optimization, since it only accepts regex values. We could potentially benefit from a scan_until method that also has an optimized path for processing strings. It would also bring scan_until more closely into alignment with the scan method (ie - why should scan accept a string or a regex, when scan_until can only accept a regex?).

The proposal is acceptable. Could you open a new issue for it?

kou commented 1 year ago

FYI: CSV parser in https://rubygems.org/gems/red-arrow is the fastest CSV parser for Ruby:

require "arrow"
Arrow::Table.load("XXX.csv").raw_records
kou commented 1 year ago

See also: Better CSV processing with Ruby 2.6: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2019/

shreeve commented 1 year ago

Update to the update: Not sure what happened earlier, but check the ruby profiler results below; they show censive running 1.52x faster than csv when parsing KEN_ALL.CSV:

Update: Oh, my... I ran the benchmarks wrong.

@kou - Thank you for sending those links, they were very helpful.

Following the example in your presentation, here are the results of some benchmarks. For the KEN_ALL.CSV, the following shows that censive (with your help!) is 1.54x faster than csv:

ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
             censive
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
      1.038 i/s -      10.000 times in 9.630566s (963.06ms/i)
                 csv
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
      0.674 i/s -      10.000 times in 14.830379s (1.48s/i)

Comparison:
             censive:         1.0 i/s
                 csv:         0.7 i/s - 1.54x  slower

Here is the censive.yaml benchmark configuration file:

loop_count: 10
prelude: |
  require "digest/md5"
  require "csv"
  require "censive"
  data = File.open("/tmp/KEN_ALL.CSV", "r:cp932") {|f| f.read }; nil
  $stderr.puts
benchmark:
  censive: rows = CSV.parse(data)         ; $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
  csv:     rows = Censive.new(data).parse ; $stderr.puts "    csv => " + Digest::MD5.hexdigest(rows.join)
shreeve commented 1 year ago

For reference, the data file is:

https://www.post.japanpost.jp/zipcode/dl/oogaki/zip/ken_all.zip

shreeve commented 1 year ago

It would be nice to run the entire benchmark suite against censive also. I believe it may perform even better on complex (or liberal) parsing.

shreeve commented 1 year ago

And the proposed scan_until method that accepts a string would make this even faster.

shreeve commented 1 year ago

I tried to benchmark against red-arrow, but it failed to compile on my Apple Silicon iMac (with the M1 chip). Here's the message:

compiling arrow.cpp
compiling converters.cpp
compiling memory-view.cpp
compiling raw-records.cpp
compiling values.cpp
linking shared-object arrow.bundle
Undefined symbols for architecture arm64:
  "_rbgerr_gerror2exception", referenced from:
      red_arrow::check_status(arrow::Status const&&, char const*) in converters.o
      red_arrow::check_status(arrow::Status const&&, char const*) in raw-records.o
      red_arrow::check_status(arrow::Status const&&, char const*) in values.o
  "_rbgobj_instance_from_ruby_object", referenced from:
      red_arrow::memory_view::primitive_array_get(unsigned long, rb_memory_view_t*, int) in memory-view.o
      red_arrow::memory_view::buffer_get(unsigned long, rb_memory_view_t*, int) in memory-view.o
      red_arrow::record_batch_raw_records(unsigned long) in raw-records.o
      red_arrow::table_raw_records(unsigned long) in raw-records.o
      red_arrow::array_values(unsigned long) in values.o
      red_arrow::chunked_array_values(unsigned long) in values.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [arrow.bundle] Error 1

make failed, exit code 2
shreeve commented 1 year ago

Wow... here's the latest update!

I ran another test to see if I could speed up censive further, and it shows that censive is actually 1.52x faster than csv when processing the KEN_ALL.CSV file. Here are the benchmark results:

First, here is the benchmark for csv:

ruby/csv/benchmark> ruby -rprofile test-csv.rb 2>&1 | head -25

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 72.31    49.93     49.93   124566     0.40     1.09  CSV::Parser#parse_quotable_loose
  3.24    52.16      2.24  3736999     0.00     0.00  Array#[]
  1.86    53.44      1.28  2117652     0.00     0.00  Integer#+
  1.75    54.66      1.21  1993040     0.00     0.00  String#empty?
  1.70    55.83      1.17  1993041     0.00     0.00  Integer#<
  1.68    56.99      1.16  1868475     0.00     0.00  CSV::Parser#validate_field_size
  1.68    58.15      1.16  1868475     0.00     0.00  Array#<<
  1.66    59.30      1.15   124566     0.01     1.10  CSV::Parser::Scanner#each_line
  1.66    60.45      1.15  1868475     0.00     0.00  String#count
  1.65    61.58      1.14  1868475     0.00     0.00  Integer#zero?
  1.03    62.29      0.71   996523     0.00     0.00  String#[]
  0.95    62.95      0.65   124566     0.01     0.57  Enumerator#next
  0.92    63.58      0.63   996520     0.00     0.00  String#start_with?
  0.91    64.21      0.63   996520     0.00     0.00  String#end_with?
  0.91    64.84      0.63   124565     0.01     0.01  CSV::FieldsConverter#need_convert?
  0.89    65.46      0.61   996549     0.00     0.00  Integer#==
  0.89    66.07      0.61   996520     0.00     0.00  Array#[]=
  0.87    66.67      0.60   124566     0.00     0.01  CSV::Parser::Scanner#keep_start
  0.49    67.01      0.34   124566     0.00     0.00  CSV::Parser::Scanner#keep_drop
  0.49    67.35      0.34   124565     0.00     0.01  CSV::FieldsConverter#convert
  0.41    67.63      0.28        1   283.44 68877.49  String#each_line
  0.33    67.86      0.23   124570     0.00     0.00  String#split
  0.23    68.02      0.16   249130     0.00     0.00  String#include?
  ...

Then, here is the benchmark for censive:

ruby/csv/benchmark> ruby -rprofile test-censive.rb 2>&1 | head -25

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 57.82    26.40     26.40  1993041     0.01     0.02  Censive#next_token
 17.94    34.60      8.19   124566     0.07     0.35  Censive#next_row
 11.83    40.00      5.40  7847598     0.00     0.00  StringScanner#scan
  3.08    41.40      1.41  1868475     0.00     0.00  StringScanner#scan_until
  3.03    42.79      1.38        1  1381.46 45622.85  Censive#parse
  2.55    43.95      1.16  1868475     0.00     0.00  Array#<<
  1.65    44.70      0.76   996521     0.00     0.00  String#[]
  1.52    45.40      0.69   996520     0.00     0.00  String#<<
  0.17    45.48      0.08   124573     0.00     0.00  Integer#>
  0.16    45.55      0.07   124571     0.00     0.00  Integer#+
  0.16    45.62      0.07   124576     0.00     0.00  Array#size
  0.08    45.66      0.04        2    18.01    18.01  IO#read
  0.00    45.66      0.00       22     0.07     0.55  Kernel#require
  0.00    45.66      0.00       92     0.01     0.03  Gem::BasicSpecification#have_file?
  0.00    45.66      0.00        1     0.31     0.31  TracePoint#enable
  0.00    45.66      0.00       38     0.01     0.01  Array#empty?
  0.00    45.66      0.00        4     0.05     0.12  Gem::Version#<=>
  0.00    45.66      0.00       41     0.00     0.04  Array#any?
  0.00    45.66      0.00       18     0.01     0.01  Gem::StubSpecification#missing_extensions?
  0.00    45.66      0.00       30     0.00     0.01  Gem::StubSpecification#extensions
  0.00    45.66      0.00       18     0.01     0.12  Gem::BasicSpecification#contains_requirable_file?
  0.00    45.66      0.00       47     0.00     0.00  Kernel#tap
  0.00    45.66      0.00       18     0.01     0.03  Gem::BasicSpecification#have_extensions?
  ...

Here are the two scripts used:

test-csv.rb:

#!/usr/bin/env ruby

require "csv"
require "digest/md5"

data = File.open("KEN_ALL.CSV", "r:cp932").read

rows = CSV.parse(data)

puts Digest::MD5.hexdigest(rows.join)

test-censive.rb:

#!/usr/bin/env ruby

require "censive"
require "digest/md5"

data = File.open("KEN_ALL.CSV", "r:cp932").read

rows = Censive.parse(data)

puts Digest::MD5.hexdigest(rows.join)

Calculating the speedup

I did three runs of the above and averaged their run times to get a speedup of 1.52x for censive over csv:

censive = ([45.490, 45.605, 45.273].sum / 3.0).round(2) # => 45.46
csv     = ([68.882, 69.357, 68.920].sum / 3.0).round(2) # => 69.05
(csv / censive).round(2) # => 1.52
shreeve commented 1 year ago

Maybe the ruby -rprofile script.rb approach isn't very good for comparing benchmark numbers? It gives different results than the benchmark scripts do!?

kou commented 1 year ago
benchmark:
  censive: rows = CSV.parse(data)         ; $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
  csv:     rows = Censive.new(data).parse ; $stderr.puts "    csv => " + Digest::MD5.hexdigest(rows.join)

Inverted?

benchmark:
  csv:     rows = CSV.parse(data)         ; $stderr.puts "    csv => " + Digest::MD5.hexdigest(rows.join)
  censive: rows = Censive.new(data).parse ; $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)

I tried to benchmark against red-arrow, but it failed to compile on my Apple Silicon iMac (with the M1 chip).

Ah, this was fixed in red-arrow 11.0.0 that will be published soon. See also: https://github.com/apache/arrow/pull/14960

shreeve commented 1 year ago

@kou - Yes, I inverted the initial benchmark tests. Then I ran ruby -rprofile ... and got a result that showed censive was faster. But I think the ruby -rprofile ... approach is really only helpful when optimizing a single program; it's not very good for comparing across different programs the way the benchmark utility is. When I ran the benchmark utility, I saw that censive was slower by a factor of about 1.5x.
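For what it's worth, the difference is easy to see with a plain wall-clock measurement (a minimal sketch using Ruby's stdlib Benchmark, not the benchmark-driver config used elsewhere in this thread):

```ruby
require "benchmark"

# Wall-clock timing measures end-to-end runtime. By contrast, ruby -rprofile
# adds per-method-call instrumentation overhead, which penalizes call-heavy
# code, so the two approaches can rank programs differently.
elapsed = Benchmark.realtime do
  100_000.times { "a,b,c".split(",") }
end
puts "split: %.3fs" % elapsed
```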

shreeve commented 1 year ago

@kou - However... I did some interesting work, based on some of the ideas from your presentation, and have made some huge improvements. So far, the results for the KEN_ALL.CSV file haven't changed much. Here's my benchmark config file:

loop_count: 10
prelude: |
  require "digest/md5"

  base = "/tmp"

  require "csv"
  require File.join(base, "censive")

  file, mode = "KEN_ALL.CSV", "r:cp932"

  path = File.join(base, file)
  data = File.open(path, *mode) {|f| f.read }; nil
  data = File.open(path, *mode) {|f| f.read }; nil

  $stderr.puts

benchmark:
  csv:     rows = CSV    .parse(data); $stderr.puts "    csv => " + Digest::MD5.hexdigest(rows.join)
  censive: rows = Censive.parse(data); $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)

When using the KEN_ALL.CSV data file, I see this:

ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
                 csv
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
      1.038 i/s -      10.000 times in 9.636874s (963.69ms/i)
             censive
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
      0.699 i/s -      10.000 times in 14.299854s (1.43s/i)

Comparison:
                 csv:         1.0 i/s
             censive:         0.7 i/s - 1.48x  slower

When running another large file of customer data, I see this:

ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
                 csv
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
      0.081 i/s -      10.000 times in 123.819859s (12.38s/i)
             censive
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
      0.160 i/s -      10.000 times in 62.515838s (6.25s/i)

Comparison:
             censive:         0.2 i/s
                 csv:         0.1 i/s - 1.98x  slower

Summary

It seems that there is still a lot of variability based on the specifics of the files. I will do some work on the KEN_ALL.CSV file to see if I can speed up censive. But for my large customer data file, censive is 1.98x faster.

All of this is before doing the work on the scan_until method that will be able to accept a string. I think that should give an additional speedup.

shreeve commented 1 year ago

Updated version of censive based on the above

Look at the section commented # consume unquoted cell(s). It now tries to read the whole rest of the line. If it can't see any quotes, it simply splits on @sep. If it finds embedded quotes, it processes them in place, or backs off to the normal processing if needed. Note that in this case we can get "a token" that is actually something like "Alice,Bob,Charlie"; we simply call split(@sep) and return an array of tokens from the next_token method. In the next_row method, we happily deal with either an individual token like Alice or an array such as ["Alice", "Bob", "Charlie"] by using Ruby's splat operator: instead of row << token, we have row.push(*token). This approach gives us a massive speed boost. On one test, the run originally took about 12 seconds. I got it down to about 6.5 seconds using currchar and nextchar. With @kou's help and the scan method, it dropped to about 4.5 seconds. With this latest work, that same code now runs in 0.7 seconds.
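The splat behavior is easy to demonstrate in isolation (a minimal sketch; the token values are made up):

```ruby
# next_token may yield a single cell, or, on the fast path, an array of
# cells produced by String#split. Array#push with a splat accepts either:
# a scalar is pushed as one element, an array is pushed element-by-element.
row = []
["Alice", ["Bob", "Charlie"], "Dave"].each do |token|
  row.push(*token)
end
p row # => ["Alice", "Bob", "Charlie", "Dave"]
```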

Anyway, here's the updated version of censive using all the above ideas and work. It still needs to be cleaned up a little, but it's coming along nicely! This will also be a good benchmark to test against when we have a version of scan_until that accepts strings.

Censive

#!/usr/bin/env ruby

# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
#   Date: Feb 5, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
#
# Thanks to Sutou Kouhei (kou) for his excellent advice on using scan
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code with streamlined and optimized logic
# 3. Support most non-compliant CSV variations (eg - @excel, @relax, @strip)
#
# TODO: Support IO streaming
# ============================================================================

require "strscan"

class Censive < StringScanner
  def self.parse(...)
    new(...).parse
  end

  def self.writer(obj=nil, **opts, &code)
    case obj
    when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
    when IO,nil then new(out: obj, **opts, &code)
    else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
    end
  end

  def initialize(str="",
    drop:     false   , # drop trailing empty fields?
    encoding: "utf-8" , # character encoding
    excel:    false   , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
    mode:     :compact, # export mode: compact or full
    out:      $stdout , # output stream, needs to respond to <<
    quote:    '"'     , # quote character
    relax:    false   , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
    rowsep:   "\n"    , # row separator for export
    sep:      ","     , # column separator character
    strip:    false   , # strip fields when reading
    **opts              # grab bag
  )
    # data source
    str = File.open(str, "r:#{encoding}").read if !str[100] && File.readable?(str)
    super(str)
    reset

    # options
    @drop     = drop
    @excel    = excel
    @mode     = mode
    @out      = out
    @quote    = quote
    @relax    = relax
    @rowsep   = rowsep
    @sep      = sep
    @strip    = strip

    # definitions
    @cr  = "\r"
    @lf  = "\n"
    @es  = ""
    @eq  = "="
    @esc = (@quote * 2)
    @eol = /#{@cr}#{@lf}?|#{@lf}|\z/o             # end of line
    @eoc = /(?=#{"\\" + @sep}|#{@cr}|#{@lf}|\z)/o # end of cell
  end

  def reset(str=nil)
    self.string = str if str
    super()
    @rows = nil
    @cols = @cells = 0
  end

  # ==[ Lexer ]==

  def next_token
    excel = true if @excel && scan(@eq)

    # consume quoted cell
    if scan(@quote)
      token = ""
      while true
        token << (scan_until(/#{@quote}/o) or bomb "unclosed quote")[0..-2]
        token << @quote and next if scan(@quote)
        break if scan(@eoc)
        @relax or bomb "invalid character after quote"
        token << @quote + (scan_until(/#{@quote}/o) or bomb "bad inline quote")
      end
      scan(@sep)
      return @strip ? token.strip : token
    end

    # consume unquoted cell(s)
    if text = scan(/[^#{@cr}#{@lf}#{@sep}#{@quote}]+[^#{@quote}#{@cr}#{@lf}]*/o)

      #!# TODO: still need to add support for @strip for this section

      return text       .split(@sep, -1) if !check(@quote)
      return text[0..-2].split(@sep, -1) if text[-1] == @sep

      # normal processing
      token = text + scan_until(@eoc) or bomb "unexpected character"
      token.prepend(@eq) if excel
      scan(@sep)
      return token.split(@sep, -1)
    end

    # other tokens
    scan(@sep) and return @es
    scan(@eol)
    nil
  end

  def bomb(msg)
    abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
  end

  # ==[ Parser ]==

  def parse
    @rows = []
    while row = next_row
      @rows << row
      count = row.size
      @cols = count if count > @cols
      @cells += count
    end
    @rows
  end

  def next_row
    token = next_token or return
    row = [*token]
    row.push(*token) while token = next_token
    row
  end

  # ==[ Helpers ]==

  # returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
  def grok(str)
    if idx = str.index(/(#{@quote})|#{"\\"+@sep}|#{@cr}|#{@lf}/o)
      $1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
    else
      0
    end
  end

  # output a row
  def <<(row)

    # drop trailing empty columns
    row.pop while row.last.empty? if @drop

    s,q = @sep, @quote
    out = case @mode
    when :compact
      case @excel ? 2 : grok(row.join)
      when 0
        row
      when 1
        row.map do |col|
          col.match?(/#{"\\"+@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
        end
      else
        row.map do |col|
          @excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
          case grok(col)
          when 0 then col
          when 1 then "#{q}#{col}#{q}"
          else        "#{q}#{col.gsub(q, @esc)}#{q}"
          end
        end
      end
    when :full
      if @excel
        row.map do |col|
          col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
        end
      else
        row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
      end
    end.join(s)

    @out << out + @rowsep
  end

  def each
    @rows ||= parse
    @rows.each {|row| yield row }
  end

  def export(**opts)
    out = opts.empty? ? self : self.class.writer(**opts)
    each {|row| out << row }
  end

  def stats
    wide = string.size.to_s.size
    puts "%#{wide}d rows"    % @rows.size
    puts "%#{wide}d columns" % @cols
    puts "%#{wide}d cells"   % @cells
    puts "%#{wide}d bytes"   % string.size
  end
end

if __FILE__ == $0
  raw = DATA.read
  csv = Censive.new(raw, excel: true, relax: true)
  csv.export # (sep: ",", excel: true)
end

__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
shreeve commented 1 year ago

@kou - I think perhaps my processing of the KEN_ALL.CSV file is slower because mixed encodings are causing a slowdown?

kou commented 1 year ago

I think that it's not related. We can use UTF-8 for the KEN_ALL.CSV data with the following configuration, but there's no difference in the results:

loop_count: 10
prelude: |
  require "csv"
  require "censive"
  data = File.open("/tmp/KEN_ALL.CSV", "r:cp932:utf-8") {|f| f.read }
benchmark:
  csv: CSV.parse(data)
  censive: Censive.new(data).parse
$ ruby -v -S benchmark-driver benchmark/censive.yaml
ruby 3.3.0dev (2023-01-25T01:52:37Z master e82cef1762) [x86_64-linux]
Calculating -------------------------------------
                 csv      0.731 i/s -      10.000 times in 13.687457s (1.37s/i)
             censive      0.398 i/s -      10.000 times in 25.103740s (2.51s/i)

Comparison:
                 csv:         0.7 i/s 
             censive:         0.4 i/s - 1.83x  slower
shreeve commented 1 year ago

@kou - I was working on the scan_until(string) but, the methods I need are these:

pos = rb_str_index(p->str, pattern, p->curr); /* pos is byte index */
pos = rb_str_sublen(p->str, pos); /* pos is now char index */

I can see and call rb_str_sublen, but I can't access rb_str_index. Do I need to pull in an additional header file or something like that?

kou commented 1 year ago

Could you open a new issue for scan_until(String)?

shreeve commented 1 year ago

@kou - Yes, I'll create an issue and will have some initial code to upload also.

shreeve commented 1 year ago

@kou - I worked on several optimizations and found the main one that the CSV library uses, which was fairly easy to implement and a little faster too. Here's the latest update on the KEN_ALL.CSV data:

ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
                 csv

File: "KEN_ALL.CSV"
Size: 12337910 bytes

    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
    csv => 3bda79d8a1d5b062bae6bae9dee3db63
      1.051 i/s -       5.000 times in 4.758448s (951.69ms/i)
             censive

File: "KEN_ALL.CSV"
Size: 12337910 bytes

censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
      1.288 i/s -       5.000 times in 3.882148s (776.43ms/i)

Comparison:
             censive:         1.3 i/s
                 csv:         1.1 i/s - 1.23x  slower

And here is the YAML file for the benchmark:

loop_count: 5
prelude: |
  require "digest/md5"

  base = "/tmp/ruby/csv/benchmark"

  require "csv"
  require File.join(base, "censive")

  name = "KEN_ALL.CSV"
  mode = "r:cp932"

  file = ENV["csvfile"] || name
  mode = "r" unless file == name

  path = File.join(base, file)
  data = File.open(path, *mode) {|f| f.read }; nil

  $stderr.puts "\n\nFile: #{name.inspect}\nSize: #{File.stat(path).size} bytes\n\n"

benchmark:
  csv:     rows = CSV    .parse(data); $stderr.puts "    csv => " + Digest::MD5.hexdigest(rows.join)
  censive: rows = Censive.parse(data); $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
shreeve commented 1 year ago

Latest version of censive

#!/usr/bin/env ruby

# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
#   Date: Feb 9, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
#
# Thanks to Sutou Kouhei (kou) for his excellent advice on using scan
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code with streamlined and optimized logic
# 3. Support most non-compliant CSV variations (@excel, @relax, etc)
# 4. Support most commonly used CSV options (@sep, @quote, @strip, @drop, etc)
#
# TODO:
# 1. Support IO streaming
# 2. Review all encodings, we may be losing speed when mixing encodings
# 3. Huge speedup possible if our @unquoted regex reads beyond @cr?@lf's
# 4. Will using String#freeze give us a speed up?
# 5. Implement support for scan_until(string) <= right now only regex is valid
# ============================================================================

require "strscan"

class Censive < StringScanner
  attr :encoding

  def self.parse(...)
    new(...).parse
  end

  def self.writer(obj=nil, **opts, &code)
    case obj
    when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
    when IO,nil then new(out: obj, **opts, &code)
    else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
    end
  end

  def initialize(str=nil,
    drop:     false   , # drop trailing empty columns?
    encoding: nil     , # character encoding
    excel:    false   , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
    mode:     :compact, # export mode: compact or full
    out:      nil     , # output stream, needs to respond to <<
    quote:    '"'     , # quote character
    relax:    false   , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
    rowsep:   "\n"    , # row separator for export
    sep:      ","     , # column separator character
    strip:    false   , # strip columns when reading
    **opts              # grab bag
  )
    # initialize data source
    if str && str.size < 100 && File.readable?(str)
      str = File.open(str, encoding ? "r:#{encoding}" : "r").read
    else
      str ||= ""
      str = str.encode(encoding) if encoding
    end
    super(str)
    reset

    # config options
    @cheat    = true
    @drop     = drop
    @encoding = str.encoding
    @excel    = excel
    @mode     = mode
    @out      = out || $stdout
    @relax    = relax
    @strip    = strip

    # config strings
    @quote    = quote
    @rowsep   = rowsep
    @sep      = sep

    # static strings
    @cr       = "\r"
    @lf       = "\n"
    @es       = ""
    @eq       = "="

    # combinations
    @esc      = (@quote * 2)
    @seq      = [@sep, @eq].join # used for parsing in excel mode

    #!# TODO: come up with a clean way to escape/encode all this
    #!# TODO: maybe define @tokens = "#{@quote}#{@sep}#{@cr}#{@lf}", etc.

    # regexes
    @eoc      = /(?=#{"\\" + @sep}|#{@cr}|#{@lf}|\z)/o # end of cell
    @eol      = /#{@cr}#{@lf}?|#{@lf}/o                # end of line
    @escapes  = /(#{@quote})|#{"\\"+@sep}|#{@cr}|#{@lf}/o
    @quotable = /#{"\\"+@sep}|#{@cr}|#{@lf}/o
    @quotes   = /#{@quote}/o
    @seps     = /#{@sep}+/o
    @quoted   = @excel ? /(?:=)?#{@quote}/o : @quote
    @unquoted = /[^#{@sep}#{@cr}#{@lf}][^#{@quote}#{@cr}#{@lf}]*/o
    @leadzero = /\A0\d*\z/
  end

  def reset(str=nil)
    @rows = nil
    @cols = @cells = 0

    #!# TODO: reset all encodings?
    self.string = str if str
    @encoding = string.encoding
    super()
  end

  # ==[ Parser ]==

  def parse
    @rows = []
    while row = next_row
      @rows << row
      count = row.size
      @cols = count if count > @cols
      @cells += count
    end
    @rows
  end

  def next_row
    if @cheat and line = scan_until(@eol)
      row = line.chomp!.split(@sep, -1)
      row.each do |col|
        next if (saw = col.count(@quote)).zero?
        next if (saw == 2) && col.delete_prefix!(@quote) && col.delete_suffix!(@quote)
        @cheat = false
        break
      end if line.include?(@quote)
      @cheat and return @strip ? row.each(&:strip!) : row
      unscan
    end

    token = next_token or return
    row = []
    row.push(*token)
    row.push(*token) while token = next_token
    row
  end

  def next_token
    if scan(@quoted) # quoted cell
      token = ""
      while true
        token << (scan_until(@quotes) or bomb "unclosed quote")[0..-2]
        token << @quote and next if scan(@quote)
        scan(@eoc) and break
        @relax or bomb "invalid character after quote"
        token << @quote + (scan_until(@quotes) or bomb "bad inline quote")
      end
      scan(@sep)
      @strip ? token.strip : token
    elsif match = scan(@unquoted) # unquoted cell(s)
      if check(@quote) && !match.chomp!(@sep) # if we see a stray quote
        unless @excel && match.chomp!(@seq) # unless an excel literal, fix it
          match << (scan_until(@eoc) or bomb "stray quote")
          scan(@sep)
        end
      end
      tokens = match.split(@sep, -1)
      @strip ? tokens.map!(&:strip) : tokens
    elsif scan(@sep)
      match = scan(@seps)
      match ? match.split(@sep, -1) : @es
    else
      scan(@eol)
      nil
    end
  end

  def each
    @rows ||= parse
    @rows.each {|row| yield row }
  end

  def export(**opts)
    out = opts.empty? ? self : self.class.writer(**opts)
    each {|row| out << row }
  end

  # ==[ Helpers ]==

  # returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
  def grok(str)
    if idx = str.index(@escapes)
      $1 ? 2 : str.index(@quotes, idx) ? 2 : 1
    else
      0
    end
  end

  # output a row
  def <<(row)

    # drop trailing empty columns
    row.pop while row.last.empty? if @drop

    s,q = @sep, @quote
    out = case @mode
    when :compact
      case @excel ? 2 : grok(row.join)
      when 0
        row
      when 1
        row.map do |col|
          col.match?(@quotable) ? "#{q}#{col}#{q}" : col
        end
      else
        row.map do |col|
          @excel && col =~ @leadzero ? "=#{q}#{col}#{q}" :
          case grok(col)
          when 0 then col
          when 1 then "#{q}#{col}#{q}"
          else        "#{q}#{col.gsub(q, @esc)}#{q}"
          end
        end
      end
    when :full
      if @excel
        row.map do |col|
          col =~ @leadzero ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
        end
      else
        row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
      end
    end.join(s)

    @out << out + @rowsep
  end

  def stats
    wide = string.size.to_s.size
    puts "%#{wide}d rows"    % @rows.size
    puts "%#{wide}d columns" % @cols
    puts "%#{wide}d cells"   % @cells
    puts "%#{wide}d bytes"   % string.size
  end

  def bomb(msg)
    abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
  end
end

if __FILE__ == $0
  raw = DATA.gets("\n\n").chomp
  # raw = File.read(ARGV.first || "lc-2023.csv")
  # raw = File.open("KEN_ALL.CSV", "r:cp932").read

  csv = Censive.new(raw, excel: true, relax: true)
  csv.export # (excel: true) # sep: "|")
end

__END__
"Don",="007",10,"Ed"
Name,Age,,,Shoe,,,
"Alice",27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
Subtotal,=sum(B2:B5),="01234"

A,B,C,D
A,B,"C",D
A,B,C",D
A,B,"C",D

# first line works in "relax" mode, bottom line is compliant
123,"CHO, JOELLE "JOJO"",456
123,"CHO, JOELLE ""JOJO""",456

# Excel mode checking
=,=x,x=,="x",="","","=",123,0123,="123",="0123"
,=x,x=,x,,,,,,=,,123,="0123",123,,="0123" # <= a little off
shreeve commented 1 year ago

Here's another one with a larger dataset for some customer files:

ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
                 csv

File: "5.csv"
Size: 143935739 bytes

    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
    csv => 56f74997e8d2ab994660af6a7f803532
      0.078 i/s -       5.000 times in 64.303857s (12.86s/i)
             censive

File: "5.csv"
Size: 143935739 bytes

censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
      0.155 i/s -       5.000 times in 32.195004s (6.44s/i)

Comparison:
             censive:         0.2 i/s
                 csv:         0.1 i/s - 2.00x  slower
shreeve commented 1 year ago

Sorry to paste everything here in this issue, but I just need a place to document the progress.

I'll get to the strscan#scan_until(string) next.

Regarding CSV in general, perhaps I should comment over on that repo, but I've done a lot of work looking into alternative implementations. For example, the Rust csv-core crate is really interesting. It's implemented as an NFA and then mapped to a DFA, which makes it very fast. I'll make some notes and create an issue in the proper repo for that.

shreeve commented 1 year ago

CSV state machine, largely inspired by the csv-core library on GitHub.

csv

shreeve commented 1 year ago

The logic for censive is not exactly as above, but this is a very helpful reference.
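For reference, the core of such a character-driven state machine can be sketched in a few lines of Ruby (a toy single-line parser illustrating the idea, not censive's actual implementation):

```ruby
# Toy CSV line parser: walk one character at a time, tracking whether we're
# inside a quoted cell, and treat a doubled quote inside quotes as a literal.
def parse_line(line, sep: ",", quote: '"')
  row, cell, in_quotes = [], +"", false
  chars = line.chars
  i = 0
  while i < chars.size
    c = chars[i]
    if in_quotes
      if c == quote
        if chars[i + 1] == quote # doubled quote -> literal quote character
          cell << quote
          i += 1
        else
          in_quotes = false      # closing quote ends the quoted region
        end
      else
        cell << c
      end
    elsif c == quote
      in_quotes = true           # opening quote starts a quoted region
    elsif c == sep
      row << cell                # separator ends the current cell
      cell = +""
    else
      cell << c
    end
    i += 1
  end
  row << cell
end

parse_line('a,"b,""c""",d') # => ["a", "b,\"c\"", "d"]
```

A real parser adds row-end handling (CR/LF), error states for stray quotes, and the relax/excel variations, which is where censive's logic diverges from the diagram.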

shreeve commented 1 year ago
digraph finite_state_machine {
  rankdir=LR;
  node [fontname="Helvetica,Arial,sans-serif", shape=circle, style=filled, fillcolor="#dddddd"];
  edge [fontname="Helvetica,Arial,sans-serif"]

  1  [label="1: StartRow"];
  2  [label="2: InComment"];
  3  [label="3: StartColumn", shape=doublecircle, fillcolor="#ffdddd"];
  4  [label="4: InQuotedColumn"];
  5  [label="5: InDoubleEscapedQuote"];
  6  [label="6: InEscapedQuote"];
  7  [label="7: InColumn"];
  8  [label="8: EndColumnSeparator"];
  9  [label="9: EndColumnRow", shape=doublecircle, fillcolor="#ffdddd"];
  10 [label="10: InRowEnd", shape=doublecircle, fillcolor="#ffdddd"];
  11 [label="11: CRLF"];
  12 [label="12: EndRow"];

  1  -> 1  [label="eol / discard"];
  1  -> 2  [label="comment / discard"];
  1  -> 3  [label="* / ε"];

  2  -> 1  [label="LF / discard"];
  2  -> 2  [label="* / discard"];

  3  -> 4  [label="quote & @quoting / discard"];
  3  -> 7  [label="* / copyout"];
  3  -> 8  [label="sep / discard"];
  3  -> 9  [label="eol / ε"]

  4  -> 4  [label="* / copyout"];
  4  -> 5  [label="quote & @quoting / discard"];
  4  -> 6  [label="esc & @quoting / discard"];

  5  -> 4  [label="quote & @quoting & @double-quote / copyout"];
  5  -> 7  [label="* / copyout"];
  5  -> 8  [label="sep / discard"];
  5  -> 9  [label="eol / ε"]

  6  -> 4  [label="* / copyout"];

  7  -> 7  [label="* / copyout"];
  7  -> 8  [label="sep / discard"];
  7  -> 9  [label="eol / ε"]

  8  -> 3  [label="* / ε"];

  9  -> 10 [label="* / ε"];

  10 -> 11 [label="CR & @isCRLF / discard"];
  10 -> 12 [label="* / discard"];

  11 -> 1  [label="* / ε"];
  11 -> 1  [label="LF / discard"];

  12 -> 1  [label="* / ε"];
}
kou commented 1 year ago

I worked on several optimizations and found the main one that the CSV library uses, which was fairly easy to implement and a little faster too.

What is the optimization? String#split?

BTW, how about opening a new issue at https://github.com/shreeve/censive about this topic instead of reusing this issue? It seems that the remaining topic isn't related to strscan.

shreeve commented 1 year ago

@kou - Per your suggestion, I've moved this issue over to https://github.com/shreeve/censive/issues/1