Closed shreeve closed 1 year ago
Here is some in-progress code for a CSV parser I'm working on:
#!/usr/bin/env ruby
# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
# Date: Feb 4, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
# https://github.com/ruby/strscan/issues/53 for details
# https://github.com/ruby/strscan/pull/54 for code
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code base with streamlined logic
# 3. Support for most non-compliant CSV variations (eg - @relax, @excel)
#
# TODO:
# 1. Support IO streaming
# 2. Add option to strip whitespace
# 3. Support CSV headers in first row
# ============================================================================
require "bundler/setup"
require "strscan"
class Censive < StringScanner
def self.writer(obj=nil, **opts, &code)
case obj
when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
when IO,nil then new(out: obj, **opts, &code)
else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
end
end
def initialize(str=nil,
drop: false , # drop trailing empty fields?
eol: "\n" , # line endings for exports
excel: false , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
mode: :compact, # export mode: compact or full
out: nil , # output stream, needs to respond to <<
quote: '"' , # quote character
relax: false , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
sep: "," , # column separator character
**opts # grab bag
)
super(str || "")
reset
@drop = drop
@eol = eol
@excel = excel
@mode = mode
@out = out || $stdout
@quote = quote
@relax = relax
@sep = sep
@cr = "\r"
@lf = "\n"
@es = ""
@eq = "="
@esc = (@quote * 2)
end
def reset(str=nil)
self.string = str if str
super()
@char = curr_char
@rows = nil
@cols = @cells = 0
end
# ==[ Lexer ]==
# pure ruby versions for debugging
# def curr_char; @char = string[pos]; end
# def next_char; scan(/./m); @char = string[pos]; end
def curr_char; @char = currchar; end
def next_char; @char = nextchar; end
def next_token
if @excel && @char == @eq
excel = true
next_char
end
if @char == @quote # consume quoted cell
token = ""
while true
next_char
token << (scan_until(/(?=#{@quote})/o) or bomb "unclosed quote")
token << @quote and next if next_char == @quote
break if [@sep,@cr,@lf,@es,nil].include?(@char)
@relax or bomb "invalid character after quote"
token << @quote + scan_until(/(?=#{@quote})/o) + @quote
end
next_char if @char == @sep
token
elsif [@sep,@cr,@lf,@es,nil].include?(@char)
case @char
when @sep then next_char ; @es
when @cr then next_char == @lf and next_char; nil
when @lf then next_char ; nil
else nil
end
else # consume unquoted cell
token = scan_until(/(?=#{@sep}|#{@cr}|#{@lf}|\z)/o) or bomb "unexpected character"
token.prepend(@eq) if excel
next_char if curr_char == @sep
token
end
end
def bomb(msg)
abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
end
# ==[ Parser ]==
def parse
@rows = []
while row = next_row
@rows << row
count = row.size
@cols = count if count > @cols
@cells += count
end
@rows
end
def next_row
token = next_token or return
row = [token]
row << token while token = next_token
row
end
# ==[ Helpers ]==
# returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
def grok(str)
if idx = str.index(/(#{@quote})|#{@sep}|#{@cr}|#{@lf}/o) #!# FIXME: regex injection?
$1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
else
0
end
end
# output a row
def <<(row)
# drop trailing empty columns
row.pop while row.last.empty? if @drop
s,q = @sep, @quote
out = case @mode
when :compact
case @excel ? 2 : grok(row.join)
when 0
row
when 1
row.map do |col|
col.match?(/#{@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
end
else
row.map do |col|
@excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
case grok(col)
when 0 then col
when 1 then "#{q}#{col}#{q}"
else "#{q}#{col.gsub(q, @esc)}#{q}"
end
end
end
when :full
if @excel
row.map do |col|
col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
end
else
row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
end
end.join(s)
@out << out + @eol
end
def each
@rows ||= parse
@rows.each {|row| yield row }
end
def export(**opts)
out = opts.empty? ? self : self.class.writer(**opts)
each {|row| out << row }
end
def stats
wide = string.size.to_s.size
puts "%#{wide}d rows" % @rows.size
puts "%#{wide}d columns" % @cols
puts "%#{wide}d cells" % @cells
puts "%#{wide}d bytes" % string.size
end
end
if __FILE__ == $0
raw = DATA.read
csv = Censive.new(raw, excel: true, relax: true)
csv.export # (sep: ",", excel: true)
end
__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
Here are some performance numbers for the baseline pure-Ruby versions vs. these methods in native strscan:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
21.08 5.37 2.04 226917 0.01 0.01 Censive#next_char
13.52 6.67 1.31 188985 0.01 0.01 Censive#curr_char
strscan
% cumulative self self total
time seconds seconds calls ms/call ms/call name
9.06 4.85 0.61 226917 0.00 0.00 Censive#next_char
7.39 5.35 0.50 188985 0.00 0.00 Censive#curr_char
nextchar speedup = 2.04 secs / 0.61 secs => 3.34x faster
currchar speedup = 1.31 secs / 0.50 secs => 2.62x faster
Why do you want to compare the current character yourself instead of using the StringScanner API? For example, why do you use `if @char == @quote` (to consume a quoted cell) instead of `if scan(@quote)`?
@kou - There are many reasons. In the case you cited, the value for the current character was previously read, so it's already available as a variable. As we are parsing the input, there are many cases when a character is being checked and does not match, but we don't necessarily want to go back to the scanner and recheck it again if we already have the value.
Another reason is that the `scan` method only takes a regex, and it also consumes its input, etc. The way the scanner is used is highly dependent on the logic in the program. In the case of the `censive` library (the code I pasted above), we have to very carefully control when we invoke the scanner, when we can reuse a previous character, whether or not we advance the pointer, carefully manage where we are in the string, etc.
The goal of PR #54 discussed here is simply to add two new, very helpful methods to `strscan`.
`curr_char` simply returns the current character, which depends on what we've done so far and on method calls such as `scan` or `scan_until`, where the final/current character might not be known. This can easily happen when doing something like this: `s.scan_until(/(?=#{@quote})/o)`. In this case, I'm asking `strscan` to scan up to, but not including, the final character (here, up to a quote). There are many cases where this is extremely helpful, but currently we cannot do this with `strscan`; `curr_char` does it in a performant and simple way.
`next_char` is extremely useful when, for whatever reason, I already know the current character and I just need to advance one more character and return it, in one efficient call. There is currently no simple, highly performant way to do this in `strscan`. This PR #54 solves that in a simple way.
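For readers following along: the existing StringScanner API can approximate these two helpers, though less efficiently than the C implementation proposed in PR #54. A minimal sketch (the names `curr_char`/`next_char` are the proposal's; here they are emulated with `peek` and `getch`):

```ruby
require "strscan"

s = StringScanner.new('ab"c')

# curr_char: the character at the scan position, without advancing
curr = s.peek(1)                 # => "a"

# next_char: advance one character, then look at the new current character
s.getch
succ = s.peek(1)                 # => "b"

# scan up to (but not including) a quote, via a zero-width lookahead;
# afterward the "current character" is the quote itself
upto = s.scan_until(/(?=")/)     # => "b"
```

This emulation may allocate a fresh string on every `peek` call, which is exactly the kind of overhead the proposed C methods avoid.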
I don't think that you need to extract the current character and compare it outside of StringScanner. You can just use `scan`:
#!/usr/bin/env ruby
# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
# Date: Feb 4, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
# https://github.com/ruby/strscan/issues/53 for details
# https://github.com/ruby/strscan/pull/54 for code
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code base with streamlined logic
# 3. Support for most non-compliant CSV variations (eg - @relax, @excel)
#
# TODO:
# 1. Support IO streaming
# 2. Add option to strip whitespace
# 3. Support CSV headers in first row
# ============================================================================
require "bundler/setup"
require "strscan"
class Censive < StringScanner
def self.writer(obj=nil, **opts, &code)
case obj
when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
when IO,nil then new(out: obj, **opts, &code)
else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
end
end
def initialize(str=nil,
drop: false , # drop trailing empty fields?
eol: "\n" , # line endings for exports
excel: false , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
mode: :compact, # export mode: compact or full
out: nil , # output stream, needs to respond to <<
quote: '"' , # quote character
relax: false , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
sep: "," , # column separator character
**opts # grab bag
)
super(str || "")
reset
@drop = drop
@eol = eol
@excel = excel
@mode = mode
@out = out || $stdout
@quote = quote
@relax = relax
@sep = sep
@cr = "\r"
@lf = "\n"
@es = ""
@eq = "="
@esc = (@quote * 2)
end
def reset(str=nil)
self.string = str if str
super()
@rows = nil
@cols = @cells = 0
end
# ==[ Lexer ]==
# pure ruby versions for debugging
#def curr_char; @char = string[pos]; end
#def next_char; scan(/./m); @char = string[pos]; end
def curr_char; @char = currchar; end
def next_char; @char = nextchar; end
def next_token
return nil if eos?
if @excel && scan(@eq)
excel = true
end
if scan(@quote) # consume quoted cell
token = (scan_until(/#{Regexp.escape(@quote)}/o) or bomb "unclosed quote")[0..-2]
scan(@sep)
token
elsif scan(@sep)
@es
elsif scan(@cr)
scan(@lf)
nil
elsif scan(@lf)
nil
else # consume unquoted cell
token = scan_until(/(?=#{@sep}|#{@cr}|#{@lf}|\z)/o) or bomb "unexpected character"
token.prepend(@eq) if excel
scan(@sep)
token
end
end
def bomb(msg)
abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
end
# ==[ Parser ]==
def parse
@rows = []
while row = next_row
@rows << row
count = row.size
@cols = count if count > @cols
@cells += count
end
@rows
end
def next_row
token = next_token or return
row = [token]
row << token while token = next_token
row
end
# ==[ Helpers ]==
# returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
def grok(str)
if idx = str.index(/(#{@quote})|#{@sep}|#{@cr}|#{@lf}/o) #!# FIXME: regex injection?
$1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
else
0
end
end
# output a row
def <<(row)
# drop trailing empty columns
row.pop while row.last.empty? if @drop
s,q = @sep, @quote
out = case @mode
when :compact
case @excel ? 2 : grok(row.join)
when 0
row
when 1
row.map do |col|
col.match?(/#{@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
end
else
row.map do |col|
@excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
case grok(col)
when 0 then col
when 1 then "#{q}#{col}#{q}"
else "#{q}#{col.gsub(q, @esc)}#{q}"
end
end
end
when :full
if @excel
row.map do |col|
col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
end
else
row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
end
end.join(s)
@out << out + @eol
end
def each
@rows ||= parse
@rows.each {|row| yield row }
end
def export(**opts)
out = opts.empty? ? self : self.class.writer(**opts)
each {|row| out << row }
end
def stats
wide = string.size.to_s.size
puts "%#{wide}d rows" % @rows.size
puts "%#{wide}d columns" % @cols
puts "%#{wide}d cells" % @cells
puts "%#{wide}d bytes" % string.size
end
end
if __FILE__ == $0
raw = DATA.read
csv = Censive.new(raw, excel: true, relax: true)
csv.export # (sep: ",", excel: true)
end
__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
The goal of PR #54 discussed here is simply to add two new, very helpful methods to `strscan`.
In general, I don't want to add APIs that don't match recommended usage.
Your suggestions were very helpful! I was able to make some tweaks and now the code is running significantly faster, with just the standard `scan` method from `strscan`. I am really impressed, and the code is actually simpler as well. I have now added the additional code needed for handling common CSV variants, etc. I think the result looks great, and it's much faster and cleaner.
I have a mid-size CSV file that I have been testing against the standard CSV library that comes with Ruby. In that rough test, the default CSV library processes that file in about 12.5 seconds, whereas the latest updates here allow me to process the same file in 4.67 seconds. Thank you so much for the idea on how to do this better. I'll close the pull request, since it wasn't needed.
Now that I see how helpful the raw `scan` method is, there is one potential PR that could be helpful. I didn't realize before that the `scan` method can accept a string argument (I thought it only took regex values). When you pass a string to `scan`, it uses an optimization to make the processing very quick. However, the `scan_until` method does not allow the same optimization, since it only accepts regex values. We could potentially benefit from a `scan_until` method that also accepts a string and has an optimized path for processing strings. It would also bring `scan_until` into closer alignment with the `scan` method (ie - why should `scan` accept a string or a regex, when `scan_until` can only accept a regex?).
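To make the proposal concrete, here is a rough sketch of what a string fast path for `scan_until` amounts to, written as a standalone helper (`scan_until_string` is a hypothetical name, not a strscan method): locate the literal with `String#index` instead of compiling and matching a regex, then move the scanner past it.

```ruby
require "strscan"

# Hypothetical helper sketching the proposed string fast path for scan_until:
# find the literal with String#index (no regex compilation or matching),
# then advance the scanner to just past it, returning the consumed text.
def scan_until_string(scanner, literal)
  idx = scanner.string.index(literal, scanner.pos) or return nil
  stop = idx + literal.length
  token = scanner.string[scanner.pos...stop]
  scanner.pos = stop
  token
end

s = StringScanner.new("aaa,bbb,ccc")
first  = scan_until_string(s, ",")   # => "aaa,"
second = scan_until_string(s, ",")   # => "bbb,"
rest   = s.rest                      # => "ccc"
```

A native implementation could use a memchr-style search internally, which is the optimization `scan` already enjoys for string patterns.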
Again, thank you for your suggestions!
Here's the updated code, with your suggestions. I still have a few things to tweak, but I'm almost there. :-)
#!/usr/bin/env ruby
# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
# Date: Feb 5, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
#
# Thanks to Sutou Kouhei (kou) for his excellent advice on using scan
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code with streamlined and optimized logic
# 3. Support most non-compliant CSV variations (eg - @excel, @relax, @strip)
#
# TODO: Support IO streaming
# ============================================================================
require "strscan"
class Censive < StringScanner
def self.writer(obj=nil, **opts, &code)
case obj
when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
when IO,nil then new(out: obj, **opts, &code)
else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
end
end
def initialize(str=nil,
drop: false , # drop trailing empty fields?
excel: false , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
mode: :compact, # export mode: compact or full
out: $stdout , # output stream, needs to respond to <<
quote: '"' , # quote character
relax: false , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
rowsep: "\n" , # row separator for export
sep: "," , # column separator character
strip: false , # strip fields when reading
**opts # grab bag
)
super(str || "")
reset
# options
@drop = drop
@excel = excel
@mode = mode
@out = out
@quote = quote
@relax = relax
@rowsep = rowsep
@sep = sep
@strip = strip
# determined
@cr = "\r"
@lf = "\n"
@es = ""
@eq = "="
@esc = (@quote * 2)
@eol = /#{@cr}#{@lf}?|#{@lf}|\z/o # end of line
@eoc = /(?=#{"\\" + @sep}|#{@cr}|#{@lf}|\z)/o # end of cell
end
def reset(str=nil)
self.string = str if str
super()
@rows = nil
@cols = @cells = 0
end
# ==[ Lexer ]==
def next_token
excel = true if @excel && scan(@eq)
if scan(@quote) # consume quoted cell
token = ""
while true
token << (scan_until(/#{@quote}/o) or bomb "unclosed quote")[0..-2]
token << @quote and next if scan(@quote)
break if scan(@eoc)
@relax or bomb "invalid character after quote"
token << @quote + (scan_until(/#{@quote}/o) or bomb "bad inline quote")
end
elsif scan(@sep) then return @es
elsif scan(@eol) then return nil
else # consume unquoted cell
token = scan_until(@eoc) or bomb "unexpected character"
token.prepend(@eq) if excel
end
scan(@sep)
@strip ? token.strip : token
end
def bomb(msg)
abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
end
# ==[ Parser ]==
def parse
@rows = []
while row = next_row
@rows << row
count = row.size
@cols = count if count > @cols
@cells += count
end
@rows
end
def next_row
token = next_token or return
row = [token]
row << token while token = next_token
row
end
# ==[ Helpers ]==
# returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
def grok(str)
if idx = str.index(/(#{@quote})|#{"\\"+@sep}|#{@cr}|#{@lf}/o)
$1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
else
0
end
end
# output a row
def <<(row)
# drop trailing empty columns
row.pop while row.last.empty? if @drop
s,q = @sep, @quote
out = case @mode
when :compact
case @excel ? 2 : grok(row.join)
when 0
row
when 1
row.map do |col|
col.match?(/#{"\\"+@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
end
else
row.map do |col|
@excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
case grok(col)
when 0 then col
when 1 then "#{q}#{col}#{q}"
else "#{q}#{col.gsub(q, @esc)}#{q}"
end
end
end
when :full
if @excel
row.map do |col|
col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
end
else
row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
end
end.join(s)
@out << out + @rowsep
end
def each
@rows ||= parse
@rows.each {|row| yield row }
end
def export(**opts)
out = opts.empty? ? self : self.class.writer(**opts)
each {|row| out << row }
end
def stats
wide = string.size.to_s.size
puts "%#{wide}d rows" % @rows.size
puts "%#{wide}d columns" % @cols
puts "%#{wide}d cells" % @cells
puts "%#{wide}d bytes" % string.size
end
end
if __FILE__ == $0
raw = DATA.read
csv = Censive.new(raw, excel: true, relax: true)
csv.export(excel: true)
end
__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
That's good to hear.
I have a mid-size CSV file that I have been testing against the standard CSV library that comes with Ruby. In that rough test, the default CSV library processes that file in about 12.5 seconds, whereas the latest updates here allow me to process the same file in 4.67 seconds.
Could you share the benchmark code? I want to improve https://github.com/ruby/csv performance as a maintainer.
However, the `scan_until` method does not allow the same optimization for strings, since it only accepts regex values. We could potentially benefit from a `scan_until` method that also accepts a string and has an optimized path for processing strings. It would also bring `scan_until` into closer alignment with the `scan` method (ie - why should `scan` accept a string or a regex, when `scan_until` can only accept a regex?).
The proposal is acceptable. Could you open a new issue for it?
FYI: CSV parser in https://rubygems.org/gems/red-arrow is the fastest CSV parser for Ruby:
require "arrow"
Arrow::Table.load("XXX.csv").raw_records
See also: Better CSV processing with Ruby 2.6: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2019/
`censive` now 1.52x faster than `csv` for the KEN_ALL.CSV parsing:
@kou - Thank you for sending those links, they were very helpful. Following the example in your presentation, here are the results of some benchmarks. For KEN_ALL.CSV, the following shows that `censive` (with your help!) is 1.54x faster than `csv`:
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
censive
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
1.038 i/s - 10.000 times in 9.630566s (963.06ms/i)
csv
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
0.674 i/s - 10.000 times in 14.830379s (1.48s/i)
Comparison:
censive: 1.0 i/s
csv: 0.7 i/s - 1.54x slower
Here is the censive.yaml benchmark configuration file:
loop_count: 10
prelude: |
require "digest/md5"
require "csv"
require "censive"
data = File.open("/tmp/KEN_ALL.CSV", "r:cp932") {|f| f.read }; nil
$stderr.puts
benchmark:
censive: rows = CSV.parse(data) ; $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
csv: rows = Censive.new(data).parse ; $stderr.puts " csv => " + Digest::MD5.hexdigest(rows.join)
For reference, the data file is:
https://www.post.japanpost.jp/zipcode/dl/oogaki/zip/ken_all.zip
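The yaml above is in the format used by the benchmark_driver gem (run with something like `benchmark-driver censive.yaml`). For a quick dependency-free sanity check, the same kind of comparison can be sketched with stdlib `Benchmark` and inline stand-in data instead of KEN_ALL.CSV:

```ruby
require "benchmark"
require "csv"

# Small stand-in dataset; the real benchmark reads KEN_ALL.CSV in cp932.
data = ([%w[a b c].join(",")] * 1_000).join("\n")

n = 5
t_csv = Benchmark.realtime { n.times { CSV.parse(data) } }
# A second parser would be timed identically, e.g.:
#   t_censive = Benchmark.realtime { n.times { Censive.parse(data) } }

rows = CSV.parse(data)
```

The MD5 digest of the joined rows (as in the yaml above) is a handy way to confirm both parsers produce identical output before comparing their timings.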
It would be nice to run the entire benchmark suite against `censive` also. I believe it may perform even better on complex (or liberal) parsing. And the proposed `scan_until` method that accepts a string would make this even faster.
I tried to benchmark against `red-arrow`, but it failed to compile on my Apple Silicon iMac (with the M1 chip). Here's the message:
compiling arrow.cpp
compiling converters.cpp
compiling memory-view.cpp
compiling raw-records.cpp
compiling values.cpp
linking shared-object arrow.bundle
Undefined symbols for architecture arm64:
"_rbgerr_gerror2exception", referenced from:
red_arrow::check_status(arrow::Status const&&, char const*) in converters.o
red_arrow::check_status(arrow::Status const&&, char const*) in raw-records.o
red_arrow::check_status(arrow::Status const&&, char const*) in values.o
"_rbgobj_instance_from_ruby_object", referenced from:
red_arrow::memory_view::primitive_array_get(unsigned long, rb_memory_view_t*, int) in memory-view.o
red_arrow::memory_view::buffer_get(unsigned long, rb_memory_view_t*, int) in memory-view.o
red_arrow::record_batch_raw_records(unsigned long) in raw-records.o
red_arrow::table_raw_records(unsigned long) in raw-records.o
red_arrow::array_values(unsigned long) in values.o
red_arrow::chunked_array_values(unsigned long) in values.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [arrow.bundle] Error 1
make failed, exit code 2
I ran another test to see if I could speed up `censive` further, and it shows that it is actually 1.52x faster than `csv` when processing the KEN_ALL.CSV file. Here are the benchmark results:
csv:
ruby/csv/benchmark> ruby -rprofile test-csv.rb 2>&1 | head -25
% cumulative self self total
time seconds seconds calls ms/call ms/call name
72.31 49.93 49.93 124566 0.40 1.09 CSV::Parser#parse_quotable_loose
3.24 52.16 2.24 3736999 0.00 0.00 Array#[]
1.86 53.44 1.28 2117652 0.00 0.00 Integer#+
1.75 54.66 1.21 1993040 0.00 0.00 String#empty?
1.70 55.83 1.17 1993041 0.00 0.00 Integer#<
1.68 56.99 1.16 1868475 0.00 0.00 CSV::Parser#validate_field_size
1.68 58.15 1.16 1868475 0.00 0.00 Array#<<
1.66 59.30 1.15 124566 0.01 1.10 CSV::Parser::Scanner#each_line
1.66 60.45 1.15 1868475 0.00 0.00 String#count
1.65 61.58 1.14 1868475 0.00 0.00 Integer#zero?
1.03 62.29 0.71 996523 0.00 0.00 String#[]
0.95 62.95 0.65 124566 0.01 0.57 Enumerator#next
0.92 63.58 0.63 996520 0.00 0.00 String#start_with?
0.91 64.21 0.63 996520 0.00 0.00 String#end_with?
0.91 64.84 0.63 124565 0.01 0.01 CSV::FieldsConverter#need_convert?
0.89 65.46 0.61 996549 0.00 0.00 Integer#==
0.89 66.07 0.61 996520 0.00 0.00 Array#[]=
0.87 66.67 0.60 124566 0.00 0.01 CSV::Parser::Scanner#keep_start
0.49 67.01 0.34 124566 0.00 0.00 CSV::Parser::Scanner#keep_drop
0.49 67.35 0.34 124565 0.00 0.01 CSV::FieldsConverter#convert
0.41 67.63 0.28 1 283.44 68877.49 String#each_line
0.33 67.86 0.23 124570 0.00 0.00 String#split
0.23 68.02 0.16 249130 0.00 0.00 String#include?
...
censive:
ruby/csv/benchmark> ruby -rprofile test-censive.rb 2>&1 | head -25
% cumulative self self total
time seconds seconds calls ms/call ms/call name
57.82 26.40 26.40 1993041 0.01 0.02 Censive#next_token
17.94 34.60 8.19 124566 0.07 0.35 Censive#next_row
11.83 40.00 5.40 7847598 0.00 0.00 StringScanner#scan
3.08 41.40 1.41 1868475 0.00 0.00 StringScanner#scan_until
3.03 42.79 1.38 1 1381.46 45622.85 Censive#parse
2.55 43.95 1.16 1868475 0.00 0.00 Array#<<
1.65 44.70 0.76 996521 0.00 0.00 String#[]
1.52 45.40 0.69 996520 0.00 0.00 String#<<
0.17 45.48 0.08 124573 0.00 0.00 Integer#>
0.16 45.55 0.07 124571 0.00 0.00 Integer#+
0.16 45.62 0.07 124576 0.00 0.00 Array#size
0.08 45.66 0.04 2 18.01 18.01 IO#read
0.00 45.66 0.00 22 0.07 0.55 Kernel#require
0.00 45.66 0.00 92 0.01 0.03 Gem::BasicSpecification#have_file?
0.00 45.66 0.00 1 0.31 0.31 TracePoint#enable
0.00 45.66 0.00 38 0.01 0.01 Array#empty?
0.00 45.66 0.00 4 0.05 0.12 Gem::Version#<=>
0.00 45.66 0.00 41 0.00 0.04 Array#any?
0.00 45.66 0.00 18 0.01 0.01 Gem::StubSpecification#missing_extensions?
0.00 45.66 0.00 30 0.00 0.01 Gem::StubSpecification#extensions
0.00 45.66 0.00 18 0.01 0.12 Gem::BasicSpecification#contains_requirable_file?
0.00 45.66 0.00 47 0.00 0.00 Kernel#tap
0.00 45.66 0.00 18 0.01 0.03 Gem::BasicSpecification#have_extensions?
...
Here are the two scripts used:
test-csv.rb:
#!/usr/bin/env ruby
require "csv"
require "digest/md5"
data = File.open("KEN_ALL.CSV", "r:cp932").read
rows = CSV.parse(data)
puts Digest::MD5.hexdigest(rows.join)
test-censive.rb:
#!/usr/bin/env ruby
require "censive"
require "digest/md5"
data = File.open("KEN_ALL.CSV", "r:cp932").read
rows = Censive.parse(data)
puts Digest::MD5.hexdigest(rows.join)
I did three runs of the above and averaged the run times, giving a 1.52x speedup for `censive` over `csv`:
censive = ([45.490, 45.605, 45.273].sum / 3.0).round(2) # => 45.46
csv     = ([68.882, 69.357, 68.920].sum / 3.0).round(2) # => 69.05
(csv / censive).round(2) # => 1.52
Maybe the `ruby -rprofile script.rb` method isn't very good for comparing benchmark numbers? It gives different results than the benchmark scripts do!?
benchmark:
censive: rows = CSV.parse(data) ; $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
csv: rows = Censive.new(data).parse ; $stderr.puts " csv => " + Digest::MD5.hexdigest(rows.join)
Inverted?
benchmark:
csv: rows = CSV.parse(data) ; $stderr.puts " csv => " + Digest::MD5.hexdigest(rows.join)
censive: rows = Censive.new(data).parse ; $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
I tried to benchmark against red-arrow, but it failed to compile on my Apple Silicon iMac (with the M1 chip).
Ah, this was fixed in red-arrow 11.0.0 that will be published soon. See also: https://github.com/apache/arrow/pull/14960
@kou - Yes, I inverted the initial benchmark tests. Then, I ran `ruby -rprofile ...` and got a result that showed `censive` was faster. But I think the `ruby -rprofile ...` method is really only helpful for testing a single program when working to optimize it. It's not very good for comparing across different programs, the way the `benchmark` utility is. When I ran the `benchmark` utility, I did see that `censive` was slower by a factor of 1.5x or so.
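One way to see why per-call profiling skews cross-program comparisons: tracing adds a roughly fixed cost per call, so the program that makes more method and block calls pays a larger penalty. The sketch below uses a no-op TracePoint as a stand-in for what a tracing profiler does:

```ruby
require "benchmark"

# The same workload, timed bare and under call tracing. The traced run
# pays a per-call penalty, so two programs with different call counts
# are distorted by different amounts under -rprofile.
def work
  10_000.times.sum { |i| i }
end

bare = Benchmark.realtime { 50.times { work } }

tracer = TracePoint.new(:call, :b_call) { }   # no-op handler, overhead only
traced = tracer.enable { Benchmark.realtime { 50.times { work } } }
```

This is why wall-clock tools like benchmark-driver are the better instrument for head-to-head comparisons, while profilers are better for finding hot spots within one program.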
@kou - However... I did some interesting work, based on some ideas from your presentation, and have made some huge improvements. So far, the numbers for the KEN_ALL.CSV file haven't changed that much. Here's my benchmark config file:
loop_count: 10
prelude: |
require "digest/md5"
base = "/tmp"
require "csv"
require File.join(base, "censive")
file, mode = "KEN_ALL.CSV", "r:cp932"
path = File.join(base, file)
data = File.open(path, *mode) {|f| f.read }; nil
$stderr.puts
benchmark:
csv: rows = CSV .parse(data); $stderr.puts " csv => " + Digest::MD5.hexdigest(rows.join)
censive: rows = Censive.parse(data); $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
For the KEN_ALL.CSV data file, I see this:
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
csv
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
1.038 i/s - 10.000 times in 9.636874s (963.69ms/i)
censive
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
0.699 i/s - 10.000 times in 14.299854s (1.43s/i)
Comparison:
csv: 1.0 i/s
censive: 0.7 i/s - 1.48x slower
And for my large customer data file, I see this:
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
csv
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
0.081 i/s - 10.000 times in 123.819859s (12.38s/i)
censive
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
0.160 i/s - 10.000 times in 62.515838s (6.25s/i)
Comparison:
censive: 0.2 i/s
csv: 0.1 i/s - 1.98x slower
It seems that there is still a lot of variability based on the specifics of the files. I will do some work on the KEN_ALL.CSV file to see if I can speed up `censive`. But, for my large customer data file, `censive` is 1.98x faster.
All of this is before doing the work on the `scan_until` method that will be able to accept a string. I think that should give an additional speedup.
Here's an updated `censive` based on the above. Look at the section commented under `# consume unquoted cell(s)`. It now tries to read the whole rest of the line. If it doesn't see any quotes, it simply splits on `@sep`. If it finds embedded quotes, it processes them in place, or backs off to the normal processing if needed. Note that, in this case, we could get "a token" that is actually something like "Alice,Bob,Charlie". In that case, we simply call `split(@sep)` and return an array of tokens from the `next_token` method. In the `next_row` method, we happily deal with either an individual token like `Alice` or an array such as `["Alice", "Bob", "Charlie"]` by using Ruby's splat operator. This means that instead of `row << token`, we have `row.push(*token)`. This approach gives us a massive speed boost. One test originally ran in about 12 seconds. I got it down to about 6.5 seconds using `currchar` and `nextchar`. With @kou's help and using the `scan` method, I got it down to about 4.5 seconds. With this latest work, that same code now runs in 0.7 seconds.
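The split-and-splat idea above can be sketched in a few lines (hypothetical names, not the actual censive code): a token may be either a single String or an Array of fields, and the row builder handles both uniformly with the splat operator.

```ruby
# Sketch of the split-and-splat technique: push(*token) flattens an
# array token into individual fields, but appends a scalar token as-is.
row = []
tokens = ["Alice,Bob,Charlie".split(",", -1), "Dave"] # array token, then scalar token
tokens.each { |token| row.push(*token) }
row # => ["Alice", "Bob", "Charlie", "Dave"]
```

This is why `next_token` is free to return a whole run of unquoted cells at once: the caller doesn't need to know whether it got one field or many.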
Anyway, here's the updated version of `censive` using all the above ideas and work. It still needs to be cleaned up a little, but it's coming along nicely! This will also be a good benchmark to test against when we have a version of `scan_until` that accepts strings.
#!/usr/bin/env ruby
# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
# Date: Feb 5, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
#
# Thanks to Sutou Kouhei (kou) for his excellent advice on using scan
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code with streamlined and optimized logic
# 3. Support most non-compliant CSV variations (eg - @excel, @relax, @strip)
#
# TODO: Support IO streaming
# ============================================================================
require "strscan"
class Censive < StringScanner
def self.parse(...)
new(...).parse
end
def self.writer(obj=nil, **opts, &code)
case obj
when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
when IO,nil then new(out: obj, **opts, &code)
else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
end
end
def initialize(str="",
drop: false , # drop trailing empty fields?
encoding: "utf-8" , # character encoding
excel: false , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
mode: :compact, # export mode: compact or full
out: $stdout , # output stream, needs to respond to <<
quote: '"' , # quote character
relax: false , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
rowsep: "\n" , # row separator for export
sep: "," , # column separator character
strip: false , # strip fields when reading
**opts # grab bag
)
# data source
str = File.open(str, "r:#{encoding}").read if !str[100] && File.readable?(str)
super(str)
reset
# options
@drop = drop
@excel = excel
@mode = mode
@out = out
@quote = quote
@relax = relax
@rowsep = rowsep
@sep = sep
@strip = strip
# definitions
@cr = "\r"
@lf = "\n"
@es = ""
@eq = "="
@esc = (@quote * 2)
@eol = /#{@cr}#{@lf}?|#{@lf}|\z/o # end of line
@eoc = /(?=#{"\\" + @sep}|#{@cr}|#{@lf}|\z)/o # end of cell
end
def reset(str=nil)
self.string = str if str
super()
@rows = nil
@cols = @cells = 0
end
# ==[ Lexer ]==
def next_token
excel = true if @excel && scan(@eq)
# consume quoted cell
if scan(@quote)
token = ""
while true
token << (scan_until(/#{@quote}/o) or bomb "unclosed quote")[0..-2]
token << @quote and next if scan(@quote)
break if scan(@eoc)
@relax or bomb "invalid character after quote"
token << @quote + (scan_until(/#{@quote}/o) or bomb "bad inline quote")
end
scan(@sep)
return @strip ? token.strip : token
end
# consume unquoted cell(s)
if text = scan(/[^#{@cr}#{@lf}#{@sep}#{@quote}]+[^#{@quote}#{@cr}#{@lf}]*/o)
#!# TODO: still need to add support for @strip for this section
return text .split(@sep, -1) if !check(@quote)
return text[0..-2].split(@sep, -1) if text[-1] == @sep
# normal processing
token = text + scan_until(@eoc) or bomb "unexpected character"
token.prepend(@eq) if excel
scan(@sep)
return token.split(@sep, -1)
end
# other tokens
scan(@sep) and return @es
scan(@eol)
nil
end
def bomb(msg)
abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
end
# ==[ Parser ]==
def parse
@rows = []
while row = next_row
@rows << row
count = row.size
@cols = count if count > @cols
@cells += count
end
@rows
end
def next_row
token = next_token or return
row = [*token]
row.push(*token) while token = next_token
row
end
# ==[ Helpers ]==
# returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
def grok(str)
if idx = str.index(/(#{@quote})|#{"\\"+@sep}|#{@cr}|#{@lf}/o)
$1 ? 2 : str.index(/#{@quote}/o, idx) ? 2 : 1
else
0
end
end
# output a row
def <<(row)
# drop trailing empty columns
row.pop while row.last.empty? if @drop
s,q = @sep, @quote
out = case @mode
when :compact
case @excel ? 2 : grok(row.join)
when 0
row
when 1
row.map do |col|
col.match?(/#{"\\"+@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
end
else
row.map do |col|
@excel && col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" :
case grok(col)
when 0 then col
when 1 then "#{q}#{col}#{q}"
else "#{q}#{col.gsub(q, @esc)}#{q}"
end
end
end
when :full
if @excel
row.map do |col|
col =~ /\A0\d*\z/ ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
end
else
row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
end
end.join(s)
@out << out + @rowsep
end
def each
@rows ||= parse
@rows.each {|row| yield row }
end
def export(**opts)
out = opts.empty? ? self : self.class.writer(**opts)
each {|row| out << row }
end
def stats
wide = string.size.to_s.size
puts "%#{wide}d rows" % @rows.size
puts "%#{wide}d columns" % @cols
puts "%#{wide}d cells" % @cells
puts "%#{wide}d bytes" % string.size
end
end
if __FILE__ == $0
raw = DATA.read
csv = Censive.new(raw, excel: true, relax: true)
csv.export # (sep: ",", excel: true)
end
__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"
@kou - I think perhaps my processing of the `KEN_ALL.CSV` file is slower because I may have different encodings, which is causing a slowdown?
I think that it's not related. We can use UTF-8 for the `KEN_ALL.CSV` data with the following configuration, but the result doesn't differ:
loop_count: 10
prelude: |
require "csv"
require "censive"
data = File.open("/tmp/KEN_ALL.CSV", "r:cp932:utf-8") {|f| f.read }
benchmark:
csv: CSV.parse(data)
censive: Censive.new(data).parse
$ ruby -v -S benchmark-driver benchmark/censive.yaml
ruby 3.3.0dev (2023-01-25T01:52:37Z master e82cef1762) [x86_64-linux]
Calculating -------------------------------------
csv 0.731 i/s - 10.000 times in 13.687457s (1.37s/i)
censive 0.398 i/s - 10.000 times in 25.103740s (2.51s/i)
Comparison:
csv: 0.7 i/s
censive: 0.4 i/s - 1.83x slower
@kou - I was working on `scan_until(string)`, but the methods I need are these:
pos = rb_str_index(p->str, pattern, p->curr); /* pos is byte index */
pos = rb_str_sublen(p->str, pos); /* pos in now char index */
I can see and call `rb_str_sublen`, but I can't access `rb_str_index`. Do I need to pull in an additional header file or something like that?
Could you open a new issue for `scan_until(String)`?
@kou - Yes, I'll create an issue and will have some initial code to upload also.
@kou - I worked on several optimizations and found the main one that the CSV library uses, which was fairly easy to implement and a little faster too. Here's the latest update on the `KEN_ALL.CSV` data:
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
csv
File: "KEN_ALL.CSV"
Size: 12337910 bytes
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
csv => 3bda79d8a1d5b062bae6bae9dee3db63
1.051 i/s - 5.000 times in 4.758448s (951.69ms/i)
censive
File: "KEN_ALL.CSV"
Size: 12337910 bytes
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
censive => 3bda79d8a1d5b062bae6bae9dee3db63
1.288 i/s - 5.000 times in 3.882148s (776.43ms/i)
Comparison:
censive: 1.3 i/s
csv: 1.1 i/s - 1.23x slower
And here is the YAML file for the benchmark:
loop_count: 5
prelude: |
require "digest/md5"
base = "/tmp/ruby/csv/benchmark"
require "csv"
require File.join(base, "censive")
name = "KEN_ALL.CSV"
mode = "r:cp932"
file = ENV["csvfile"] || name
mode = "r" unless file == name
path = File.join(base, file)
data = File.open(path, *mode) {|f| f.read }; nil
$stderr.puts "\n\nFile: #{name.inspect}\nSize: #{File.stat(path).size} bytes\n\n"
benchmark:
csv: rows = CSV .parse(data); $stderr.puts " csv => " + Digest::MD5.hexdigest(rows.join)
censive: rows = Censive.parse(data); $stderr.puts "censive => " + Digest::MD5.hexdigest(rows.join)
censive
#!/usr/bin/env ruby
# ============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
# Date: Feb 9, 2023
#
# https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
#
# Thanks to Sutou Kouhei (kou) for his excellent advice on using scan
# ============================================================================
# GOALS:
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code with streamlined and optimized logic
# 3. Support most non-compliant CSV variations (@excel, @relax, etc)
# 4. Support most commonly used CSV options (@sep, @quote, @strip, @drop, etc)
#
# TODO:
# 1. Support IO streaming
# 2. Review all encodings, we may be losing speed when mixing encodings
# 3. Huge speedup possible if our @unquoted regex reads beyond @cr?@lf's
# 4. Will using String#freeze give us a speed up?
# 5. Implement support for scan_until(string) <= right now only regex is valid
# ============================================================================
require "strscan"
class Censive < StringScanner
attr :encoding
def self.parse(...)
new(...).parse
end
def self.writer(obj=nil, **opts, &code)
case obj
when String then File.open(obj, "w") {|io| yield new(out: io, **opts, &code) }
when IO,nil then new(out: obj, **opts, &code)
else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
end
end
def initialize(str=nil,
drop: false , # drop trailing empty columns?
encoding: nil , # character encoding
excel: false , # literals ="01" formulas =A1 + B2 http://bit.ly/3Y7jIvc
mode: :compact, # export mode: compact or full
out: nil , # output stream, needs to respond to <<
quote: '"' , # quote character
relax: false , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
rowsep: "\n" , # row separator for export
sep: "," , # column separator character
strip: false , # strip columns when reading
**opts # grab bag
)
# initialize data source
if str && str.size < 100 && File.readable?(str)
str = File.open(str, encoding ? "r:#{encoding}" : "r").read
else
str ||= ""
str = str.encode(encoding) if encoding
end
super(str)
reset
# config options
@cheat = true
@drop = drop
@encoding = str.encoding
@excel = excel
@mode = mode
@out = out || $stdout
@relax = relax
@strip = strip
# config strings
@quote = quote
@rowsep = rowsep
@sep = sep
# static strings
@cr = "\r"
@lf = "\n"
@es = ""
@eq = "="
# combinations
@esc = (@quote * 2)
@seq = [@sep, @eq].join # used for parsing in excel mode
#!# TODO: come up with a clean way to escape/encode all this
#!# TODO: maybe define @tokens = "#{@quote}#{@sep}#{@cr}#{@lf}", etc.
# regexes
@eoc = /(?=#{"\\" + @sep}|#{@cr}|#{@lf}|\z)/o # end of cell
@eol = /#{@cr}#{@lf}?|#{@lf}/o # end of line
@escapes = /(#{@quote})|#{"\\"+@sep}|#{@cr}|#{@lf}/o
@quotable = /#{"\\"+@sep}|#{@cr}|#{@lf}/o
@quotes = /#{@quote}/o
@seps = /#{@sep}+/o
@quoted = @excel ? /(?:=)?#{@quote}/o : @quote
@unquoted = /[^#{@sep}#{@cr}#{@lf}][^#{@quote}#{@cr}#{@lf}]*/o
@leadzero = /\A0\d*\z/
end
def reset(str=nil)
@rows = nil
@cols = @cells = 0
#!# TODO: reset all encodings?
self.string = str if str
@encoding = string.encoding
super()
end
# ==[ Parser ]==
def parse
@rows = []
while row = next_row
@rows << row
count = row.size
@cols = count if count > @cols
@cells += count
end
@rows
end
def next_row
if @cheat and line = scan_until(@eol)
row = line.chomp!.split(@sep, -1)
row.each do |col|
next if (saw = col.count(@quote)).zero?
next if (saw == 2) && col.delete_prefix!(@quote) && col.delete_suffix!(@quote)
@cheat = false
break
end if line.include?(@quote)
@cheat and return @strip ? row.each(&:strip!) : row
unscan
end
token = next_token or return
row = []
row.push(*token)
row.push(*token) while token = next_token
row
end
def next_token
if scan(@quoted) # quoted cell
token = ""
while true
token << (scan_until(@quotes) or bomb "unclosed quote")[0..-2]
token << @quote and next if scan(@quote)
scan(@eoc) and break
@relax or bomb "invalid character after quote"
token << @quote + (scan_until(@quotes) or bomb "bad inline quote")
end
scan(@sep)
@strip ? token.strip : token
elsif match = scan(@unquoted) # unquoted cell(s)
if check(@quote) && !match.chomp!(@sep) # if we see a stray quote
unless @excel && match.chomp!(@seq) # unless an excel literal, fix it
match << (scan_until(@eoc) or bomb "stray quote")
scan(@sep)
end
end
tokens = match.split(@sep, -1)
@strip ? tokens.map!(&:strip) : tokens
elsif scan(@sep)
match = scan(@seps)
match ? match.split(@sep, -1) : @es
else
scan(@eol)
nil
end
end
def each
@rows ||= parse
@rows.each {|row| yield row }
end
def export(**opts)
out = opts.empty? ? self : self.class.writer(**opts)
each {|row| out << row }
end
# ==[ Helpers ]==
# returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
def grok(str)
if idx = str.index(@escapes)
$1 ? 2 : str.index(@quotes, idx) ? 2 : 1
else
0
end
end
# output a row
def <<(row)
# drop trailing empty columns
row.pop while row.last.empty? if @drop
s,q = @sep, @quote
out = case @mode
when :compact
case @excel ? 2 : grok(row.join)
when 0
row
when 1
row.map do |col|
col.match?(@quotable) ? "#{q}#{col}#{q}" : col
end
else
row.map do |col|
@excel && col =~ @leadzero ? "=#{q}#{col}#{q}" :
case grok(col)
when 0 then col
when 1 then "#{q}#{col}#{q}"
else "#{q}#{col.gsub(q, @esc)}#{q}"
end
end
end
when :full
if @excel
row.map do |col|
col =~ @leadzero ? "=#{q}#{col}#{q}" : "#{q}#{col.gsub(q, @esc)}#{q}"
end
else
row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
end
end.join(s)
@out << out + @rowsep
end
def stats
wide = string.size.to_s.size
puts "%#{wide}d rows" % @rows.size
puts "%#{wide}d columns" % @cols
puts "%#{wide}d cells" % @cells
puts "%#{wide}d bytes" % string.size
end
def bomb(msg)
abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
end
end
if __FILE__ == $0
raw = DATA.gets("\n\n").chomp
# raw = File.read(ARGV.first || "lc-2023.csv")
# raw = File.open("KEN_ALL.CSV", "r:cp932").read
csv = Censive.new(raw, excel: true, relax: true)
csv.export # (excel: true) # sep: "|")
end
__END__
"Don",="007",10,"Ed"
Name,Age,,,Shoe,,,
"Alice",27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
Subtotal,=sum(B2:B5),="01234"
A,B,C,D
A,B,"C",D
A,B,C",D
A,B,"C",D
# first line works in "relax" mode, bottom line is compliant
123,"CHO, JOELLE "JOJO"",456
123,"CHO, JOELLE ""JOJO""",456
# Excel mode checking
=,=x,x=,="x",="","","=",123,0123,="123",="0123"
,=x,x=,x,,,,,,=,,123,="0123",123,,="0123" # <= a little off
Here's another one with a larger dataset for some customer files:
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
Calculating -------------------------------------
csv
File: "5.csv"
Size: 143935739 bytes
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
csv => 56f74997e8d2ab994660af6a7f803532
0.078 i/s - 5.000 times in 64.303857s (12.86s/i)
censive
File: "5.csv"
Size: 143935739 bytes
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
censive => 56f74997e8d2ab994660af6a7f803532
0.155 i/s - 5.000 times in 32.195004s (6.44s/i)
Comparison:
censive: 0.2 i/s
csv: 0.1 i/s - 2.00x slower
Sorry to paste everything here in this issue, but I just need a place to document the progress. I'll get to `strscan#scan_until(string)` next.
Regarding CSV in general, perhaps I should comment over on that repo, but I've done a lot of work looking into alternative implementations. For example, the `csv-core` crate from Rust is really interesting. It's implemented as an NFA that is then mapped to a DFA, which gives very fast performance. I'll make some notes and create an issue in the proper repo for that.
CSV state machine, largely inspired by the `csv-core` library on GitHub. The logic for `censive` is not exactly as above, but this is a very helpful reference.
digraph finite_state_machine {
rankdir=LR;
node [fontname="Helvetica,Arial,sans-serif", shape=circle, style=filled, fillcolor="#dddddd"];
edge [fontname="Helvetica,Arial,sans-serif"]
1 [label="1: StartRow"];
2 [label="2: InComment"];
3 [label="3: StartColumn", shape=doublecircle, fillcolor="#ffdddd"];
4 [label="4: InQuotedColumn"];
5 [label="5: InDoubleEscapedQuote"];
6 [label="6: InEscapedQuote"];
7 [label="7: InColumn"];
8 [label="8: EndColumnSeparator"];
9 [label="9: EndColumnRow", shape=doublecircle, fillcolor="#ffdddd"];
10 [label="10: InRowEnd", shape=doublecircle, fillcolor="#ffdddd"];
11 [label="11: CRLF"];
12 [label="12: EndRow"];
1 -> 1 [label="eol / discard"];
1 -> 2 [label="comment / discard"];
1 -> 3 [label="* / ε"];
2 -> 1 [label="LF / discard"];
2 -> 2 [label="* / discard"];
3 -> 4 [label="quote & @quoting / discard"];
3 -> 7 [label="* / copyout"];
3 -> 8 [label="sep / discard"];
3 -> 9 [label="eol / ε"]
4 -> 4 [label="* / copyout"];
4 -> 5 [label="quote & @quoting / discard"];
4 -> 6 [label="esc & @quoting / discard"];
5 -> 4 [label="quote & @quoting & @double-quote / copyout"];
5 -> 7 [label="* / copyout"];
5 -> 8 [label="sep / discard"];
5 -> 9 [label="eol / ε"]
6 -> 4 [label="* / copyout"];
7 -> 7 [label="* / copyout"];
7 -> 8 [label="sep / discard"];
7 -> 9 [label="eol / ε"]
8 -> 3 [label="* / ε"];
9 -> 10 [label="* / ε"];
10 -> 11 [label="CR & @isCRLF / discard"];
10 -> 12 [label="* / discard"];
11 -> 1 [label="* / ε"];
11 -> 1 [label="LF / discard"];
12 -> 1 [label="* / ε"];
}
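As a rough illustration, here is the core of that state machine rendered in drastically simplified Ruby. This is a sketch only (the function name is made up, and it covers just four of the twelve states): quoted columns and doubled `""` quotes are handled, while comments, CR/LF pairs, and the backslash-escape states are ignored.

```ruby
# Minimal character-driven CSV state machine, loosely following states
# 3 (StartColumn), 7 (InColumn), 4 (InQuotedColumn), and 5
# (InDoubleEscapedQuote) from the diagram above.
def parse_csv_sm(input, sep: ",", quote: '"')
  rows, row, col = [], [], +""
  state = :start_column
  input.each_char do |ch|
    case state
    when :start_column                  # state 3: at the start of a column
      case ch
      when quote then state = :in_quoted
      when sep   then row << col; col = +""
      when "\n"  then row << col; rows << row; row, col = [], +""
      else col << ch; state = :in_column
      end
    when :in_column                     # state 7: inside an unquoted column
      case ch
      when sep   then row << col; col = +""; state = :start_column
      when "\n"  then row << col; rows << row; row, col = [], +""; state = :start_column
      else col << ch
      end
    when :in_quoted                     # state 4: inside a quoted column
      if ch == quote
        state = :maybe_closed
      else
        col << ch
      end
    when :maybe_closed                  # state 5: saw a quote inside a quoted column
      case ch
      when quote then col << quote; state = :in_quoted  # "" -> literal quote
      when sep   then row << col; col = +""; state = :start_column
      when "\n"  then row << col; rows << row; row, col = [], +""; state = :start_column
      end
    end
  end
  unless row.empty? && col.empty?       # finalize a dangling row (no trailing newline)
    row << col
    rows << row
  end
  rows
end
```

Note that censive deliberately avoids this per-character loop: it lets `StringScanner` and `String#split` chew through long runs at once, which is where most of its speed comes from.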
I worked on several optimizations and found the main one that the CSV library uses, which was fairly easy to implement and a little faster too.
What is the optimization? `String#split`?
BTW, how about opening a new issue at https://github.com/shreeve/censive about this topic instead of reusing this issue? It seems that the remaining topic isn't related to strscan.
@kou - Per your suggestion, I've moved this issue over to https://github.com/shreeve/censive/issues/1
The `strscan` library is great for high-performance work. To read "the next character", we can use the `getch` method. However, it returns the character at the current position and then advances the pointer to the next character. This is exactly what is needed in many cases. But sometimes we just need to access the current character without advancing the pointer. We can't always use `peek(1)`, because it is not multibyte aware. What we need here is a "get the current character" method. Also, there are times when we know the current character but want to advance to and get the next character. A method called `nextchar` would work great. Here's a diagram:
Adding these two methods, `currchar` and `nextchar`, to `strscan` gives us two very helpful and very performant primitives. Using these, I've been able to increase the parsing speed of one of my CSV libraries by 60%.
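The multibyte problem with `peek` is easy to demonstrate (`currchar` and `nextchar` are the proposed names, not existing strscan methods):

```ruby
require "strscan"

s = StringScanner.new("été")  # "é" is two bytes in UTF-8
s.peek(1).bytesize             # => 1   (peek is byte-based: it grabs only the
                               #         first byte of "é", an invalid UTF-8 fragment)
s.getch                        # => "é" (getch is encoding-aware and advances
                               #         past the full two-byte character)
```

A `currchar` would behave like `getch` here, returning the whole "é", but without moving the scan pointer.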