Open shreeve opened 1 year ago
@kou - The main thing now is that I can do comparisons directly with the main csv library and show that censive is significantly faster in all the cases that I've tested. It was slower in the case of the KEN_ALL.CSV data, but that has changed now, with these numbers:
censive/lib> time test-csv.rb
3bda79d8a1d5b062bae6bae9dee3db63 KEN_ALL.CSV (12337910 size)
test-csv.rb 0.98s user 0.06s system 79% cpu 1.312 total
censive/lib> time test-censive.rb
3bda79d8a1d5b062bae6bae9dee3db63 KEN_ALL.CSV (12337910 size)
test-censive.rb 0.81s user 0.04s system 87% cpu 0.978 total
In this case, with the KEN_ALL.CSV data, the results for this run are:
1.312 seconds / 0.978 seconds = 1.34x speedup for censive
Another advantage is that the code is a fraction of the size. The censive library is not fully done yet, but it is very usable as-is and already supports some nice features and a nice API. There is still work to do to finish it all off, though.
The main thing was @kou's suggestion on how to better use the strscan library.
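For readers unfamiliar with strscan, here is a minimal sketch of the general idea of walking a CSV line with `StringScanner`. This is not censive's actual code: `parse_line` and the four regex constants are hypothetical names, and the separator and quote are assumed to be `,` and `"`.

```ruby
require "strscan"

SEP   = /,/
QUOTE = /"/
PLAIN = /[^,"]*/ # run of characters inside an unquoted field
INNER = /[^"]*/  # run of characters inside a quoted field

# Hypothetical helper: split one CSV line into an array of fields.
def parse_line(line)
  scanner = StringScanner.new(line)
  fields = []
  loop do
    if scanner.skip(QUOTE)
      field = +""
      loop do
        field << scanner.scan(INNER)
        break unless scanner.skip(QUOTE) # closing quote (or unterminated field)
        break unless scanner.skip(QUOTE) # a second quote means an escaped quote
        field << '"'
      end
      fields << field
    else
      fields << scanner.scan(PLAIN)
    end
    break unless scanner.skip(SEP)
  end
  fields
end

p parse_line('a,"b,c",d')
```

The sketch only handles a single line; quoted fields that span newlines would need the scanner to run over the whole input instead.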
@kou - I'm also working on a much easier benchmarking tool called flay, which I plan to use instead of benchmark-driver. As that code comes along, I'd love to do several other benchmarks to compare with strscan and csv and several other libraries. This has been a lot of fun and I think it's nice to help speed up code that many people are using.
Here's the result of running winr to compare censive and csv:
And here is the winr config file for this:
{
  environments: [
    {
      begin: <<~"|".strip,
        require "digest/md5"
        path = ARGV[0] || "KEN_ALL.CSV"
        mode = path =~ /^ken/i ? "r:cp932" : "r"
        data = File.open(path, mode).read
      |
    }
  ],
  tasks: [
    {
      name: "csv",
      begin: "require 'csv'",
      script: <<~"|".strip,
        rows = CSV.parse(data)
        puts "%s %s (%d size)" % [Digest::MD5.hexdigest(rows.join), path, File.stat(path).size], ""
      |
    },
    {
      name: "censive",
      begin: "require 'censive'",
      script: <<~"|".strip,
        rows = Censive.parse(data)
        puts "%s %s (%d size)" % [Digest::MD5.hexdigest(rows.join), path, File.stat(path).size], ""
      |
    },
  ],
}
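Without winr, the same check (time a parse, then compare an MD5 of the joined rows) can be approximated in plain Ruby. This is only a rough sketch: `measure` is a hypothetical helper, not part of winr, and the inline sample data stands in for KEN_ALL.CSV.

```ruby
require "digest/md5"

# Hypothetical helper: run a block that returns rows, time it with a
# monotonic clock, and checksum the result the same way the config does.
def measure(name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  rows = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { name: name, md5: Digest::MD5.hexdigest(rows.join), seconds: elapsed }
end

data = "a,b\nc,d\n" # stand-in for the real file contents
result = measure("split") { data.lines(chomp: true).map { |l| l.split(",") } }
puts "%-8s %s %.6fs" % [result[:name], result[:md5], result[:seconds]]
```

To compare parsers, the block body would be `CSV.parse(data)` in one call and `Censive.parse(data)` in another; matching MD5 values indicate both produced the same rows.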
@kou - Including you here for reference. Now, I'll go back to strscan and will try to add the issue to allow scan_until(string).
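To illustrate what is being asked for: assuming a strscan version where `scan_until` only accepts a Regexp, a literal string has to be wrapped in a pattern first. The helper name `scan_until_string` below is hypothetical.

```ruby
require "strscan"

# Hypothetical helper: emulate scan_until for a literal string by escaping
# it into a Regexp. Building the Regexp on every call has a cost, so real
# code would likely cache it.
def scan_until_string(scanner, literal)
  scanner.scan_until(/#{Regexp.escape(literal)}/)
end

scanner = StringScanner.new('abc"def')
p scan_until_string(scanner, '"') # => "abc\""
p scanner.rest                    # => "def"
```

Accepting a string directly would avoid the Regexp construction and let the scanner use a plain substring search.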
Thanks!
Per @kou - Moving this conversation over here from https://github.com/ruby/strscan/issues/53
Yes, the code is here, and it is enabled via the @cheat instance variable, unless and until it gets deactivated. There is another, even larger, optimization where you could do this for the entire file if it's determined that the file doesn't contain a single quote. But this seems uncommon and is probably not worth it.
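A rough sketch of the whole-file variant, reading "a single quote" as "any quote character at all": if the data contains no quote, every separator and newline is structural, so plain `String#split` is enough. The name `parse_fast_or_nil` is hypothetical and this is not how censive implements `@cheat`.

```ruby
# Hypothetical fast path: returns parsed rows when the data contains no
# quote character, or nil so the caller can fall back to the full parser.
def parse_fast_or_nil(data, sep: ",", quote: '"')
  return nil if data.include?(quote)
  # The -1 limit keeps trailing empty fields such as "c,,".
  data.each_line(chomp: true).map { |line| line.split(sep, -1) }
end

p parse_fast_or_nil("a,b\nc,,\n")
p parse_fast_or_nil("a,\"b\"\n")
```

Note that an empty line yields an empty array here, which may differ from what a full parser returns, so a real implementation would need to decide how to treat blank lines.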