oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3.02k stars 185 forks source link

Prawn::Images::PNG#split_image_data is slow on TruffleRuby #2599

Open eregon opened 2 years ago

eregon commented 2 years ago

From https://github.com/prawnpdf/prawn/pull/1246#issuecomment-1040197295:

From the flamegraph it seems most of time (31.2% using Search) is spent in Prawn::Images::PNG#split_image_data and specifically in StringIO#read and String#byteslice called from there. The main cost is probably scanning the substring to compute the correct coderange (7-bit or valid for BINARY strings). This should improve when TruffleRuby has adopted TruffleString (WIP). It also indicates that in this case it might be better to compute the coderange lazily, since it's one of the rare cases where the coderange doesn't seem needed (for color.write data.read(color_bytes)), cc @djoooooe.

From a wider perspective it's an unfortunate case of Ruby not having a real "byte array" structure, and String being a relatively poor fit for it.

bjfish commented 2 years ago

Link to Ruby feature request for byte arrays

eregon commented 1 year ago

I have taken another look at this. To repro, clone prawn, bundle and then bundle exec rspec spec/prawn/images/png_spec.rb. To profile: TRUFFLERUBYOPT=--cpusampler=flamegraph bundle exec rspec spec/prawn/images/png_spec.rb

The whole test suite takes 80.61s locally which seems reasonable with truffleruby-dev.

That specific spec takes 37.44 seconds, so it's not solved yet. Taking a look at the profile, byteslice is no longer part of it but now there is these TruffleRuby.synchronized calls which were added to make StringIO thread-safe and might be some overhead. I also tried on TruffleRuby native 22.3 which doesn't have the TruffleRuby.synchronized but has TruffleString and that runs the spec in 33.62 seconds, so the synchronization doesn't seem the main overhead here. CRuby 3.1 takes <1 second for that spec.

I think the perf problem is due to that code using StringIO#putc and that in TruffleRuby (for master at least) just delegates to write which then goes into the pos < str.bytesize case and uses Primitive.string_splice and that creates two lazy Concat strings which is very expensive due to lots of allocations for single bytes like that. I think what we want is to use setbyte or equivalent logic for such a case, so go to the MutableTruffleString and then can be efficient and avoid allocations. There are also StringIO#write calls though in that method so we also need to handle that better, i.e., Primitive.string_splice needs to avoid doing lazy Concat strings when it gets a MutableTruffleString and writes a string of N bytes with byte_count_to_replace=N.

It seems best to fix Primitive.string_splice first as that should also cover putc without needing special logic in putc.