pythonicrubyist / creek

Ruby library for parsing large Excel files.
http://rubygems.org/gems/creek
MIT License
388 stars 109 forks source link

Large Memory and Speed performance regression from 2.5.3 to 2.6.3 #122

Open azrazalea-debtbook opened 7 months ago

azrazalea-debtbook commented 7 months ago

We upgraded to rails 7.1.3 and at the same time I upgraded our version of creek from 2.5.3 to 2.6.3.

After deploying the new rails version, we noticed our memory usage going up significantly as well as our time to process spreadsheets.

I tested the two versions in irb isolation and saw no difference, with a slight improvement for 2.6.3. However, when in docker the difference is massive.

(See https://github.com/pythonicrubyist/creek/issues/122#issuecomment-1957764405, my testing methodology wasn't great for local Mac and the difference is less but still large. This is not just docker)

[3] [app][development] pry(main)> Creek::VERSION
=> "2.5.3"
[4] [app][development] pry(main)> time = Time.now; creek = Creek::Book.new('performance-test.xlsx'); puts(Time.now - time);
8.105815712
[5] [app][development] pry(main)> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
38.974233879
[6] [app][development] pry(main)>
VmPeak:   709908 kB
VmSize:   593128 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    616644 kB
VmRSS:    518568 kB
VmData:   559256 kB
VmStk:      8188 kB
VmExe:         4 kB
VmLib:     24516 kB
VmPTE:      1372 kB
VmSwap:        0 kB
[1] [app][development] pry(main)> Creek::VERSION
=> "2.6.3"
[2] [app][development] pry(main)> time = Time.now; creek = Creek::Book.new('performance-test.xlsx'); puts(Time.now - time);
8.186497337
[3] [app][development] pry(main)> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
66.137462044
[4] [app][development] pry(main)>
VmPeak:  3852528 kB
VmSize:   603644 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   3765544 kB
VmRSS:    529308 kB
VmData:   569772 kB
VmStk:      8188 kB
VmExe:         4 kB
VmLib:     24516 kB
VmPTE:      1520 kB
VmSwap:        0 kB

As you can see in the above, the problem appears to be the row generator. Also, the peak memory usage went from 709,000 KB to 3,852,000 KB, 5 times more memory!

Here is the spreadsheet I used for testing (a subset of chicago's public crime data) https://drive.usercontent.google.com/download?id=1zWiCRYCS7Vs9EPZyOqcKr-eT59F7w3dm&export=download (give it a minute to start downloading, google has to virus scan it first and it takes forever)

azrazalea-debtbook commented 7 months ago

More system information:

Ruby 3.2.2 Docker arm64 alpine linux on Mac OS X host via Docker Desktop

azrazalea-debtbook commented 7 months ago

Related to #121 but more specific and I was able to find some data I can show you to reproduce it!

azrazalea-debtbook commented 7 months ago

Found that this is actually docker not rails. I spun up a fresh rails instance and wasn't producing it locally. Then, I used the rails default dockerfile modified to strip it back closer to essentials.

This is on arm64 architecture but I have the same problem in our production amd64 architecture docker images too.

# syntax = docker/dockerfile:1

# Make sure RUBY_VERSION matches the Ruby version in .ruby-version and Gemfile
ARG RUBY_VERSION=3.2.2
FROM registry.docker.com/library/ruby:$RUBY_VERSION-slim as base

# Rails app lives here
WORKDIR /rails

# Set production environment
ENV BUNDLE_DEPLOYMENT="1" \
    BUNDLE_PATH="/usr/local/bundle" \
    BUNDLE_WITHOUT="development"

# Throw-away build stage to reduce size of final image
FROM base as build

# Install packages needed to build gems
RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y build-essential git libvips pkg-config

# Copy application code
COPY . .

# Final stage for app image
FROM base

# Install packages needed for deployment
RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y curl libsqlite3-0 libvips && \
    rm -rf /var/lib/apt/lists /var/cache/apt/archives

# Copy built artifacts: gems, application
COPY --from=build /usr/local/bundle /usr/local/bundle
COPY --from=build /rails /rails

# Run and own only the runtime files as a non-root user for security
RUN useradd rails --create-home --shell /bin/bash && \
    chown -R rails:rails db log storage tmp
USER rails:rails

RUN gem install creek

CMD ["bash"]

And built it docker build .

Then ran docker docker run -it <image sha>

Finally inside docker I ran irb:

rails@5058bc9c86b6:/rails$ irb

Inside irb I did:

irb(main):001:0> require 'creek'
=> true
irb(main):003:0> time = Time.now; creek = Creek::Book.new('./performance-test.xlsx'); puts(Time.now - time);
7.558300586
=> nil
irb(main):004:0> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
52.017292079
=> nil

In a separate shell to the same docker container (docker exec -it <container SHA (NOT image sha> bash) I did:

rails@5058bc9c86b6:/rails$ ps -wef
UID        PID  PPID  C STIME TTY          TIME CMD
rails        1     0  0 17:07 pts/0    00:00:00 bash
rails       18     1 26 17:08 pts/0    00:01:00 irb
rails       20     0  0 17:12 pts/1    00:00:00 bash
rails       27    20  0 17:12 pts/1    00:00:00 ps -wef
rails@5058bc9c86b6:/rails$ grep -i vm /proc/18/status
VmPeak:  3552164 kB
VmSize:  3551392 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   3408084 kB
VmRSS:   3408084 kB
VmData:  3462272 kB
VmStk:      8188 kB
VmExe:         4 kB
VmLib:     12232 kB
VmPTE:      6840 kB
VmSwap:        0 kB
azrazalea-debtbook commented 7 months ago

Note: I tried this in an amd_64 version of the same container and got the same result. 3GB RAM usage in 2.6.3, 800MB RAM in 2.5.3. Still on Mac OS X host though

azrazalea-debtbook commented 7 months ago

Nevermind, I'm not used to Mac OS X and had trouble figuring out the peak memory usage.

Got gtime working though and amd64 Mac OS X Sonoma 14.3.1 ruby 3.2.2 shows a similar though smaller effect.

So this is not docker or rails specific, I just failed at figuring out peak memory usage on Mac and the speed didn't get way slower like it did in docker.

It is better locally at least. Only about 2.6x more memory not 5x.

lilithcarpenter@Liliths-MacBook-Pro tmp-rails % gtime -v irb
irb(main):001> require 'creek'
=> true
irb(main):002> Creek::VERSION
=> "2.6.3"
irb(main):003> time = Time.now; creek = Creek::Book.new('./performance-test.xlsx'); puts(Time.now - time);
6.727392
irb(main):004> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
32.480844
irb(main):005>
    Command being timed: "irb"
    User time (seconds): 36.39
    System time (seconds): 2.42
    Percent of CPU this job got: 21%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:56.84
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1691680
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 14
    Minor (reclaiming a frame) page faults: 654378
    Voluntary context switches: 1737
    Involuntary context switches: 20670
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 16384
    Exit status: 0
lilithcarpenter@Liliths-MacBook-Pro tmp-rails % gtime -v irb
irb(main):001> gem 'creek', '2.5.3'
=> true
irb(main):002> require 'creek'
=> true
irb(main):003> time = Time.now; creek = Creek::Book.new('./performance-test.xlsx'); puts(Time.now - time);
6.498211
irb(main):004> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
27.670476
irb(main):005>
    Command being timed: "irb"
    User time (seconds): 34.16
    System time (seconds): 0.57
    Percent of CPU this job got: 53%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:05.09
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 647200
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 15
    Minor (reclaiming a frame) page faults: 44136
    Voluntary context switches: 3867
    Involuntary context switches: 1967
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 16384
    Exit status: 0
azrazalea-debtbook commented 7 months ago

Identified commit 494ed0505930ef2a70c26204bf8231228b6aa963 as the culprit.

Version 2.6

irb(main):001> gem 'creek', '2.6.2'
=> true
irb(main):002> require 'creek'
=> true
irb(main):003> time = Time.now; creek = Creek::Book.new('performance-test.xlsx'); puts(Time.now - time);
6.848251
irb(main):004> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
35.083837
irb(main):005>
    Command being timed: "irb"
    User time (seconds): 39.23
    System time (seconds): 2.26
    Percent of CPU this job got: 39%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:45.88
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2344960
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 43
    Minor (reclaiming a frame) page faults: 632327
    Voluntary context switches: 797
    Involuntary context switches: 13858
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 16384
    Exit status: 0

Commit f8856ee3bd707f114a0f8aaa88145810e6af5052

lilithcarpenter@Liliths-MacBook-Pro test-creek % gtime -v bundle exec irb
irb(main):001:0> require 'creek'
=> true
irb(main):002:0> time = Time.now; creek = Creek::Book.new('performance-test.xlsx'); puts(Time.now - time);
6.800597
=> nil
irb(main):003:0> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
27.776927
=> nil
irb(main):004:0>
    Command being timed: "bundle exec irb"
    User time (seconds): 34.50
    System time (seconds): 0.51
    Percent of CPU this job got: 38%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:30.84
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 545728
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 29
    Minor (reclaiming a frame) page faults: 49358
    Voluntary context switches: 1595
    Involuntary context switches: 1773
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 16384
    Exit status: 0

Commit 494ed0505930ef2a70c26204bf8231228b6aa963

lilithcarpenter@Liliths-MacBook-Pro test-creek % gtime -v bundle exec irb
irb(main):001:0> require 'creek'
=> true
irb(main):002:0> time = Time.now; creek = Creek::Book.new('performance-test.xlsx'); puts(Time.now - time);
6.739267
=> nil
irb(main):003:0> time = Time.now; creek.sheets[0].simple_rows.map(&:inspect); puts(Time.now - time);
36.350612
=> nil
irb(main):004:0>
    Command being timed: "bundle exec irb"
    User time (seconds): 40.49
    System time (seconds): 2.09
    Percent of CPU this job got: 47%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:28.87
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2022608
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 105
    Minor (reclaiming a frame) page faults: 624585
    Voluntary context switches: 747
    Involuntary context switches: 20804
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 16384
    Exit status: 0