topazproject / topaz

A high performance ruby, written in RPython
topazruby.com
BSD 3-Clause "New" or "Revised" License
1k stars 88 forks source link

Bad IO#readlines performances on large file #336

Open pelletier opened 11 years ago

pelletier commented 11 years ago
$ cat /tmp/test.rb
f = File.open("/usr/share/dict/words", "r")
f.readlines()
f.close()

$ time ruby /tmp/test.rb
ruby /tmp/test.rb  0.09s user 0.02s system 92% cpu 0.125 total

$ time topaz /tmp/test.rb
topaz /tmp/test.rb  0.95s user 0.08s system 94% cpu 1.085 total

$ cat /usr/share/dict/words | wc -l
  235886

$ ruby --version
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin12.0.0]

Topaz was compiled with -Ojit on f5d88a0b7e0fe3e9293a9de90212651717321e80.

alex commented 11 years ago

I don't see an obvious reason for this, can you see if you can reproduce it by just calling read(8192) over and over again to read the full file?

pelletier commented 11 years ago

Something like that?

$ cat /tmp/test2.rb
f = File.open("/usr/share/dict/words", "r")
while f.read(8192) do
end
f.close()

$ time ruby /tmp/test2.rb
ruby /tmp/test2.rb  0.02s user 0.01s system 91% cpu 0.034 total

$ time topaz /tmp/test2.rb
topaz /tmp/test2.rb  0.01s user 0.01s system 84% cpu 0.025 total
alex commented 11 years ago

So it's a bug in readlines() itself, or something else readlines() calls, not in the actual reading of data.

pelletier commented 11 years ago

readlines() delegates the IO work to each_line(), and something wrong is probably happening there:

$ cat /tmp/test3.rb 
f = File.open("/usr/share/dict/words", "r")
f.each_line { |line| }
f.close()

$ time ruby /tmp/test3.rb 
ruby /tmp/test3.rb  0.09s user 0.01s system 96% cpu 0.105 total

$ time topaz /tmp/test3.rb 
topaz /tmp/test3.rb  0.81s user 0.03s system 98% cpu 0.852 total
alex commented 11 years ago

Oh, another thing occurred to me, this could just be the JIT kicking in and having slow warm up time, can you put ths in a loop and try doing it 10 times (or something) and see if it changes?

alex commented 11 years ago

I ran it in a loop ten times, we're still super slow, it's something inefficent in that algorithm.

alex commented 11 years ago

80% of the time is spent in String#split, that's totally bonkers so that's a good place to start fixing.

alex commented 11 years ago

I think the right startup point on this is to replace the split logic here with a buffer and indexes into it.

pelletier commented 11 years ago

OK I will try that when I have some spare time.

pelletier commented 11 years ago

I don't know what changed exactly since my last comment, but it seems to be a bit better now:

time ./bin/topaz test.rb
./bin/topaz test.rb  0.33s user 0.02s system 98% cpu 0.364 total
meatballhat commented 11 years ago

Hopefully addressed in #501

pelletier commented 11 years ago

I just tested with your PR: things are actually worse:

~/code/topaz > cat test.rb
f = File.open("words", "r")
f.each_line { |line| }
f.close()
~/code/topaz > cat words|wc -l
    5000

~/code/topaz > git checkout pr/501
Switched to branch 'pr/501'
~/code/topaz > time python -m topaz test.rb
 python -m topaz test.rb  91.35s user 0.86s system 82% cpu 1:52.30 total

~/code/topaz > git checkout master
Switched to branch 'master'
~/code/topaz > time python -m topaz test.rb
python -m topaz test.rb  7.00s user 0.42s system 89% cpu 8.315 total

~/code/topaz > time ruby test.rb
ruby test.rb  0.02s user 0.01s system 90% cpu 0.029 total
meatballhat commented 11 years ago

Ha! Well, I can't take much credit for the buffering implementation, but I'd like to think improving it over in rpython is preferable to tuning it only for Topaz. Still waiting on feedback from @alex on #501 :-) On Mar 29, 2013 11:01 AM, "Thomas Pelletier" notifications@github.com wrote:

I just tested with your PR: things are actually worse:

~/code/topaz > cat test.rb f = File.open("words", "r") f.each_line { |line| } f.close() ~/code/topaz > cat words|wc -l 5000

~/code/topaz > git checkout pr/501 Switched to branch 'pr/501' ~/code/topaz > time python -m topaz test.rb python -m topaz test.rb 91.35s user 0.86s system 82% cpu 1:52.30 total

~/code/topaz > git checkout master Switched to branch 'master' ~/code/topaz > time python -m topaz test.rb python -m topaz test.rb 7.00s user 0.42s system 89% cpu 8.315 total

~/code/topaz > time ruby test.rb ruby test.rb 0.02s user 0.01s system 90% cpu 0.029 total

— Reply to this email directly or view it on GitHubhttps://github.com/topazproject/topaz/issues/336#issuecomment-15644514 .