inukshuk / jekyll-scholar

jekyll extensions for the blogging scholar
MIT License
1.12k stars 102 forks source link

when serving, page rendering takes longer each time #335

Open sneakers-the-rat opened 2 years ago

sneakers-the-rat commented 2 years ago

Hello! Really appreciate the plugin. I've noticed it might be that the version of Jekyll i'm using isn't supported (4.2.1) but i've noticed when i have the jekyll-scholar gem loaded and am rendering a bibliography on a page, every time that I change the source and the pages rebuild they take n times longer -- eg. if original page takes 20 seconds to build, second build takes 40, third takes 60, and so on. I've tried --verbose, --profile and a bunch of other profiling methods to figure out what's taking longer but can't quite crack it -- though with --trace it's always in the middle of some jekyll-scholar method when i call it.

Any idea what might be causing the time doubling? some cache somewhere maybe? If you point me in the right direction I can try and PR since I've seen elsewhere you're mostly in maintenance mode <3

inukshuk commented 2 years ago

Huh, that sounds bad. Thanks for reporting!

I'm relatively certain that this hasn't been the case in the past, though, in general we could speed up incremental builds by using the Cache API. But what you describe definitely sounds like an additional bug. If the builds take incrementally longer, we should first figure out which code paths keep repeating and why.

I'd try to simplify the page as much as possible. That is, disable all other plugins and use the @bibliography tag only once. Then either install byebug and use breakpoints or add some puts statements to jekyll-scholar to trace what's being run during the incremental builds. As a start, I'd trace these calls:

These should tell us if the bibliography tag gets initialized/rendered repeatedly. If the builds take incrementally longer, I'd assume that either the rendering happens repeatedly or, if it does not, then the next step would be to time the individual steps of the render method; especially here:

I'd time the individual steps to see what's consuming incrementally more time then dig down further.

Before rendering the bibliography, styles and locale must be loaded. Currently, both are loaded into static memory, so on incremental builds the styles and locale files should not have to be reloaded. This should make successive builds quicker, but since that's not the case it's also worth testing this assumption. Styles are loaded into this static hash:

I'd also put breakpoints or debug output there to check whether or not the styles are being loaded for every build; I doubt this is directly related to the issue, but it would be good to know whether the plugin is completely reloaded during incremental builds or whether the styles stay in memory.

sneakers-the-rat commented 2 years ago

OK I was able to profile the build like this (my first ruby!)

require "jekyll"
require "ruby-prof"

result = RubyProf.profile do

  options = {
        "source"      => './',
        "destination" => './_site',
        "livereload"  => true,
        'watch' => true,
        'verbose' => true
      }

  Jekyll::Commands::Build.process(options)
end

stack_printer = RubyProf::CallStackPrinter.new(result)
flat_printer = RubyProf::FlatPrinter.new(result)
graph_printer = RubyProf::GraphHtmlPrinter.new(result)

File.open("_profile_stack.html", 'w') { |file| stack_printer.print(file) }
File.open("_profile_flat.txt", 'w') { |file| flat_printer.print(file) }
File.open("_profile_graph.html", 'w') { |file| graph_printer.print(file) }

and so I profiled it doing a single build (144s) and a run where I edited the document so there were two builds (145s and 376s)

I've attached the profiling results -- three formats, detailed in ruby-prof docs

It looks like BibTeX is calling the children of BibTeX::Entry#parse_month 3x as much in the double build than the single build:

Single build:

 %self      total      self      wait     child     calls  name
 18.14     97.151    27.229     0.000    69.922 27506072   BibTeX::Value#eql?
 17.09     65.541    25.660     0.000    39.881 27624376   BibTeX::Value#to_s
 11.18     25.221    16.790     0.000     8.431 27624376   BibTeX::Value#value
  6.50    107.606     9.766     0.000    97.840    24472   BibTeX::Entry#parse_month
  5.62      8.432     8.432     0.000     0.000 27627632   BibTeX::Value#atomic?
  5.42      8.137     8.137     0.000     0.000 55423220   Hash#key?
  4.71      7.064     7.064     0.000     0.000 29489151   Symbol#to_s
  3.14      4.709     4.709     0.000     0.000 27506343   String#eql?
  1.27      1.909     1.909     0.000     0.000  6036068   StringScanner#scan
  1.17      2.265     1.752     0.000     0.513  2004900   BibTeX::Lexer#push
  1.06      3.933     1.593     0.000     2.341  1065795   BibTeX::Lexer#parse_bibtex
  0.89    119.409     1.330     0.000   118.079    23124  *Racc::Parser#_racc_do_parse_c

Double build:

 %self      total      self      wait     child     calls  name
 19.82    307.347    85.552     0.002   221.793 82537056   BibTeX::Value#eql?
 18.99    206.892    81.963     0.003   124.926 82655352   BibTeX::Value#to_s
 11.97     78.859    51.656     0.007    27.196 82655352   BibTeX::Value#value
  6.95    338.150    29.981     0.002   308.167    24472   BibTeX::Entry#parse_month
  6.36     27.479    27.467     0.012     0.000 165484755   Hash#key?
  6.30     27.197    27.197     0.000     0.000 82658608   BibTeX::Value#atomic?
  4.41     19.031    19.027     0.004     0.000 83907880   Symbol#to_s
  3.53     15.255    15.251     0.004     0.000 82537201   String#eql?
  0.46      1.979     1.979     0.000     0.000  6034576   StringScanner#scan
  0.44      2.462     1.916     0.000     0.546  2004900   BibTeX::Lexer#push
  0.39      4.164     1.699     0.000     2.465  1065795   BibTeX::Lexer#parse_bibtex
  0.33    348.408     1.412     0.000   346.996    23124  *Racc::Parser#_racc_do_parse_c

I don't have more time now but thank you for your explanation, will dig into the code and try the more explicit breakpoints later if that would be helpful :) Not sure why the children would be getting called 2x as often as they should when parse_month is called the same amount, but i also haven't looked at why it's happening! Scholar is the thing that is taking by far the longest out of anything in the build (but i really appreciate your work on it!), and so it would be nice to figure out some way to speed that up and i'll set aside some time to help on that :)

inukshuk commented 2 years ago

Thanks! I'd run this a few more times. The incremental build-up you describe in the OP would suggest some conceptual bug that causes incremental builds to take successively longer and should become more apparent with each extra build.

So far this points looks like the time is being spent parsing the bibliography. This is understandable and something that has not been optimized: for each run the bibliography is parsed again, because the BibTeX file may have changed; during builds with multiple sources the bibliography may also be parsed multiple times because the parse results are discarded in between. Both could be addressed by caching the parse results and utilizing the Cache API to detect changes on the source files. Still, just parsing the bibliography over and over again should not lead to increasingly longer build times, so there must be something else going on in addition to this.

sneakers-the-rat commented 2 years ago

sorry mebs i wasn't clear! there are two separate runs,

so things in the second run (with two builds) that take 3x as long instead of 2x as long are the things that cause the time to increase. I've run it again and every time those same calls take longer :) will be tied up for another few weeks but want to help make this tool work!

sneakers-the-rat commented 2 years ago

This seems to work? https://github.com/sneakers-the-rat/jekyll-scholar/commit/4d63f88c909b136fd0b0b5533e7f7a0d3dc1b3f2

went from

Configuration file: _config.yml
            Source: 
       Destination: /_site
 Incremental build: enabled
      Generating... 
                    done in 48.744 seconds.
 Auto-regeneration: enabled for ''
LiveReload address: http://127.0.0.1:35729
    Server address: http://127.0.0.1:4000/
  Server running... press ctrl-c to stop.
        LiveReload: Browser connected
[2021-10-08 18:02:14] ERROR `/favicon.ico' not found.
      Regenerating: 1 file(s) changed at 2021-10-08 18:02:25
                    _sections/2_2_federation.md
                    ...done in 111.003382 seconds.

      Regenerating: 1 file(s) changed at 2021-10-08 18:04:22
                    _sections/2_2_federation.md
                    ...done in 191.76235 seconds.

to

Incremental build: enabled
      Generating... 
                    done in 0.043 seconds.
 Auto-regeneration: enabled for '
LiveReload address: http://127.0.0.1:35729
    Server address: http://127.0.0.1:4000/
  Server running... press ctrl-c to stop.
        LiveReload: Browser connected
[2021-10-08 18:14:38] ERROR `/favicon.ico' not found.
      Regenerating: 1 file(s) changed at 2021-10-08 18:14:49
                    _sections/2_2_federation.md
                    ...done in 6.407052 seconds.

      Regenerating: 1 file(s) changed at 2021-10-08 18:14:58
                    _sections/2_2_federation.md
                    ...done in 6.221516 seconds.

[2021-10-08 18:15:12] ERROR `/favicon.ico' not found.
      Regenerating: 1 file(s) changed at 2021-10-08 18:15:29
                    _sections/2_2_federation.md
                    ...done in 5.949184 seconds.

not bad! I really truly do not know how to program in Ruby so this probably breaks other stuff, but it seems to at least indicate that the problem is somewhere in the way the BibTeX library is parsing things, or how often it's called, etc.

sneakers-the-rat commented 2 years ago

should I PR?

inukshuk commented 2 years ago

Yes, it would be great to use caching for the bibliography. Though we should use the bibliography paths as the cache key in order to support parsing multiple bibliographies. We'd also have to look into how we can clear the path if the bibliography file changes -- though we could probably just document this (i.e. you need to restart if you change the bibliography file).

sneakers-the-rat commented 2 years ago

aha makes sense, still not really sure how ruby works, specifically how instances vs classes work exactly, but i can take a crack at it.

I tried to mirror your cache clearing code, which seems like it checks if the paths change, but yes I don't know how to invalidate if contents of each file change, will investigate. For now it seems to be working, it does have missing references, presumably from not clearing the cache, but if i do a restart then it catches all the changes after that, so not quite sure how that's working.

sneakers-the-rat commented 2 years ago

circling back around, i'm going to work on some other extensions, and wanted to know if you wanted me to work this into a PR? It still suffers from not adding new citatations that are added after jekyll serve is called, which i imagine is from not properly clear the cache. I recognise you're probably pretty strapped for time, but if you have any pointers on what cache needs to be invalidated i'd be happy to PR :)

inukshuk commented 2 years ago

@sneakers-the-rat sorry, I haven't been following this recently, but having looked at it quickly, I think the blocking issue in the other thread is resolved now. Would be more than happy to merge a PR opting into the cache API.

opyate commented 2 years ago

I tried the OP's code too. Before:

done in 330.149 seconds.

After

done in 23.193 seconds.

I would be very happy if this could be integrated :+1: