inukshuk / jekyll-scholar

jekyll extensions for the blogging scholar
MIT License

Incremental build affected by large bib file #256

Open thrau opened 5 years ago

thrau commented 5 years ago

I have a 700kb bibtex file with about a thousand entries, and one file to render it. So building the entire source is naturally a little slow (22 seconds).

However, I found that jekyll --watch --incremental takes the same amount of time when building files that have no dependency on the bibtex file.

My publications.md file is below. Interestingly, when I add a query to filter, e.g., only publications from 2018 (50-100 or so), the build speeds up drastically (22s -> 2s).

Any idea what the problem could be? In particular, it seems odd to me that the incremental build of unrelated files is affected by the number of bibtex entries rendered in other files. I'm not familiar enough with jekyll to understand whether this is a jekyll-related problem or has something to do with the plugin.

---
---

## Publications

{% bibliography %}

inukshuk commented 5 years ago

Is it possible that the detail pages are being generated? If I remember correctly, all the generator plugins will run regardless of which pages need to be built. The long time it takes is probably caused by loading the style and processing the references; you're right that this should not be necessary unless building pages containing the actual references, so it would be great to improve that!

thrau commented 5 years ago

Detail pages are not being generated in my configuration.

cardi commented 5 years ago

I've had a similar experience with lengthy build times, which likely stem from generating multiple bibliographies from citations across different pages, in addition to a monolithic page that iterates through all references.

Using Jekyll 3.8.5 with jekyll-scholar 5.14.1.

references.bib contains 203 entries, 197KB, with an ACM SIG proceedings style.

Some (rough) benchmarks:

Given that most of the detail pages won't change too often from the underlying BibTeX or style used, a one-time expensive cost in the initial generation is manageable.

Generating references and detail pages might benefit from the upcoming Jekyll 4.0 Cache API.

I'm not familiar with Ruby or the internal workings of jekyll/jekyll-scholar, but if you can point to where I or someone else might start, that would be helpful.

inukshuk commented 5 years ago

To speed up the generation of detail pages, you could add some conditions around here. After generating the detail pages, we could write some kind of manifest or save a timestamp which we could compare to the modification date of the bib file. That way, we'd generate details only if the bib file has changed since the last time the detail pages were generated. (A more granular approach, at the entry level, is probably not worth the effort.)
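The timestamp idea could be sketched roughly as follows. This is only an illustration: `details_stale?`, `mark_details_generated`, and the stamp file are hypothetical names, not part of jekyll-scholar.

```ruby
require "fileutils"

# Hypothetical helper: true when the detail pages should be regenerated,
# i.e. when no stamp file exists yet or the bib file is newer than the stamp.
def details_stale?(bib_path, stamp_path)
  return true unless File.exist?(stamp_path)

  File.mtime(bib_path) > File.mtime(stamp_path)
end

# Hypothetical helper: record that the detail pages were generated just now
# by creating the stamp file or updating its modification time.
def mark_details_generated(stamp_path)
  FileUtils.mkdir_p(File.dirname(stamp_path))
  FileUtils.touch(stamp_path)
end
```

A generator would check `details_stale?` before building the detail pages and call `mark_details_generated` afterwards. Modification times are cheap to compare but change on any rewrite of the file, even when its content is identical.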

cardi commented 5 years ago

Thanks. That seems like a reasonable approach; I'll see if I can make a first pass over the next few days.

The other aspect in build times is, I think, building bibliographies from citations (e.g., {% bibliography --cited %}).

If all the entries in the bibliography are parsed, and then references for each cited entry are being built each time the bibliography command is called, I could see caching the entry in some way to plausibly save lots of time.
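As a rough sketch of that caching idea, the wrapper below renders each cited entry at most once and reuses the result on later calls. `ReferenceCache` and the renderer block are hypothetical stand-ins; jekyll-scholar's actual rendering code is not shown here. Jekyll 4's `Jekyll::Cache#getset` offers similar fetch-or-compute semantics backed by disk.

```ruby
# Hypothetical in-memory cache: maps a citation key to its formatted reference.
class ReferenceCache
  # The block is the expensive step (key -> formatted reference string).
  def initialize(&renderer)
    @renderer = renderer
    @store = {}
  end

  # Return the cached reference, rendering it only on the first request.
  def fetch(key)
    @store.fetch(key) { @store[key] = @renderer.call(key) }
  end
end

# Usage: the renderer block here is a placeholder for real formatting.
cache = ReferenceCache.new { |key| "formatted reference for #{key}" }
cache.fetch("doe2018") # rendered
cache.fetch("doe2018") # served from the cache
```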

cardi commented 5 years ago

I've prototyped something quickly to use the Cache API when generating details pages. The results are looking very promising:

| run | files | total (sec) | average (sec) | median (sec) | min (sec) | max (sec) |
|---|---|---|---|---|---|---|
| first run (all cache misses) | 206 | 205.353 | 0.996859 | 1.007 | 0.333644 | 2.166210 |
| second run (all cache hits) | 206 | 0.040451 | 0.000196364 | 0.000186 | 0.000141 | 0.000586 |

Results may vary, since the underlying cache loads each entry from disk the first time it's called. (Perhaps, as the Cache API evolves, jekyll could warm up the cache by loading all of it from disk into memory, or use a different backing store, but I don't anticipate working on that anytime soon.)

This should work well, especially for incremental builds: jekyll+jekyll-scholar will only build new BibTeX entries.

Some edge cases I haven't quite thought about yet, that won't trigger a rebuild of the details pages:

I'd expect the above operations to happen rarely, so I think incurring the expensive cost is OK, but at the moment there are two ways to trigger a complete rebuild:

  1. delete .jekyll-cache directory
  2. modify _config.yml

Perhaps there will be a flag that one can pass to jekyll build that clears the cache when 4.0 is released.

inukshuk commented 5 years ago

Looks great! Did you figure out where scholar was modifying site.config?

Regarding the cache invalidation, perhaps we could create some kind of manifest file for the details pages with a checksum of the BibTeX file? That way we could detect when a rebuild is required.
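A minimal sketch of such a manifest, assuming a small JSON file that stores a SHA-256 checksum of the BibTeX file. The function names and the manifest layout are hypothetical, chosen only for illustration.

```ruby
require "digest"
require "json"

# Checksum of the BibTeX file's current content.
def bib_checksum(bib_path)
  Digest::SHA256.file(bib_path).hexdigest
end

# Hypothetical check: a rebuild is needed when the manifest is missing,
# unreadable, or records a different checksum than the current bib file.
def rebuild_required?(bib_path, manifest_path)
  return true unless File.exist?(manifest_path)

  manifest = JSON.parse(File.read(manifest_path))
  manifest["bib_sha256"] != bib_checksum(bib_path)
rescue JSON::ParserError
  true
end

# Hypothetical step to run after the detail pages have been generated.
def write_manifest(bib_path, manifest_path)
  File.write(manifest_path, JSON.generate({ "bib_sha256" => bib_checksum(bib_path) }))
end
```

Unlike a timestamp, a checksum ignores rewrites that leave the content unchanged, at the cost of reading the whole file on every build.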

cardi commented 5 years ago

> Looks great! Did you figure out where scholar was modifying site.config?

I haven't, but I plan on taking a closer look after I've polished the caching code.

> Regarding the cache invalidation, perhaps we could create some kind of manifest file for the details pages with a checksum of the BibTeX file? That way we could detect when a rebuild is required.

That seems like a good approach that will take care of most of the issues, even if it is a bit heavy-handed. I suppose we could do the same with the layout for the details page.

Another, possibly easier approach that I've just thought of is to cache the hash of each BibTeX entry: if the cached hash doesn't match or doesn't exist, then rebuild that particular entry. I think this would only work if the BibTeX object (dictionary?) in Ruby is ordered in a consistent, deterministic way.
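One way to sidestep the ordering concern is to sort the fields before hashing. Ruby hashes preserve insertion order, so two entries with the same fields in a different order would otherwise serialize differently. The sketch below uses plain Ruby hashes as a stand-in for parsed BibTeX entries; `entry_digest` and `changed_keys` are hypothetical names.

```ruby
require "digest"
require "json"

# Digest of one entry, independent of the order in which fields were parsed.
# `entry` is a Hash of field name => value (stand-in for a BibTeX entry).
def entry_digest(entry)
  canonical = entry.map { |field, value| [field.to_s, value.to_s] }.sort
  Digest::SHA256.hexdigest(JSON.generate(canonical))
end

# Keys of entries whose digest is missing from, or differs from, the cache.
# `entries` maps citation key => entry; `cached` maps citation key => digest.
def changed_keys(entries, cached)
  entries.select { |key, entry| cached[key] != entry_digest(entry) }.keys
end
```

Only the entries returned by `changed_keys` would need their detail pages rebuilt; the rest could be reused.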

sneakers-the-rat commented 2 years ago

@cardi still working on this? want to join forces on a PR with what we're talking about here? https://github.com/inukshuk/jekyll-scholar/issues/335

cardi commented 2 years ago

> @cardi still working on this? want to join forces on a PR with what we're talking about here? #335

It's been a while since I've looked at this, and I'm still interested in having this feature implemented.

I made a first pass at using Jekyll's Cache API here: https://github.com/cardi/jekyll-scholar/tree/cached-details, but a critical blocker (that may or may not have been resolved since) is that any change to site.config internally will invalidate the cache and rebuild everything.

While I documented the issue and my findings in https://github.com/inukshuk/jekyll-scholar/issues/262, I don't have a proposed fix for it. (Maybe storing some of jekyll-scholar's settings in a different variable outside of site.config?)

I think https://github.com/inukshuk/jekyll-scholar/issues/262 has to be resolved before caching can be implemented and used.
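To illustrate the suggestion of keeping settings outside site.config, here is a sketch that merges defaults into a separate hash instead of writing them back. The digest is only a stand-in for however the cache detects config changes, and `SCHOLAR_DEFAULTS`, its values, `config_digest`, and `scholar_settings` are hypothetical.

```ruby
require "digest"

# Hypothetical defaults, for illustration only.
SCHOLAR_DEFAULTS = { "style" => "apa", "source" => "./_bibliography" }.freeze

# Stand-in for a config fingerprint: it changes whenever the config changes.
def config_digest(config)
  Digest::SHA256.hexdigest(Marshal.dump(config))
end

# Non-mutating merge: returns the effective settings as a new hash and
# leaves the site config untouched, so its fingerprint stays stable.
def scholar_settings(config)
  SCHOLAR_DEFAULTS.merge(config.fetch("scholar", {}))
end
```

Writing the merged defaults back into the config, by contrast, would change the fingerprint on every build even though the user edited nothing.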

inukshuk commented 2 years ago

@cardi I took a quick look at this and I think that the BibTeX converter merged in the default scholar config during initialization. Give it another go, to see if this fixes the issue you'd been seeing.