inukshuk / citeproc-ruby

A Citation Style Language (CSL) Cite Processor
101 stars 22 forks source link

performance, possible improvements #53

Closed jrochkind closed 6 years ago

jrochkind commented 6 years ago

I'm evaluating using this in a server side (web) application. For my use case, I just actually need to take a single citation (which I already have in a csl-data-json hash), and output it in one of a handful of styles. Following the directions in the README, I get:

csl_hash # a csl-data-json style hash, assume exists
cp = CiteProc::Processor.new style: 'chicago-note-bibliography', format: 'html'
cp.import [csl]
cp.render :bibliography, id: csl.first["id"]

This works, but it's slow. On my macbook, 100-200ms. Too slow for me.

However, good news, profiling reveals that it's slow mostly only because each time it loads and parses the style and locale files. The slow is in CSL::Loader.load, which is called twice, once with chicago-note-bibliography and once with the (default) locale en-US.

If just to play around, I modify CSL::Loader.load to cache in memory, it saves an order of magnitude of time, down to 10-20ms to process.

One probably wouldn't want to actually do that, because then any program using would cache whether it wanted to or not, eventually holding all styles and locales ever used in memory. Plus there could be concurrency issues.

But it would be good if there were some way for the caller to load the style/locale manually, and pass it in. Perhaps something like:

chicago_note_bibliography = CSL::Loader.load("chicago-note-bibliography")
locale = CSL::Loader.load("en-US")

# now I can re-use them to my hearts content over and over,
# without paying the loading cost...

 cp = CiteProc::Processor.new style: chicago_note_bibliography, locale: locale,  format: 'html'
 cp.import [csl]
 cp.render etc

  # do it again without reload

 cp = CiteProc::Processor.new style: chicago_note_bibliography, locale: locale,  format: 'html'
 cp.import [another_csl]
 cp.render etc

That's not quite right, CSL::Loader is actually a module, I got to figure out where it is used, plus of course this just doesn't work at present. It could probably be made to work, I started looking at the code, and got a bit lost in all the objects across three gems involved, but perhaps it could be done fairly straightforwardly. I'm still investigating, toward a potential PR (in one or more of the involved gems).

Or maybe there's other existing API not mentioned in the README that would be better to use for this, either as-is, or with some modification.

Do you have any thoughts? Thanks!

jrochkind commented 6 years ago

Hmm, something close to this may almost work already. It's actually:

chicago_note_bibliography = CSL::Style.load("chicago-note-bibliography")
locale = CSL::Locale.load("en-US")

If I pass those in to cp.render, the style is not loaded again, it uses the one I already loaded. (not totally sure if this is actually thread-safe though. Also not sure if it just works accidentally).

the locale is still loaded again though, somewhere it turns the locale I pass in back to a string and then it's loaded again using that string as input.

Hmm.

inukshuk commented 6 years ago

It helps to distinguish between what I call the citation 'renderer' and 'processor' part. The renderer implements CSL and is fairly complete; the processor has many other concerns that will vary a great deal based on your application: parses input, processes output, keeps track of which items have been cited before, and their order (for stuff like 'ibid.'), decides if disambiguation is necessary etc. You'll notice that the renderer, in theory, can be almost stateless, whereas the processor is inherently stateful. At the moment, the processor part is very limited in citeproc-ruby, because my own use cases have always been limited to generating bibliographies in one go as opposed to managing all the citations in a document and the corresponding bibliography, based on dynamic input.

That said, you're absolutely right: parsing styles and locales takes a long time and it obviously makes sense to keep instances around. In the context of the library, this would happen in the processor: you create a processor instance with style and locales and use for all your processing needs. If your use case is inherently a stateless one, like you say, you're in the position I usually find myself in as well: you just need to render single (or sometimes multiple) citations and then forget about them (i.e., you're not concerned about disambiguation, ibidems and all that stuff). In this case, typically you don't need a processor at all: this is basically the conclusion you arrived at, I just wanted to give you my rationale for it. So you'd load the styles and locales you need, create a renderer, and use those instances.

If you do not modify the styles or locales they should be thread safe; the renderer needs to track some state while it renders, so I would definitely put use a thread local instance for it (to be on the safe side, it would probably be best to keep style and locale in thread local too, especially if you don't do any dynamic modifications, it's safer and only a little memory overhead).

With the caveat that I'll have to familiarize myself with the code again, I would say that there is no reason for the renderer to keep loading the same locale again, but that's something we can figure out.

inukshuk commented 6 years ago

As an example, here is how we create the renderer in jekyll-scholar and then use it to render bibliographies.

jrochkind commented 6 years ago

I found the code where it insists on reloading the locale, which I can change to be analagous to the code for loading the style, which does not. I'll make a PR.

As far as your example from jekyll-scholar (which is helpful, thanks! And I think would be a helpful example in the README for one of the involved gems maybe?), my concern is that a CiteProc::Ruby::Renderer may not be safe to use concurrently between multiple threads. That might not matter in jekyll, but does matter in Rails or other web served apps, unless you do some non-trivial and error-prone things to try and make it not matter.

The solution "cache it in Thread.local, not globally" doens't really work, because depending on server environment, you can't neccesarily depend on threads being re-used at all. Caching things in Thread.local can end up being equivalent to re-creating it for every request, if every request ends up getting a brand new thread. It's not a super reliable solution.

inukshuk commented 6 years ago

The renderer is definitely not thread-safe, because it keeps some state during the rendition: to render an item, the renderer basically walks down the CSL nodes; some state is stored in the citation item as the renderer visits each node; this is probably thread safe even for the more complicated CSL features like groups, because you can make sure not to share citation items between threads. But there is also some state kept in the renderer instance. To make rendering thread safe we'd have to pass down the state to every node which may not be easy to do conceptually, because order is really important: e.g., if you have a style that replaces an author's name in subsequent references with something like ----. So I think we'd have to use a mutex for the render method to make it thread safe.

Looking at some of the render methods, you see the state being stored and then at the end cleared again. It's this state, I think which is a conceptual problem so those methods would have to be protected by a mutex.

jrochkind commented 6 years ago

OK, thanks. Presumably that also means a CiteProc::Processor is not thread safe, since it uses a renderer. But no good way to share that anyway, since a given processor already has a certain list of known records in it's list.

By making sure to re-use Locale and Style (which we think are probably thread-safe, so long as you aren't intentionally mutating them), creating a new CiteProc::Processor and using it gets me down to reasonable if not awesome performance -- 10-20ms instead of the 100-200ms it took including loading the Style and Locale. So I think that's good enough for now!

I will PR a change to citeproc-ruby to allow you to re-use an already loaded locale with a new renderer/CiteProc::Processor. It is simple.

jrochkind commented 6 years ago

@inukshuk hmm, in your example of using the Renderer directly -- what data structure has to be passed in to the renderer, in your code what sort of object is citation_item_for(entry, index). When I try passing in an ordinary ruby hash (that is in csl-data.json format), I get NoMethodError: undefined methoddata' for #`.

Ah, okay, I see in your code I need to create a CiteProc::CitationItem.new. Which looks maybe straightforward to do from a csl-data hash. I'll try that next!

I wonder if it would be convenient to have a simpler more convenient api for "render a single citation from csl-data hash".

jrochkind commented 6 years ago

got it working. hmm, making it work for "re-use already loaded locale/style" is actually harder at this lower level using a renderer directly, than it was using the higher level CiteProc::Processor. But I'm gonna figure it out.

This code gets a bit... tangled.

inukshuk commented 6 years ago

While I blush a bit at the 20ms, I'll say that the library has not been optimized for performance yet at all -- there is much room for improvement (but with CSL/CiteProc there are also many rabbit holes, in my experience!).

I do agree that it would be nice to have a simple API to 'just render this item' -- it is the use case that I think a lot of people want. PRs much appreciated, if you want to take a stab at it.

jrochkind commented 6 years ago

Actually, using the lower-level Renderer API and using a patch to let it use already-loaded locale, I'm getting down to 7-9ms. So much better than the 100-200ms I started out with! Trying to figure out the best place to actually PR the patch.

Definitely either the README needs an example of how to do this in a performant way, and/or it needs a simpler API. Maybe both. I'll see what I can do once I get the first patch in.

jrochkind commented 6 years ago

@inukshuk in the example you showed me before, you pass a style: argument to Renderer.new:

@renderer = CiteProc::Ruby::Renderer.new :format => 'html', :style => style, :locale => config['locale']

But as far as I can tell, that style argument is completely ignored, it's not actually saved anywhere.

Instead, when you actually call render on the renderer, you need to pass in styles(style).bibliography as the second arg. I'm not sure what you can that style#bibliography thing ; is that a style too? A substyle? A "mode"? And it can apparently be different on every call to render, it does not need to be set on Renderer.new (and in fact cannot be even as a default, that style arg is simply ignored).

Does this seem like I have it right? Thank you!

inukshuk commented 6 years ago

Oh, yes, that sounds right -- you actually don't pass a default style to the renderer. In CSL there are two root rendering nodes: citation and bibliography (for citations and references which go into a bibliography); you pass one or the other to the render function based on what you want to render.

mattluce commented 6 years ago

So I'm not clear from this thread, is there a way to pass in the locale/style with the existing code? Any performance gains would be welcome.

jrochkind commented 6 years ago

@mattluce style, yes. locale, no.

Pretty much anywhere in existing code you pass in a style as a local file path or URL, you can also pass in a CSL::Style (that you loaded previously with CSL::Style.load), and it will use it, without having to load it again.

By "pretty much anywhere", I mean the places I was using/testing, and the places I found in the code. :)

Locale, you can pass in a CSL::Locale -- but the existing code will end up trying to load it again anyway (often succeeding, if you loaded it from local path, so you won't see an error, but it'll be extra time to (re)-load). #54 fixes that.

mattluce commented 6 years ago

Unfortunately, I'm not seeing any performance gains when reusing a loaded style object. cp = CiteProc::Processor.new style: $style, format: 'html'

Anyway, performance improvements would be greatly welcomed. I'm not a Ruby developer, otherwise I'd be happy to contribute.

jrochkind commented 6 years ago

@mattluce interesting, I definitely saw big improvements, and benchmarked it in my local test sandbox app.

(Of course, you've still got to load the style once, so if you benchmark loading the style in advance plus doing ONE processor instantiation, it will be exactly the same as one processor instantiation loading the style itself. Perf gains are only there if you load a style and then re-use it with multiple processors and/or renderers).

Let me check my test code to see what API's I was using... yep, it was the same one as you CiteProc::Processor.new style: $style, format: 'html', as well as additionally some other API with a direct CiteProc::Ruby::Renderer.new. I suggested an API to make using a renderer directly a bit easier at #56, and some (IMO) README improvements at #57

jrochkind commented 6 years ago

Hey @inukshuk, any chance of getting a release with the CSL::Locale re-use perf improvement in there from #54?

I was thinking it might make sense to get some of the other PR's I suggested in before a release, but I guess those are more controversial/challenging/need-thought. #54 is already merged, and I could really use it in a release, and don't want to have my production app depend on a fork! What do you think?

inukshuk commented 6 years ago

Yes, sorry, I'm running late here. I'll merge #54 and release it tomorrow!

jrochkind commented 6 years ago

Thanks @inukshuk! You actually already did merge #54! it's in master, it just needs a release, heh.

inukshuk commented 6 years ago

Ha, indeed! I actually pushed a release with this as well -- 1.1.10 should include #54 already.

For the other changes, I want to finish date ranges first and then release those together.

jrochkind commented 6 years ago

Oh, okay, great, thanks! The other stuff is added bonus, but #54 is all I need to put my feature in production.

For now, I'm formatting date ranges myself and adding them to a literal field in the citeproc-json date, which works fine for my specific needs. If anyone is curious: https://github.com/sciencehistory/chf-sufia/pull/1027

I'm going to close this ticket, thanks! And thanks for your advice and help @inukshuk!

jrochkind commented 6 years ago

I wrote up technical details on our citation implementation in a sufia app, using citeproc-ruby, if anyone is interested. https://bibwild.wordpress.com/2018/04/04/another-round-of-citation-features-in-a-sufia-app/