inukshuk / jekyll-scholar

jekyll extensions for the blogging scholar
MIT License

Bug with Jekyll v4 and parsing encapsulated BibTeX entries #286

Closed kellertuer closed 4 years ago

kellertuer commented 4 years ago

I just noticed a bug where I am not sure whether that's due to Jekyll or the parsing done in Jekyll-Scholar.

Imagine you have an entry like

@article{ABC,
author = {Arrr, Pirate and BC, Anton},
title = {{R}iemannian insights}
}

Here we want to protect the capitalisation of the first letter of the title (or of several letters: the first letter might always be capitalised anyway, but what if a title starts with an abbreviation in capitals?).

Generating the bibliography on a page then yields

Liquid Exception: Liquid syntax error (line 18): Variable '{{R}' was not properly terminated with regexp: /\}\}/ in pages/publications.md
                    ------------------------------------------------
      Jekyll 4.0.0   Please append `--trace` to the `build` command 
                     for any additional information or backtrace. 
                    ------------------------------------------------

Could one, for example, convert the title into something without {} before printing it?
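A minimal sketch of what "something without {}" could look like, assuming the field value is available as a plain string. The helper name strip_protective_braces is hypothetical and not part of jekyll-scholar; it also removes braces that belong to LaTeX commands, so it is only suitable for simple titles.

```ruby
# Hypothetical helper (not part of jekyll-scholar): drop BibTeX protective
# braces from a field value so that no "{{" sequence is left for Liquid.
# Caveat: this also removes braces that are arguments of LaTeX commands.
def strip_protective_braces(value)
  value.delete('{}')
end

puts strip_protective_braces('{R}iemannian insights')
puts strip_protective_braces('{{ABBR} is something}')
```

The capitalisation itself survives because only the brace characters are removed; the letters are left untouched.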

inukshuk commented 4 years ago

Odd, this has been reported a few times with Jekyll 4, but we added a test case to ensure this works and, as far as I can tell, the issue somehow resolved itself for those reporting it.

For context, see the recent posts at #242

kellertuer commented 4 years ago

Thanks for the fast response. In #242 it seems to have been resolved by exporting the entries again. However, my entries are handcrafted and really are meant to be {{ABBR} is something}, where I want the abbreviation ABBR to stay capitalised. And yes, as you said over there, it only seems to occur on pages where I use cite and bibliography. The error is indeed that the (unchanged) title gets interpreted as Liquid and then, of course, it's not a proper {{ ... }} expression. Surprisingly, that did not happen before I updated.

inukshuk commented 4 years ago

What I said in the other thread is that this should work fine if you use the cite or bibliography tags. It's only if you feed the bib file directly into jekyll that double braces would be a problem.

The test case I linked to above has plenty of double braces and it works just fine.

kellertuer commented 4 years ago

OK, then I have to check, because I never intended to feed the bib file directly into Jekyll, nor was I aware that I am doing that.

michele-segata commented 4 years ago

I have a problem that I believe is related to this (if not, I apologize). After upgrading from jekyll 3.8.3 to 4.0.0 and from jekyll-scholar 5.14.0 to 6.4.0 (I am using ruby 2.5.1), I saw that the use_raw_bibtex_entry config has been removed. The comment of 985466e says that "Jekyll 4 does not render Liquid markup from included BibTeX files any longer." This doesn't seem to be true for me: all my entries have protected capitalization on the title field (e.g., title = {{Title of Article}},), and when I use {{entry.bibtex}} I get title = , in the page.

@inukshuk I don't understand what you mean when you say "It's only if you feed the bib file directly into jekyll that double braces would be a problem."

I don't want to change the format of the files because the .bib file is automatically fetched from a database, and I want people to download my bibtex with the double braces, otherwise they're going to get the title wrong in their bibliographies.
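One general way to keep a Liquid pass from interpreting double braces without touching the .bib file is Liquid's raw tag. The sketch below only builds the wrapped string; the helper name liquid_raw is hypothetical, and whether jekyll-scholar applies anything like this internally is not stated in this thread.

```ruby
# Hypothetical helper: wrap text in Liquid's raw tag so that a later Liquid
# pass emits it verbatim instead of treating "{{ ... }}" as a variable.
# Caveat: text that itself contains "{% endraw %}" would end the block early.
def liquid_raw(text)
  "{% raw %}#{text}{% endraw %}"
end

puts liquid_raw('title = {{Title of Article}},')
```

The downloaded BibTeX keeps its double braces, since the raw tags are consumed by Liquid and do not appear in the rendered page.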

inukshuk commented 4 years ago

There is one situation that I'm aware of where there's no simple way to avoid liquid parsing double braces in a BibTeX file: the one that I've been describing as 'feeding the file directly into jekyll'. Perhaps it's easiest to explain this scenario with an example.

In this example, we have a file 'references.bib' sitting in the project root folder. When you run jekyll, it treats this as a page, sees the '.bib' extension and runs it through one of the converters provided by jekyll-scholar. This results in a file 'references.html'. All the entries in the file will be converted, and all the 'comments' (BibTeX treats everything outside of entries as comments!) make up the rest of the page. This was my original use case for jekyll-scholar, but I doubt many people actually use it this way. In any case, in such a scenario, I think that liquid parses the file for tags before the converter kicks in, so you can't really use double braces here.

That said, if you're using jekyll-scholar as in this example, double braces should not be a problem. Here we have a file '_bibliography/references.bib' with lots of double braces and a page 'test.md' which uses the bibliography tag to print the bibliography. Note that the BibTeX file is not exposed to jekyll or liquid directly: it's loaded by the bibliography tag.

Your case is using a custom template. To be sure that this works I added another example to our tests.

Having written all this, I've realized your error only occurs when using entry.bibtex and, it's true, in this case the liquid tag seems to be interpreted one more time. @stevecheckoway I think this solves the mystery why we introduced the option in the first place.

Instead of bringing back the option, I added a raw_bibtex to the entry. This way you can choose which one to use in the template. Here is an example of this feature in use. I'll push this to RubyGems in a moment.

inukshuk commented 4 years ago

@michele-segata entry.raw_bibtex should be available in version 6.5.0. Thanks for helping clear this up!

stevecheckoway commented 4 years ago

@inukshuk, just to double check, the issue is that using entry.bibtex inside a template causes liquid to be parsed a second time?

I can't recall off-hand how bibliography_template works, but I'd assume that template would be rendered once with the appropriate entry and then the result would be inserted into the output, but not subject to another liquid expansion. But maybe that's wrong.

In any case, I think the example should test that raw doesn't appear in the output just to make sure this behavior doesn't change.

It also occurs to me that the warning message for the use_raw_bibtex key should be changed to inform the user about the new entry.raw_bibtex, if you didn't do that already.

michele-segata commented 4 years ago

@inukshuk I confirm that this is now fixed. Thank you for fixing this!

inukshuk commented 4 years ago

@stevecheckoway I checked again and we actually did render the template twice. This was almost certainly in order to simulate the previous behavior of jekyll where this also happened. Remember, we had those tests that verified e.g., that liquid filters inside BibTeX worked? Since we decided to drop this highly questionable feature anyway, I just removed the double rendering and then everything works as expected.
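The effect of rendering twice can be illustrated with a toy substitution function. This is a stand-in written for this note, not Liquid and not jekyll-scholar's code; it assumes that unknown variables render as an empty string, which would explain the title = , output reported above.

```ruby
# Toy stand-in for a template engine (NOT Liquid): replaces {{ name }} with
# vars[name], or with an empty string when the name is unknown.
def toy_render(template, vars)
  template.gsub(/\{\{\s*(.*?)\s*\}\}/) { vars.fetch(Regexp.last_match(1), '') }
end

vars     = { 'entry.bibtex' => 'title = {{Title of Article}},' }
template = '<pre>{{ entry.bibtex }}</pre>'

once  = toy_render(template, vars)  # braces from the BibTeX data survive
twice = toy_render(once, vars)      # second pass treats them as a variable

puts once
puts twice
```

After one pass the double braces from the data are still present; the second pass finds them, looks up a variable that does not exist, and the title disappears.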

@michele-segata sorry about the noise, but this means we'll remove entry.raw_bibtex again in 6.5.1 before anyone gets a chance to use it. entry.bibtex should work there as it did before.

michele-segata commented 4 years ago

Not a big deal. The important thing is that this works now. Thanks again.

michele-segata commented 4 years ago

I have to say that the commit unfortunately broke something else. The double parsing was actually very useful. I was exploiting it, but didn't remember. Basically, I have a personalized style.csl file in which I have some liquid tags. For example, in my .csl I have something like:

  <macro name="author">
    <names variable="author" prefix="{{ site.author.prefix }}" suffix="{{ site.author.suffix }}">
      <name and="text" initialize-with=". "/>
      <label form="short" prefix=", " text-case="capitalize-first"/>
      <substitute>
        <names variable="editor"/>
        <names variable="translator"/>
      </substitute>
    </names>
  </macro>

Then, inside my _config.yml I define those variables as

author:
  prefix: "<span class=\"author\">"
  suffix: "</span>"

so that when I use {{ reference }} in my template I get

<span class="author">M. Segata</span>

Then I can personalize each field of the reference using my CSS file. Without the double parsing, this doesn't work anymore and I get

{{ site.author.prefix }}M. Segata{{ site.author.suffix }}

inside the webpage. I could fix this by putting

    <names variable="author" prefix="<span class=\"author\">" suffix="</span>">

inside my style file, but apart from the fact that hardcoding this is not elegant (some attributes, like the title, appear multiple times inside the style file), it doesn't even work. Jekyll fails with the following error

Liquid Exception: not allowed to add #<CSL::Style::Name and="text" initialize-with=". " children=[0]> to #<struct CSL::Style::Children :"style-options"=nil, info=#<CSL::Info children=[8]>, locale=[#<CSL::Locale en-US>], macro=[#<CSL::Style::Macro name="edition" children=[1]>, #<CSL::Style::Macro name="issued" children=[1]>, #<CSL::Style::Macro name="author" children=[2]>], citation=nil, bibliography=nil> in /path/to/my/site/publications.md

That said, if you have a better way to achieve what I want, I can change my solution according to your suggestion. If not, I would suggest re-introducing the double parsing and the entry.raw_bibtex variable in 6.5.2.

stevecheckoway commented 4 years ago

Double parsing is going to break everyone else.

I can't remember the exact XML rules, but doesn't

<names variable="author" prefix="<span class=\"author\">" suffix="</span>">

need to be

<names variable="author" prefix="&lt;span class=&quot;author&quot;&gt;" suffix="&lt;/span&gt;">

I guess an opt-in option to double parse along with a raw_bibtex could work, but I'm not convinced this is the best option. The CSL plus appropriate CSS seems better to me.

michele-segata commented 4 years ago

If I use things like &lt; inside the CSL then it is "copy-pasted" as is inside the generated HTML, which shows you the HTML code in the webpage. So it doesn't work.

stevecheckoway commented 4 years ago

That sounds like a bug in whatever is parsing the XML.

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> doc = Nokogiri::XML.parse('<names variable="author" prefix="&lt;span class=&quot;author&quot;&gt;" suffix="&lt;/span&gt;">')
irb(main):003:0> doc.at('/names/@prefix').value
=> "<span class=\"author\">"

inukshuk commented 4 years ago

This happens in citeproc-ruby's HTML formatter. I'm not sure right now whether this is mandated by the CSL spec or not. We could either fix this or add an option to turn it off.

stevecheckoway commented 4 years ago

https://citationstyles.org/authors/ says it's XML in a bunch of places so I assume it should follow the XML spec. I'm not sure if there's an actual CSL standard though.

Edit: I spoke too soon. Here it is. http://docs.citationstyles.org/en/1.0.1/specification.html

inukshuk commented 4 years ago

http://docs.citationstyles.org/en/1.0.1/specification.html

The style is XML, but this happens when formatting the reference. The style is rendered using the HTML formatter, which takes the prefix/suffix content and prints it. But since the formatter generates HTML, it specifically escapes the known entities again.
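The round trip described here can be sketched with Ruby's CGI helpers standing in for both steps. This is an illustration only: the function names are hypothetical, and the actual XML parser and citeproc-ruby formatter are not invoked.

```ruby
require 'cgi'

# Stand-in for the XML parser: decodes entities in an attribute value.
def decode_attribute(raw_attribute)
  CGI.unescapeHTML(raw_attribute)
end

# Stand-in for an HTML formatter that escapes affixes before printing them.
def print_affix_escaped(affix)
  CGI.escapeHTML(affix)
end

in_style = '&lt;span class=&quot;author&quot;&gt;'
decoded  = decode_attribute(in_style)
emitted  = print_affix_escaped(decoded)

puts decoded
puts emitted
```

The emitted string is the entity-encoded form again, so a browser displays the tag as text instead of interpreting it as markup.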

inukshuk commented 4 years ago

This happens here. We could also ship our own version of the HTML formatter with jekyll-scholar that does not do that.

stevecheckoway commented 4 years ago

Ah, I see what you mean. Hmm. That's tricky. My guess is producing valid HTML in all cases is the right thing to do which means escaping entities.

I'm not sure what the right way to support this use case is. Maybe run liquid with the site variables over the CSL if requested?
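A rough sketch of that idea, using a plain regular expression in place of Liquid. The helper expand_site_placeholders is hypothetical and no such option exists in jekyll-scholar as far as this thread shows. Values are XML-escaped so the style stays well-formed, which means the formatter's re-escaping discussed above would still apply to the result.

```ruby
require 'cgi'

# Hypothetical helper: replace {{ site.some.key }} placeholders in CSL source
# text with values from a nested hash, XML-escaping each value so that the
# attribute it lands in remains well-formed. Unknown keys become "".
def expand_site_placeholders(csl_source, site)
  csl_source.gsub(/\{\{\s*site\.([\w.]+)\s*\}\}/) do
    keys  = Regexp.last_match(1).split('.')
    value = keys.reduce(site) { |node, key| node.is_a?(Hash) ? node[key] : nil }
    CGI.escapeHTML(value.to_s)
  end
end

site = { 'author' => { 'prefix' => '<span class="author">', 'suffix' => '</span>' } }
csl  = '<names variable="author" prefix="{{ site.author.prefix }}" suffix="{{ site.author.suffix }}">'

puts expand_site_placeholders(csl, site)
```

Expanding the placeholders before the style is parsed keeps the site variables in _config.yml as the single place where the markup is defined.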

inukshuk commented 4 years ago

I think this would be a simple solution:

require 'cgi'
require 'citeproc/ruby'

class CiteProc::Ruby::Formats::Html
  def prefix
    CGI.unescape_html options[:prefix]
  end

  def suffix
    CGI.unescape_html options[:suffix]
  end
end

If you require this file before or after jekyll-scholar, all prefix/suffix attributes in your CSL styles should be unescaped.

@michele-segata want to give this a try?

michele-segata commented 4 years ago

@inukshuk yep, I can confirm this works fine.