retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License
5.19k stars 283 forks

Math formatting lost on import #627

Closed jsdodge closed 7 years ago

jsdodge commented 7 years ago

I have a bib-file that I have edited by hand, with article titles that include the phrase "$t$-$J$ model". When I import the file into zotero using Better BibTeX, then export it to a new bib-file, this phrase is converted to "T-J model". Please help.

Example input:

@Article{Bonca:2012in,
  author  = {Bon{\v c}a, J and Mierzejewski, M and Vidmar, L},
  title   = {{Nonequilibrium Propagation and Decay of a Bound Pair in Driven $t$-$J$ Models}},
  journal = {Phys. Rev. Lett.},
  year    = {2012},
  volume  = {109},
  number  = {15},
  pages   = {156404},
  month   = oct,
  doi     = {10.1103/physrevlett.109.156404},
}

Example output:

@article{Bonca:2012in,
  title = {Nonequilibrium {{Propagation}} and {{Decay}} of a {{Bound Pair}} in {{Driven}} T-{{J Models}}},
  volume = {109},
  doi = {10.1103/physrevlett.109.156404},
  timestamp = {2017-01-14T19:06:14Z},
  number = {15},
  journal = {Phys. Rev. Lett.},
  author = {Bon{\v c}a, J and Mierzejewski, M and Vidmar, L},
  month = oct,
  year = {2012},
  pages = {156404}
}

jsdodge commented 7 years ago

Report ID: 44NEBR2W

retorquere commented 7 years ago

There are two issues at play here: one fairly simple, one more complicated.

Simple one first. The extra bracing ({{...}}) you have in your input could in principle be detected, but you wouldn't believe how complex it actually is to parse (or generate) the meaning of braces outside LaTeX -- the double-bracing you see in the output is the result of long discussions to find something that works unambiguously (so far...). If you're willing to do a one-time manual correction, you can change the title in Zotero to

<span class="nocase">Nonequilibrium Propagation and Decay of a Bound Pair in Driven $t$-$J$ Models</span>

This is supported Zotero formatting, and means "don't mess with the casing here", which BBT will dutifully apply by outer-bracing that whole piece and not fiddling with case anywhere.
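As a rough illustration of that behaviour, here is a minimal sketch. It is not BBT's actual code: the helper name `protect_nocase` is made up for this example, and the use of double braces is an assumption taken from the export shown earlier in the thread. It only handles casing; what happens to the `$...$` parts is a separate matter.

```python
import re

# Hypothetical helper, not part of BBT: turn Zotero's nocase markup into
# BibTeX brace protection so the casing inside the span is left alone.
NOCASE = re.compile(r'<span class="nocase">(.*?)</span>', re.DOTALL)


def protect_nocase(title: str) -> str:
    """Wrap the content of each nocase span in double braces."""
    return NOCASE.sub(lambda m: "{{" + m.group(1) + "}}", title)


if __name__ == "__main__":
    title = (
        '<span class="nocase">Nonequilibrium Propagation and Decay of a '
        "Bound Pair in Driven $t$-$J$ Models</span>"
    )
    print(protect_nocase(title))
```

Text outside a nocase span is returned unchanged by this sketch; a real exporter would still apply its usual title-casing rules there.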

The harder part is the math-mode stuff. Zotero doesn't have any facility to denote math characters so BBT tries to map the input as best it can to regular text. Math is a pain in Zotero any which way you use it -- there's no supported way to get it in, unless you fake it with unicode characters/nocase markup that sort of look like your equation -- in which case BBT will probably do the right thing. But that is not why we use LaTeX now is it?

If your concern is only the output, you can make BBT do that by forcing LaTeX mode in your references: change the title in Zotero to

Nonequilibrium propagation and decay of a bound pair in driven <pre>$t$-$J$</pre> models

But note that the <pre> syntax is BBT-only, and your references will look weird unless you export them with BBT.
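The pass-through idea can be sketched as follows. This is an illustration under assumptions, not BBT's implementation: the function names and the small table of escaped characters are invented for the example. Text inside `<pre>...</pre>` is copied verbatim, and everything else has LaTeX-special characters escaped.

```python
import re

# Hypothetical sketch, not BBT's code.
PRE = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)

# Assumed, deliberately incomplete set of characters to escape outside <pre>.
SPECIALS = {"$": r"\$", "&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_"}


def escape_text(text: str) -> str:
    """Escape LaTeX-special characters in ordinary (non-<pre>) text."""
    return "".join(SPECIALS.get(ch, ch) for ch in text)


def export_title(title: str) -> str:
    """Escape everything except the contents of <pre> blocks."""
    out = []
    pos = 0
    for m in PRE.finditer(title):
        out.append(escape_text(title[pos:m.start()]))  # text before the block
        out.append(m.group(1))                         # raw LaTeX, untouched
        pos = m.end()
    out.append(escape_text(title[pos:]))               # trailing text
    return "".join(out)


if __name__ == "__main__":
    print(export_title("driven <pre>$t$-$J$</pre> models"))
    print(export_title("driven $t$-$J$ models"))
```

Without the `<pre>` wrapper the dollar signs are escaped, which is consistent with the `\$t\$` seen in exports later in this thread.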

jsdodge commented 7 years ago

Thanks, that was quick!

I'll make those changes, but I'd also like to leave you with a feature request: would it be possible to convert all roman letters (but not numbers) that appear in math mode into italicized versions? This would fix the problem, assuming it wouldn't create new ones.

More examples: italicize "T" and "c" in "high-$T_c$"; italicize "x", but none of the numerals, in "YBa$_2$Cu$_3$O$_{6+x}$".

retorquere commented 7 years ago

Urgh.. I thought I had it, but then the blessed test suite found this:

journal = {Actes du $4^{\textrm{ème}}$ Congrès Français d'Acoustique},

which is translated to this under the changes I tried to make for this issue:

Actes du 4<sup>è<i>me</i></sup> Congrès Français d'Acoustique

which is a) found in the wild, because that's how I got it, and b) clearly not desirable.

If you can think of a rule that gets your case right but doesn't mess with this, I'm open. I currently have tests running on a change that will only italicize $<one or more roman numerals>$, which would work on my current test set (I think, we'll know soon), but it smells a little ad-hoc-ish.

retorquere commented 7 years ago

Wow that latter change didn't work at all.

retorquere commented 7 years ago

OK, I have something that passes my tests at https://github.com/retorquere/zotero-better-bibtex/releases/download/builds/zotero-better-bibtex-1.6.91-br627-3454.xpi. You could give that a spin, but I'm not entirely committed to this solution yet.

retorquere commented 7 years ago

I can make a case to myself to do single-letter math mode I think... so $j$ would italicise, but $Jt$ not. Math-mode is abused in a whole lot of ways for formatting workarounds, I can't assume that every character is italicised, especially since the math-mode parsing is exceedingly dumb and doesn't understand that stuff like \textrm{...} would have to non-italicise etc.
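A minimal sketch of that single-letter rule, assuming math spans are delimited by plain `$...$` and that italics are represented with `<i>` tags as in the test output above. The function name is hypothetical and this is not BBT's parser.

```python
import re

# Hypothetical sketch of the "single-letter math mode" rule.
# A math span is taken to be anything between two dollar signs.
MATH = re.compile(r"\$([^$]*)\$")


def italicise_single_letters(text: str) -> str:
    """Italicise $x$ when x is a single ASCII letter; leave other math alone."""

    def repl(m: re.Match) -> str:
        body = m.group(1)
        if len(body) == 1 and body.isascii() and body.isalpha():
            return f"<i>{body}</i>"
        return m.group(0)  # multi-character math is passed through unchanged

    return MATH.sub(repl, text)


if __name__ == "__main__":
    print(italicise_single_letters("Driven $t$-$J$ Models"))
    print(italicise_single_letters(r"Actes du $4^{\textrm{ème}}$ Congrès"))
```

Matching whole `$...$` spans first, rather than searching for `$x$` directly, avoids mistaking the text between two adjacent math spans for a single-letter span. The `\textrm` example is left untouched because its math body is longer than one letter.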

jsdodge commented 7 years ago

Sorry for the long delay--I needed to move on from the reference DB to work on the paper that refers to it! I gather you moved on, too, since the link you posted last week doesn't work any longer, but I'd be happy to try it now if you're still interested in feedback.

FWIW, I'm finding that titles with markup are poorly done in most metadata, so I'm content to change the titles in Zotero by hand when necessary so that they are exported correctly in BibTeX. It may be enough for me to know the conversion rules so I can look up the appropriate markup when I need to.

Thanks for your prompt attention earlier, and apologies again for not reciprocating.

retorquere commented 7 years ago

I'm still not entirely happy with doing too much with math mode because it's abused too often to achieve non-math stuff in LaTeX, and I'd have to start recognizing where in math mode LaTeX would drop out of math mode, which is kind of hard to do without actually running it through LaTeX. Such are the perils of treating as markup what is actually a programming language. So if you're OK with post-import fixups, that would be my preference here. Doing math mode sort of properly would require a major reworking of the parser/generator I have.

The link is indeed gone; test builds are auto-deleted after a week because I would prefer they don't linger around. I can build a fresh one, but in general, either something gets fixed & folded into a new release, or I discontinue that line.

jsdodge commented 7 years ago

I can live with post-import fixups, and would appreciate some guidance on how to arrange my preferences to get most of the import right in the first place. For example, consider the following: https://doi.org/10.1103/PhysRevB.83.214515

Using 'Save to Zotero using "APS"', the title listed in Zotero is formatted according to APS recommendation (see the field of the page source):

Photoinduced melting of superconductivity in the high-${T}_{c}$ superconductor La${}_{2\ensuremath{-}x}$Sr${}_{x}$CuO${}_{4}$ probed by time-resolved optical and terahertz techniques

The title listed in the bib-file exported by Better-BibTeX is:

Photoinduced Melting of Superconductivity in the High-\${\vphantom}{{T}}\vphantom{}_{c}\$ Superconductor {{La}}\${}_{2$\backslash$ensuremath{-}x}\${{Sr}}\${}_{x}\${{CuO}}\${}_{4}\$ Probed by Time-Resolved Optical and Terahertz Techniques

You mentioned that I can surround the imported title by "<span class="nocase">...</span>", but is it possible to simply turn off the automatic export conversion for all titles? That would be my preferred solution. Unicode-LaTeX conversion in names works fine, so I'd prefer to leave that alone.

retorquere commented 7 years ago

OK, so to see if I understand this correctly:

1. You browsed to http://journals.aps.org/prb/abstract/10.1103/PhysRevB.83.214515, and Zotero imported it using the "APS" importer (which is the default)
2. You ended up with the title

Photoinduced melting of superconductivity in the high-${T}_{c}$ superconductor La${}_{2\ensuremath{-}x}$Sr${}_{x}$CuO${}_{4}$ probed by time-resolved optical and terahertz techniques

in Zotero (as did I when I did this)

To start with, it's a little odd that Zotero/APS did this, because semantically what is now in the Zotero reference title simply means something very different from what was offered for import, as Zotero has no math support (and certainly would support mathml rather than LaTeX even if it did). If you render this to a bibliography using Zotero, you get this output:

Beyer, M., Städter, D., Beck, M., Schäfer, H., Kabanov, V. V., Logvenov, G., … Demsar, J. (2011). Photoinduced melting of superconductivity in the high-${T}_{c}$ superconductor La${}_{2\ensuremath{-}x}$Sr${}_{x}$CuO${}_{4}$ probed by time-resolved optical and terahertz techniques. Physical Review B, 83(21), 214515. https://doi.org/10.1103/PhysRevB.83.214515

which, I'm pretty sure, is not what you want. Since I have no control over the APS importer, I cannot cause it to import to something that would render somewhat correctly. I could try to detect LaTeX-ish code in existing references, but this is super tricky because e.g. a dollar sign in a title might trigger math mode for people that have no math needs (I dunno, economists?). I'm not super enthusiastic about the idea of inferring LaTeX mode from the content without further context. Which brings me to BBT's hardcore mode (https://github.com/retorquere/zotero-better-bibtex/wiki/Unnecessarily-complicated-BibTeX-output%3F#you-are-a-hardcore-latex-user).

You can provide context and things will just work. If you change the title to

Photoinduced melting of superconductivity in the high-<pre>${T}_{c}$</pre> superconductor <pre>La${}_{2\ensuremath{-}x}$Sr${}_{x}$CuO${}_{4}$</pre> probed by time-resolved optical and terahertz techniques

BBT will pass through the bits between <pre>...</pre> entirely unchanged. The problem here is that <pre> is a BBT-only concept and might confuse Zotero's citation processor. Currently, Zotero would render it to

Beyer, M., Städter, D., Beck, M., Schäfer, H., Kabanov, V. V., Logvenov, G., … Demsar, J. (2011). Photoinduced melting of superconductivity in the high-<pre>${T}_{c}$</pre> superconductor <pre>La${}_{2\ensuremath{-}x}$Sr${}_{x}$CuO${}_{4}$</pre> probed by time-resolved optical and terahertz techniques. Physical Review B, 83(21), 214515. https://doi.org/10.1103/PhysRevB.83.214515

if you were to pull this reference into word -- but then it's not much worse than what Zotero already does with the reference as it was imported. This reference will never look good in Word because of this.

If your reference fields just happen all to be LaTeX-safe, you can tag a reference with #LaTeX (case sensitive) and BBT will behave as if most of your fields are wrapped in <pre> tags (except the authors -- the authors will always get special treatment). This works on a per-reference basis.

jsdodge commented 7 years ago

I think I'm going to be happier sticking with titles in LaTeX format. I almost never need MS Word/OpenOffice capability.

The help on BBT's hardcore mode says, "If you enable 'Raw BibTeX import' in the preferences, BibTeX imports will not be escaped on import, and will automatically be tagged for raw export." Could you direct me to this setting? I have selected "Retain LaTeX markup on BibTeX import" in the Advanced tab of the Better BibTeX preferences, but it doesn't automatically add the "#LaTeX" tag to references pulled in from Firefox. I don't see anything labelled "Retain LaTeX...".

Basically, I'd like to add this tag as the default for all of my references.

Thanks again for your attention, and I do appreciate the functionality that Better BibTeX provides.

retorquere commented 7 years ago

That "On import" setting only works for those references where I control the import -- that is, where BibTeX references are imported using BBT. It's (apparently, I haven't tried) possible to mass-tag (https://www.zotero.org/support/collections_and_tags) references (search for "To assign a tag to multiple items at once" on that page), but I don't have a feature where it will always apply to all references regardless of tags.

jsdodge commented 7 years ago

Thanks- I used the drag-and-drop method you pointed me to to mass-tag the references with "#LaTeX", and the LaTeX markup in titles imported from the APS server is preserved.

The only problem now is that accented characters in author names are no longer converted to LaTeX markup. If that's just the way it is, that's fine, I can deal with it. But you said that authors would always get "special treatment", which I understood to mean that they'd be converted to LaTeX markup regardless of the tag. So, if there's an automatic solution that I'm missing, please let me know. Below I return to my original example.

With "#LaTeX" tag, the title is fine but the first author name is not converted:

@article{Bonca2012,
  title = {Nonequilibrium Propagation and Decay of a Bound Pair in Driven $t$-$J$ Models},
  volume = {109},
  doi = {10.1103/physrevlett.109.156404},
  timestamp = {2017-01-26T20:10:01Z},
  number = {15},
  journal = {Phys. Rev. Lett.},
  author = {Bonča, J and Mierzejewski, M and Vidmar, L},
  month = oct,
  year = {2012},
  pages = {156404}
}

Without "#LaTeX" tag, the title is converted incorrectly but the first author name is converted correctly:

@article{Bonca2012,
  title = {Nonequilibrium {{Propagation}} and {{Decay}} of a {{Bound Pair}} in {{Driven}} \$t\$-\${{J}}\$ {{Models}}},
  volume = {109},
  doi = {10.1103/physrevlett.109.156404},
  timestamp = {2017-01-26T20:10:01Z},
  number = {15},
  journal = {Phys. Rev. Lett.},
  author = {Bon{\v c}a, J and Mierzejewski, M and Vidmar, L},
  month = oct,
  year = {2012},
  pages = {156404}
}

retorquere commented 7 years ago

The special handling is about structure parsing (like particles and such), not character translation. Dates also get special treatment in that I try to format into a recognizable date for bib(la)tex. With the #LaTeX tag, no character translation is done as the assumption is that what you have is valid LaTeX.

jsdodge commented 7 years ago

OK, thanks for the help.

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.