I'd like to take a stab at a bib(la)tex plugin. I have an existing parser which works well, but I wouldn't half mind sharing the work of upkeep 😄. Where can I find information to get me started? Is this a sensible thing to do with unified-latex?
@retorquere That sounds great :-). It's my understanding that bibtex is more-or-less a completely independent language from latex? Can bibtex references be embedded in a tex document, or must they always be in a separate file?
To proceed, you first need to answer the question: can bibtex be embedded in a tex document? If the answer is yes, you want to look at packages like unified-latex-util-pgfkeys, which take already-parsed source and parse it further.
You can add a unified-latex-util-bibtex or unified-latex-bibtex folder (depending on whether it is standalone or will be called by other unified-latex libs) in the packages directory and follow the structure of the existing ones.
Let me know what other guidance you need!
It's sort of its own language but also not -- it parses in the playground, but it has different parsing rules for different fields: some are verbatim, most are regular, and creators have special rules, so I'm not entirely sure how to answer your question.
Verbatim fields are the only ones that would really be trouble. Currently verbatim fields have to be handled in the latex.peggy grammar itself, and I like to avoid special rules, if possible, in that grammar.
Not being able to parse verbatim fields is a showstopper though. It's not a bib(la)tex parser if it doesn't.
Does the parser have any error recovery, by the way? My own parser will fail individual entries if they're broken but will parse the rest; a lot of bibtex found in the wild is invalid, because the actual bibtex parser is ridiculously tolerant. I couldn't get that to work in pegjs though, so my own is a three-phase parser: chunking into entries, building an AST, and reparsing the AST, since some resolution can only be done by knowing the enclosing context.
I suppose I could take my chunker, extend it to read fields, and then pass the field-content parsing to this parser. Is it possible to tell this parser to start in verbatim mode?
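For reference, the chunking phase mentioned above looks roughly like this. This is a minimal sketch of my own; the splitEntries name and the line-anchored "@" heuristic are illustrative, not unified-latex API:

// Split a .bib source into per-entry strings at lines that start a new
// "@entry{", so one broken entry doesn't take down the rest.
function splitEntries(bib: string): string[] {
    const entries: string[] = [];
    let start = -1;
    let offset = 0;
    for (const line of bib.split("\n")) {
        if (/^\s*@[a-zA-Z]+\s*[({]/.test(line)) {
            if (start >= 0) entries.push(bib.slice(start, offset));
            start = offset;
        }
        offset += line.length + 1; // +1 for the "\n" removed by split()
    }
    if (start >= 0) entries.push(bib.slice(start));
    return entries;
}

Each chunk can then be parsed on its own, so a stray brace only loses that one entry.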
The unified-latex parser is designed to never error. It will keep things as strings if it doesn't know what to do (e.g., mismatched braces).
That's useful, but the thing is that there are ways to know you've entered a new bibtex entry even if you are in an error state (or plain-string-returning state, if you will), so it'd be useful to resume non-string-returning parsing of the AST at that point. That's what I mean by error recovery. I don't want to lose all entries just because of one stray brace in the first entry.
Is there a way to embed bibtex in latex, or is it always in a separate file?
Also, could you include here some pathological examples of bibtex where verbatim parsing is required?
Is there a way to embed bibtex in latex, or is it always in a separate file?
Sort of - you can have it in a filecontents environment.
Also, could you include here some pathological examples of bibtex where verbatim parsing is required?
I find "pathological" a rather strange choice of words for "documented bibtex/biblatex behavior", but a url field is verbatim, among others.
\documentclass{article}
\usepackage[style=apa, backend=biber]{biblatex}
\usepackage{url}
\begin{filecontents}{\jobname.bib}
@online{example,
author = {Author, A.},
year = {2022},
title = {The Title of the Webpage},
url = {https://www.example.com/users/~someone},
note = {Accessed January 17, 2024},
}
\end{filecontents}
\addbibresource{\jobname.bib}
\begin{document}
\cite{example}
\printbibliography{}
\end{document}
Sort of - you can have it in a filecontents environment.
Yes, you could do that of course, but I don't think anyone would expect unified-latex to guess your intention when parsing the content of the filecontents environment.
So, that says to me that a biblatex parser can be independent of the latex parser, since for each file, someone will decide whether to parse latex or biblatex.
Sure, but within the fields, it's just latex. But I take what you're saying is that it's not worth the effort to make a bibtex plugin?
Yes, you could do that of course, but I don't think anyone would expect unified-latex to guess your intention when parsing the content of the filecontents environment.
Oh yeah agreed.
Sure, but within the fields, it's just latex. But I take what you're saying is that it's not worth the effort to make a bibtex plugin?
I thought you said the fields needed to be verbatim?
I thought you said the fields needed to be verbatim?
Only some fields; most fields are plain LaTeX, some with additional interpretation rules on top of that for a smaller subset of fields (lists and name lists). Of the sample I posted, only the url field is verbatim, but if that field is not parsed as verbatim, then the parser is not a proper bibtex parser, is what I was trying to say. If all fields were verbatim, a parser would be trivial.
The well-known/documented verbatim fields are doi, eprint, file, files, pdf, ids, url, and verba, verbb, verbc, ...; jabref adds groups, and mendeley exports file as non-verbatim. Preferably, which fields are to be parsed verbatim would be configurable, because the variability of in-the-wild bibtex is pretty large.
Is something like verba={\}, url={http://foo.com/{/baz} valid bibtex?
No, because it has unbalanced braces. Verbatim fields must still have balanced braces. This is different from latex verbatim environments, which scan for the literal input \end{verbatim}, but similar to \verb|...|, which can't have | in its argument.
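That means a verbatim field value can be read with a plain balanced-brace scan. A sketch (my own helper, not part of unified-latex):

// Read a brace-delimited verbatim field value starting at the opening "{".
// Returns the raw content and the index just past the closing "}", or
// null when the braces never balance (an invalid entry).
function readVerbatimField(src: string, open: number): { value: string; end: number } | null {
    let depth = 0;
    for (let i = open; i < src.length; i++) {
        if (src[i] === "{") depth++;
        else if (src[i] === "}" && --depth === 0) {
            return { value: src.slice(open + 1, i), end: i + 1 };
        }
    }
    return null;
}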
I can reliably distinguish fields though, so I could have a bibtex parser that just passes the field contents to unified-latex; that would also take care of error recovery. Given how trivial verbatim-mode fields are to parse, I could just pass only non-verbatim fields to unified-latex. Would that still be a plugin scenario, or would I just use unified-latex as a library? I would still have to post-process the latex token stream.
I think you use unified-latex as a library, given that a file won't be latex with embedded bibtex; the caller will have to indicate that they want bibtex parsed specifically. You can have the output format match unified-latex's AST. Here would be my proposal:
1. Make a unified-latex-bibtex package that has a parseBibtex function whose options are a superset of the parse options from unified-latex-util-parse.
2. Internally, call the parse function, passing down the options as appropriate.
3. Output the same AST that unified-latex uses. For example, you can use the escapeToken field on a macro node to mimic a bibtex entry. E.g.
{
  type: "macro",
  escapeToken: "@",
  content: "online",
  args: [
    {
      type: "argument",
      openBrace: "{",
      closeBrace: "}",
      content: [...]
    }
  ]
}
The content should have = and , as individual strings and {...} as groups. If the content is verbatim, just put a single {type: "string", ...} with the content in there (it won't be mangled further). The other groups can contain the re-parsed data. If you set _renderData: {pgfkeysArgs: true} on the root macro, then you'll get nice formatting for free :-)
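Under that proposal, a field like author = {Author, A.}, would presumably be encoded inside the entry's argument roughly as follows; the exact node shapes here are my guess at the scheme described above:

// Guessed encoding of `author = {Author, A.},` inside the entry macro's
// argument: "=" and "," as bare strings, the value as a group of
// re-parsed latex.
const fieldContent = [
    { type: "string", content: "author" },
    { type: "string", content: "=" },
    { type: "group", content: [/* re-parsed latex of "Author, A." */] },
    { type: "string", content: "," },
];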
I don't intend to render it, my parser converts bibtex to unicode for Zotero imports.
Where does unified-latex determine how many arguments a macro takes?
For pre-defined macros, an argSpec is specified in the xparse syntax. But a macro, after it's parsed, doesn't care about how many arguments it has.
I don't know how to interpret that 2nd sentence.
I see that a\c cb parses into
{
  "type": "root",
  "content": [
    {
      "type": "string",
      "content": "a"
    },
    {
      "type": "macro",
      "content": "c"
    },
    {
      "type": "whitespace"
    },
    {
      "type": "string",
      "content": "cb"
    }
  ]
}
where I had expected something along the lines of
{
  "type": "root",
  "content": [
    {
      "type": "string",
      "content": "a"
    },
    {
      "type": "macro",
      "content": "c",
      "args": [
        {
          "type": "argument",
          "content": [
            {
              "type": "string",
              "content": "c"
            }
          ]
        }
      ]
    },
    {
      "type": "string",
      "content": "b"
    }
  ]
}
You must tell the parser that \c has arguments by passing macros: {c: {signature: "m"}} to the parser.
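Spelled out with the unifiedLatexFromString plugin that appears later in this thread, that looks like the following sketch:

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";

// With the signature supplied, \c consumes one mandatory argument, so
// "a\c cb" parses as the string "a", the macro \c{c}, and the string "b".
const parser = unified().use(unifiedLatexFromString, {
    macros: { c: { signature: "m" } },
});
console.log(parser.parse("a\\c cb"));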
How can I tell the parser that a macro treats its parameter as verbatim? E.g. \url{\x} should parse to
{ "type": "root", "content": [
{ "type": "macro", "content": "url", "args": [
{ "type": "argument", "content": [ { "type": "string", "content": "\\x" } ], "openMark": "{", "closeMark": "}" }
]}
]}
but in the playground it parses to
{ "type": "root", "content": [
{ "type": "macro", "content": "url", "args": [
{ "type": "argument", "content": [ { "type": "macro", "content": "x" } ], "openMark": "{", "closeMark": "}" }
]}
]}
Unfortunately that cannot be done :-(. Verbatim macros must be added to the PEG grammar itself. You can, however, post-process the content and call printRaw to get back what should be the original contents, provided there's nothing really strange in the URL.
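For example, something along these lines; visit and printRaw are real unified-latex utilities, but the policy of collapsing every \url argument back to one string is just a sketch:

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";
import { visit } from "@unified-latex/unified-latex-util-visit";
import { printRaw } from "@unified-latex/unified-latex-util-print-raw";

const parser = unified().use(unifiedLatexFromString, {
    macros: { url: { signature: "m" } },
});
const ast = parser.parse("\\url{\\x}");

// Collapse each \url argument back into a single string node holding
// (approximately) the original source text.
visit(ast, (node) => {
    if (node.type === "macro" && node.content === "url" && node.args) {
        for (const arg of node.args) {
            arg.content = [{ type: "string", content: printRaw(arg.content) }];
        }
    }
});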
I wouldn't object to adding \url to the grammar. That's what we had to do for the various listings packages. You could copy the approach used by the listings macros in https://github.com/siefkenj/unified-latex/blob/4811a0b2008d67c4bad5fd53700f86f4f5202868/packages/unified-latex-util-pegjs/grammars/latex.pegjs#L193
Same would go for the first argument of the href macro.
Same would go for the first argument of the href macro.
Yes, I think so.
I wouldn't object to adding \url to the grammar. That's what we had to do for the various listings packages. You could copy the approach used by the listings macros in
When you say "you could", do you mean I could prepare a PR?
You can, however, post-process the content and call printRaw
I don't know where I would be doing this.
I wouldn't object to adding \url to the grammar. That's what we had to do for the various listings packages. You could copy the approach used by the listings macros in
When you say "you could", do you mean I could prepare a PR?
Yep :-)
~ parses to { "type": "root", "content": [ { "type": "string", "content": "~" } ] }, but really ~ is more like a macro.
~ parses to { "type": "root", "content": [ { "type": "string", "content": "~" } ] }, but really ~ is more like a macro.
What do you mean?
~ isn't the string ~, it's a non-breaking space.
Sometimes. It depends on the context :-). If the user wishes to do a replaceNode call to turn all {type: "string", content: "~"} into macros, they can do that. For basic parsing, it is just treated like punctuation.
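A sketch of that replaceNode call, here substituting a literal non-breaking space rather than a macro node; it assumes that returning undefined from the replacer leaves a node unchanged:

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";
import { replaceNode } from "@unified-latex/unified-latex-util-replace";

const ast = unified().use(unifiedLatexFromString).parse("a~b");

// Swap every bare "~" string for a literal non-breaking space.
replaceNode(ast, (node) =>
    node.type === "string" && node.content === "~"
        ? { type: "string", content: "\u00A0" }
        : undefined
);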
I did not know that. In what LaTeX context is it plain punctuation?
In expl3 syntax, ~ means a regular space.
I've mainly been playing with the pegjs grammar, which is now obviously not the right approach. Is there sample code on how to use/combine the various grammars in unified-latex-util-pegjs to parse latex into an AST? I've been searching around on the unified site but there's nothing on unified-latex there.
The entry point you probably want is here: https://github.com/siefkenj/unified-latex/tree/main/packages/unified-latex-util-parse
If you look in the various test.ts files, you'll see all sorts of examples of parsing with various options turned on or off.
OK so this gets me partway:
import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";
import { expandUnicodeLigatures } from "@unified-latex/unified-latex-util-ligatures";

const content = "\\url{x}{y} hello~\\#ffo";
const parser = unified().use(unifiedLatexFromString, {
    macros: {
        url: { signature: "m" },
        href: { signature: "m m" },
    },
});
const ast = parser.parse(content);
expandUnicodeLigatures(ast);
console.log(ast);
but I expected ffo to come out as ﬀo.
It appears ff is not on the list of ligatures. But I think that is the correct behavior. The ﬀ you wrote is one character. Normally, font shaping takes care of turning ligature pairs into a different glyph for rendering only; the underlying data doesn't get manipulated. If you really want ﬀ, you can look into the details of unified-latex-util-ligatures.
I'll consider that.
From unified-latex-util-ligatures:
This only applies in non-math mode, since programs like katex will process math ligatures.
Can I make unified-latex expand ligatures in math mode too? My output is not going to a webpage, so katex isn't going to help me. I can perhaps get it done using replaceNode, but if it's just a config somewhere, that'd be handy.
Given this:
const content = "\\href{~}{y}---";
const parser = unified().use(unifiedLatexFromString, {
    macros: {
        url: { signature: "m" },
        href: { signature: "m m" },
    },
});
const ast = parser.parse(content);
expandUnicodeLigatures(ast);
how can I make sure that the ~ inside the href macro isn't expanded into an NBSP? Can I mark some string nodes so they won't be touched by expandUnicodeLigatures?
You're going to have to recreate the expandUnicodeLigatures function with your custom decision processes. If you look through that function, you can see where it checks for math mode, etc. Try copy-pasting that code and modifying it to check if there is a macro parent with name href or url.
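A sketch of what that modified check might look like, assuming (unverified on my part) that visit exposes the ancestor chain of the visited node as info.parents:

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";
import { visit } from "@unified-latex/unified-latex-util-visit";

const parser = unified().use(unifiedLatexFromString, {
    macros: { href: { signature: "m m" } },
});
const ast = parser.parse("\\href{~}{y}---");

// Skip strings that sit anywhere inside an \href or \url macro; apply
// ligature expansion (not shown) only to the rest.
visit(ast, (node, info) => {
    const inUrlLike = info.parents.some(
        (p) => p.type === "macro" && (p.content === "href" || p.content === "url")
    );
    if (node.type === "string" && !inUrlLike) {
        // ...expand ligatures in `node` here...
    }
});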
How can I test for a macro in the ancestry path? And can I add new ligatures to parseLigatures to deal with $_1$?
Can expandUnicodeLigatures convert \c c to ç? Or is that a different plugin?
Did you look at the source code? The ligatures it understands are here: https://github.com/siefkenj/unified-latex/blob/main/packages/support-tables/ligature-macros.json and here: https://github.com/siefkenj/unified-latex/blob/main/packages/unified-latex-util-ligatures/libs/ligature-lookup.ts
How can I test for a macro in the ancestry path? And can I add new ligatures to parseLigatures to deal with $_1$?
In math mode, _ is a macro, so you cannot parse $_1$ as a ligature. You are going to have to use replaceNode(...).
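A sketch of that replaceNode route; it assumes $_1$ parses to a macro named "_" carrying the digit as its single argument, which is worth verifying in the playground:

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";
import { replaceNode } from "@unified-latex/unified-latex-util-replace";
import { printRaw } from "@unified-latex/unified-latex-util-print-raw";

const SUBSCRIPTS: Record<string, string> = { "0": "₀", "1": "₁", "2": "₂" };

const ast = unified().use(unifiedLatexFromString).parse("$_1$");

// Replace a math-mode subscript macro with a one-digit argument by the
// matching unicode subscript character; leave everything else alone.
replaceNode(ast, (node) => {
    if (node.type === "macro" && node.content === "_" && node.args?.length === 1) {
        const digit = printRaw(node.args[0].content);
        if (digit in SUBSCRIPTS) {
            return { type: "string", content: SUBSCRIPTS[digit] };
        }
    }
    return undefined;
});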