siefkenj / unified-latex

Utilities for parsing and manipulating LaTeX ASTs with the Unified.js framework
MIT License

building a bibtex plugin #65

Closed: retorquere closed this 6 months ago

retorquere commented 7 months ago

I'd like to take a stab at a bib(la)tex plugin. I have an existing parser which works well, but I wouldn't half mind sharing the work of upkeep 😄 . Where can I find information to get me started? Is this a sensible thing to do with unified-latex?

siefkenj commented 7 months ago

@retorquere That sounds great :-). It's my understanding that bibtex is more-or-less a completely independent language from latex? Can bibtex references be embedded in a tex document, or must they always be in a separate file?

To proceed, you first need to answer the question:

  1. Will unified-latex parse the source first, with the bibtex plugin massaging the result, or will the bibtex plugin work on the source string directly?

If the answer to 1. is the former, you want to look at the unified-latex-util-pgfkeys package, which takes already-parsed source and parses it further.

You can add a unified-latex-util-bibtex or unified-latex-bibtex folder (depending on whether it is standalone or will be called by other unified-latex libs) in the packages directory and follow the structure of the existing ones.

Let me know what other guidance you need!

retorquere commented 7 months ago

It's sort of its own language but also not -- it parses in the playground, but it has different parsing rules for different fields: some are verbatim, most are regular, and creators have special rules. So I'm not entirely sure how to answer your question.

siefkenj commented 7 months ago

Verbatim fields are the only ones that would really be trouble. Currently verbatim fields have to be handled in the latex.peggy grammar itself, and I like to avoid special rules, if possible, in that grammar.

retorquere commented 7 months ago

Not being able to parse verbatim fields is a showstopper though. It's not a bib(la)tex parser if it doesn't.

Does the parser have any error recovery, by the way? My own parser will fail individual entries if they're broken but will parse the rest; a lot of bibtex found in the wild is invalid because the actual bibtex parser is ridiculously tolerant. I couldn't get that to work in pegjs, so my own is a three-phase parser: chunking into entries, building an AST, and reparsing the AST, since some resolution can only be done by knowing the enclosing context.
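A minimal sketch of the chunking idea (hypothetical, not the actual parser): split the source at each @ that appears at brace depth zero, so one malformed entry can be discarded without losing the rest.

function chunkBibtex(source: string): string[] {
    // Split at each "@" seen outside of braces; a broken entry then only
    // poisons its own chunk, and the remaining chunks still parse.
    const chunks: string[] = [];
    let depth = 0;
    let start = 0;
    for (let i = 0; i < source.length; i++) {
        const ch = source[i];
        if (ch === "{") depth++;
        else if (ch === "}") depth = Math.max(0, depth - 1);
        else if (ch === "@" && depth === 0 && i > start) {
            chunks.push(source.slice(start, i));
            start = i;
        }
    }
    chunks.push(source.slice(start));
    return chunks.filter((c) => c.trim().length > 0);
}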

I suppose I could take my chunker, extend it to read fields, and then pass the field-content parsing to this parser. Is it possible to tell this parser to start in verbatim mode?

siefkenj commented 7 months ago

The unified-latex parser is designed to never error. It will keep things as strings if it doesn't know what to do (e.g., mismatched braces).
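For example, using the parse helper from unified-latex-util-parse, a stray brace comes back as an ordinary string node instead of throwing:

import { parse } from "@unified-latex/unified-latex-util-parse";

// parse() never throws; the mismatched "{" is kept as a plain string
// node in the resulting AST.
const ast = parse("a{b");
console.log(JSON.stringify(ast.content, null, 2));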

retorquere commented 7 months ago

That's useful, but the thing is that there are ways to know you've entered a new bibtex entry even if you are in an error state (or plain-string-returning state, if you will), so it'd be useful to resume non-string-return parsing at that point. That's what I mean by error recovery: I don't want to lose all entries just because of one stray brace in the first entry.

siefkenj commented 7 months ago

Is there a way to embed bibtex in latex, or is it always in a separate file?

siefkenj commented 7 months ago

Also, could you include here some pathological examples of bibtex where verbatim parsing is required?

retorquere commented 7 months ago

Is there a way to embed bibtex in latex, or is it always in a separate file?

Sort of - you can have it in a filecontents environment.

retorquere commented 7 months ago

Also, could you include here some pathological examples of bibtex where verbatim parsing is required?

I find "pathological" a rather strange choice of words for "documented bibtex/biblatex behavior", but an url field is verbatim, among others.

retorquere commented 7 months ago
\documentclass{article}
\usepackage[style=apa, backend=biber]{biblatex}
\usepackage{url}
\begin{filecontents}{\jobname.bib}
@online{example,
  author = {Author, A.},
  year = {2022},
  title = {The Title of the Webpage},
  url = {https://www.example.com/users/~someone},
  note = {Accessed January 17, 2024},
}
\end{filecontents}
\addbibresource{\jobname.bib}
\begin{document}
\cite{example}
\printbibliography{}
\end{document}
siefkenj commented 7 months ago

Sort of - you can have it in a filecontents environment.

Yes, you could do that of course, but I don't think anyone would expect unified-latex to guess your intention when parsing the content of the filecontents environment.

So, that says to me that a biblatex parser can be independent of the latex parser, since for each file, someone will decide whether to parse latex or biblatex.

retorquere commented 7 months ago

Sure, but within the fields, it's just latex. But I take it that what you're saying is that it's not worth the effort to make a bibtex plugin?

retorquere commented 7 months ago

Yes, you could do that of course, but I don't think anyone would expect unified-latex to guess your intention when parsing the content of the filecontents environment.

Oh yeah agreed.

siefkenj commented 7 months ago

Sure, but within the fields, it's just latex. But I take it that what you're saying is that it's not worth the effort to make a bibtex plugin?

I thought you said the fields needed to be verbatim?

retorquere commented 7 months ago

I thought you said the fields needed to be verbatim?

Only some fields, but most fields are plain LaTeX, some with additional interpretation rules on top of that for a smaller subset of fields (lists and name lists). Of the sample I posted, only the url field is verbatim, but if that field is not parsed as verbatim, then the parser is not a proper bibtex parser, is what I was trying to say. If all fields were verbatim, a parser would be trivial.

The well-known/documented verbatim fields are doi, eprint, file, files, pdf, ids, url and verba, verbb, verbc, ...; jabref adds groups, and mendeley exports file as non-verbatim. Preferably, which fields are to be parsed verbatim would be configurable, because the variability of in-the-wild bibtex is pretty large.
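A configurable verbatim list could be as simple as an options shape like the following (the names are illustrative, not an existing API):

// Hypothetical options for a bibtex parser with a configurable list of
// verbatim fields; the defaults follow the list above.
interface ParseBibtexOptions {
    // Fields whose content is kept as a raw string rather than parsed
    // as LaTeX.
    verbatimFields?: string[];
}

const defaultOptions: ParseBibtexOptions = {
    verbatimFields: ["doi", "eprint", "file", "files", "pdf", "ids", "url",
        "verba", "verbb", "verbc"],
};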

siefkenj commented 7 months ago

Is something like verba={\}, url={http://foo.com/{/baz} valid bibtex?

retorquere commented 7 months ago

No, because it has unbalanced braces. Verbatim fields must still have balanced braces. This is different from latex verbatim environments, which scan for the literal input \end{verbatim}, but similar to \verb|...|, which can't contain |.
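For example, url = {https://www.example.com/{a}} is fine because the inner braces balance, while url = {https://www.example.com/{a} is not: the field's closing brace can never be matched.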

retorquere commented 7 months ago

I can reliably distinguish fields though, so I could have a bibtex parser that just passes the field contents to unified-latex; that would also take care of error recovery. Given how trivial verbatim-mode fields are to parse, I could just pass only non-verbatim fields to unified-latex. Would that still be a plugin scenario, or would I just use unified-latex as a library? I would still have to post-process the latex token stream.

siefkenj commented 7 months ago

I think you'd use unified-latex as a library, given that a file won't be latex with embedded bibtex. The caller will have to indicate that they want bibtex parsed specifically.

You can have the output format match unified-latex's AST. Here would be my proposal:

  1. Create a unified-latex-bibtex package that has a parseBibtex function whose options are a superset of the parse options from unified-latex-util-parse.
  2. Parse all fields as verbatim with your parser (or a new parser)
  3. Run the content of the non-verbatim fields through unified-latex's parse function, passing down the options as appropriate.
  4. The results can be stuffed into the same AST format that unified-latex uses. For example, you can use the escapeToken field on a macro node to mimic a bibtex entry. E.g.

     {
         type: "macro",
         escapeToken: "@",
         content: "online",
         args: [
             {
                 type: "argument",
                 openMark: "{",
                 closeMark: "}",
                 content: [...]
             }
         ]
     }

The content should have = and , as individual strings and {...} as groups. If the content is verbatim, just put a single {type: "string", ...} with the content in there (it won't be mangled further). The other groups can contain the re-parsed data. If you set _renderData: {pgfkeysArgs: true} on the root macro, then you'll get nice formatting for free :-)
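A minimal sketch of how steps 2–4 might fit together for a single field, assuming fieldContent and isVerbatim come from the bibtex-level pass (fieldToArgument is a hypothetical helper, not part of unified-latex):

import { parse } from "@unified-latex/unified-latex-util-parse";
import type * as Ast from "@unified-latex/unified-latex-types";

// Hypothetical helper: turn one bibtex field value into an "argument"
// node following the proposal above.
function fieldToArgument(fieldContent: string, isVerbatim: boolean): Ast.Argument {
    // Verbatim fields become a single string node so they won't be
    // mangled further; everything else is re-parsed as ordinary latex.
    const content: Ast.Node[] = isVerbatim
        ? [{ type: "string", content: fieldContent }]
        : parse(fieldContent).content;
    return { type: "argument", openMark: "{", closeMark: "}", content };
}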

retorquere commented 7 months ago

I don't intend to render it, my parser converts bibtex to unicode for Zotero imports.

Where does unified-latex determine how many arguments a macro takes?

siefkenj commented 7 months ago

For pre-defined macros, an argSpec is specified in the xparse syntax. But a macro, after it's parsed, doesn't care about how many arguments it has.
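For example, a signature of "o m" declares one optional and one mandatory argument (shown here on a made-up \mymacro, just to illustrate the xparse syntax):

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";

// "o m" is xparse syntax: one optional [..] argument followed by one
// mandatory {..} argument. The parser only uses this while attaching
// arguments; the resulting macro node just carries whatever was found.
const parser = unified().use(unifiedLatexFromString, {
    macros: { mymacro: { signature: "o m" } },
});
const ast = parser.parse("\\mymacro[opt]{arg}");
console.log(ast);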

retorquere commented 7 months ago

I don't know how to interpret that 2nd sentence.

retorquere commented 7 months ago

I see that a\c cb parses into

{
    "type": "root",
    "content": [
        {
            "type": "string",
            "content": "a"
        },
        {
            "type": "macro",
            "content": "c"
        },
        {
            "type": "whitespace"
        },
        {
            "type": "string",
            "content": "cb"
        }
    ]
}

where I had expected something along the lines of

{
    "type": "root",
    "content": [
        {
            "type": "string",
            "content": "a"
        },
        {
            "type": "macro",
            "content": "c",
            "args": [
                {
                    "type": "argument",
                    "content": [
                        {
                            "type": "string",
                            "content": "c"
                        }
                    ]
                }
            ]
        },
        {
            "type": "string",
            "content": "b"
        }
    ]
}
siefkenj commented 7 months ago

You must tell the parser that \c has arguments by passing macros: {c: {signature: "m"}} to the parser.
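With that option, the earlier example parses the way you expected:

import { unified } from "unified";
import { unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";

// Declaring that \c takes one mandatory argument makes the parser
// consume the "c" after it, leaving "b" as a separate string node.
const parser = unified().use(unifiedLatexFromString, {
    macros: { c: { signature: "m" } },
});
const ast = parser.parse("a\\c cb");
// ast.content: string "a", macro "c" with argument "c", string "b"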

retorquere commented 7 months ago

How can I tell the parser that a macro treats its parameter as verbatim? eg \url{\x} should parse to

{ "type": "root", "content": [
  { "type": "macro", "content": "url", "args": [
    { "type": "argument", "content": [ { "type": "string", "content": "\\x" } ], "openMark": "{", "closeMark": "}" }
  ]}
]}

but in the playground it parses to

{ "type": "root", "content": [
  { "type": "macro", "content": "url", "args": [
    { "type": "argument", "content": [ { "type": "macro", "content": "x" } ], "openMark": "{", "closeMark": "}" }
  ]}
]}
siefkenj commented 7 months ago

Unfortunately that cannot be done :-(. Verbatim macros must be added to the PEG grammar itself. You can, however, post-process the content and call printRaw to get back what should be the original contents, provided there's nothing really strange in the URL.
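A sketch of that post-processing, assuming \url was declared with signature "m" so that its argument gets attached:

import { visit } from "@unified-latex/unified-latex-util-visit";
import { match } from "@unified-latex/unified-latex-util-match";
import { printRaw } from "@unified-latex/unified-latex-util-print-raw";
import type * as Ast from "@unified-latex/unified-latex-types";

// Collapse the (already parsed) arguments of every \url macro back into
// a single raw string; this recovers the original text as long as the
// URL contained nothing that parses lossily.
function rawifyUrls(ast: Ast.Ast) {
    visit(ast, (node) => {
        if (match.macro(node, "url") && node.args) {
            for (const arg of node.args) {
                arg.content = [{ type: "string", content: printRaw(arg.content) }];
            }
        }
    });
}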

I wouldn't object to adding \url to the grammar. That's what we had to do for the various listing packages. You could copy the approach used by the listing macros in https://github.com/siefkenj/unified-latex/blob/4811a0b2008d67c4bad5fd53700f86f4f5202868/packages/unified-latex-util-pegjs/grammars/latex.pegjs#L193

retorquere commented 7 months ago

Same would go for the first argument of the href macro.

siefkenj commented 7 months ago

Same would go for the first argument of the href macro.

Yes, I think so.

retorquere commented 7 months ago

I wouldn't object to adding \url to the grammar. That's what we had to do for the various listing packages. You could copy the approach used by the listing macros in

When you say "you could" you mean I could prepare a PR?

retorquere commented 7 months ago

You can, however, post-process the content and call a printRaw

I don't know where I would be doing this.

siefkenj commented 7 months ago

I wouldn't object to adding \url to the grammar. That's what we had to do for the various listing packages. You could copy the approach used by the listing macros in

When you say "you could" you mean I could prepare a PR?

Yep :-)

retorquere commented 7 months ago

~ parses to { "type": "root", "content": [ { "type": "string", "content": "~" } ] }, but really ~ is more like a macro.

retorquere commented 7 months ago

Yep :-)

#70

siefkenj commented 7 months ago

~ parses to { "type": "root", "content": [ { "type": "string", "content": "~" } ] }, but really ~ is more like a macro.

What do you mean?

retorquere commented 7 months ago

~ isn't the string ~, it's a non-breaking space.

siefkenj commented 7 months ago

Sometimes. It depends on the context :-). If the user wishes to do a replaceNode call to turn all {type:"string", content: "~"} into macros, they can do that. For basic parsing, it is just treated like punctuation.
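For example, with replaceNode from unified-latex-util-replace (the nbsp macro name here is just illustrative):

import { replaceNode } from "@unified-latex/unified-latex-util-replace";
import { match } from "@unified-latex/unified-latex-util-match";
import type * as Ast from "@unified-latex/unified-latex-types";

// Turn every string node "~" into a macro node; returning undefined for
// other nodes leaves them untouched.
function tildeToMacro(ast: Ast.Ast) {
    replaceNode(ast, (node) => {
        if (match.string(node, "~")) {
            return { type: "macro", content: "nbsp" } as Ast.Macro;
        }
    });
}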

retorquere commented 7 months ago

I did not know that. In what LaTeX context is it plain punctuation?

siefkenj commented 7 months ago

In expl3 syntax ~ means a regular space.

retorquere commented 7 months ago

I've mainly been playing with the pegjs grammar, which is now obviously not the right approach. Is there sample code on how to use/combine the various grammars in unified-latex-util-pegjs to parse latex into an AST? I've been searching around on the unified site but there's nothing on unified-latex there.

siefkenj commented 7 months ago

The entry point you probably want is here: https://github.com/siefkenj/unified-latex/tree/main/packages/unified-latex-util-parse

If you look in the various test.ts files, you'll see all sorts of examples of parsing with various options turned on or off.

retorquere commented 7 months ago

OK so this gets me partway:

import { unified } from "unified"
import { parse, unifiedLatexFromString } from "@unified-latex/unified-latex-util-parse";
import { replaceNode } from '@unified-latex/unified-latex-util-replace'
import { expandUnicodeLigatures } from '@unified-latex/unified-latex-util-ligatures'

// const ast = parse('<<Some text>>hello\\#ﬀo')
// console.log(ast)

const content = '\\url{x}{y} hello~\\#ﬀo'

const parser = unified().use(unifiedLatexFromString, {
    macros: {
        url: { signature: "m" },
        href: { signature: "m m" },
    },
});

const ast = parser.parse(content);
expandUnicodeLigatures(ast)
console.log(ast);

but I expected ﬀo to come out as ffo.

siefkenj commented 7 months ago

It appears ﬀ is not on the list of ligatures. But I think that is the correct behavior. The ﬀ you wrote is one character. Normally font shaping takes care of turning ligature pairs into a different glyph for rendering only; the underlying data doesn't get manipulated. If you really want ff, you can look into the details of unified-latex-util-ligatures

retorquere commented 7 months ago

I'll consider that.

From unified-latex-util-ligatures

This only applies in non-math mode, since programs like katex will process math ligatures.

Can I make unified-latex expand ligatures in math-mode too? My output is not going to a webpage, so katex isn't going to help me. I can perhaps get it done using replaceNode but if it's just a config somewhere that'd be handy.

retorquere commented 7 months ago

Given this:

const content = '\\href{~}{y}---'
const parser = unified().use(unifiedLatexFromString, {
  macros: {
    url: { signature: "m" },
    href: { signature: "m m" },
  },
});
const ast = parser.parse(content);
expandUnicodeLigatures(ast)

how can I make sure that ~ inside the href macro isn't expanded into an NBSP? Can I mark some string nodes so they won't be touched by expandUnicodeLigatures?

siefkenj commented 7 months ago

You're going to have to recreate the expandUnicodeLigatures function with your custom decision processes. If you look through that function, you can see where it checks for math mode, etc. Try copy-pasting that code and modifying it to check if there is a macro parent with name href or url.
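A sketch of the ancestry check that would be spliced into such a copy, using the info argument that visit passes to its callback:

import { visit } from "@unified-latex/unified-latex-util-visit";
import { match } from "@unified-latex/unified-latex-util-match";
import type * as Ast from "@unified-latex/unified-latex-types";

// info.parents lists the ancestors of the current node, so strings
// sitting under an \href or \url macro can be skipped.
function expandLigaturesSkippingUrls(ast: Ast.Ast) {
    visit(ast, (node, info) => {
        const underUrlLike = info.parents.some(
            (p) => match.macro(p, "href") || match.macro(p, "url")
        );
        if (underUrlLike) return;
        // ...perform the ligature replacement on `node` here, as in the
        // original expandUnicodeLigatures implementation...
    });
}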

retorquere commented 7 months ago

How can I test for a macro in the ancestry path? And can I add new ligatures to parseLigatures to deal with $_1$?

retorquere commented 7 months ago

Can expandUnicodeLigatures convert \c c to ç? Or is that a different plugin?

siefkenj commented 7 months ago

Did you look at the source code? The ligatures it understands are here: https://github.com/siefkenj/unified-latex/blob/main/packages/support-tables/ligature-macros.json and here: https://github.com/siefkenj/unified-latex/blob/main/packages/unified-latex-util-ligatures/libs/ligature-lookup.ts

siefkenj commented 7 months ago

How can I test for a macro in the ancestry path? And can I add new ligatures to parseLigatures to deal with $_1$?

In math mode _ is a macro, so you cannot parse $_1$ as a ligature. You are going to have to use replaceNode(...).
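A sketch of that approach, mapping _<digit> macros (as parsed in math mode) to Unicode subscript characters via replaceNode; the digit table is illustrative:

import { replaceNode } from "@unified-latex/unified-latex-util-replace";
import { match } from "@unified-latex/unified-latex-util-match";
import { printRaw } from "@unified-latex/unified-latex-util-print-raw";
import type * as Ast from "@unified-latex/unified-latex-types";

// Illustrative digit-to-subscript table.
const SUBSCRIPTS: Record<string, string> = {
    "0": "₀", "1": "₁", "2": "₂", "3": "₃", "4": "₄",
    "5": "₅", "6": "₆", "7": "₇", "8": "₈", "9": "₉",
};

// Replace the math-mode macro `_` with a Unicode subscript when its
// argument is a single digit; other nodes are left untouched.
function subscriptDigits(ast: Ast.Ast) {
    replaceNode(ast, (node) => {
        if (match.macro(node, "_") && node.args?.length) {
            const arg = printRaw(node.args[node.args.length - 1].content);
            if (arg in SUBSCRIPTS) {
                return { type: "string", content: SUBSCRIPTS[arg] } as Ast.String;
            }
        }
    });
}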