leanprover / doc-gen4

Document Generator for Lean 4
Apache License 2.0
62 stars 41 forks source link

(obsolete) feat: add bibliography support #200

Closed acmepjz closed 1 month ago

acmepjz commented 2 months ago

This PR adds bibliography support to doc-gen4. It includes:

To use it in mathlib, the references.bib in mathlib needs to be changed slightly, as it contains something unrecognized by pybtex. A PR to mathlib will be open later.

This PR has all commit history squashed. To see the full history, see the branch https://github.com/acmepjz/doc-gen4/tree/test2.

acmepjz commented 2 months ago

Screenshots:

2024-06-20-034859

2024-06-22-044049

screenshot

eric-wieser commented 2 months ago

Maybe the python compromise here is to have doc-gen consume the json file, and let mathlib convert its bib file to json (via python) during the build?

acmepjz commented 2 months ago

Maybe the python compromise here is to have doc-gen consume the json file, and let mathlib convert its bib file to json (via python) during the build?

I think this is also OK if current mathlib build process already uses python. I can change the build script to recognize both references.json and references.bib in the docs path.

hargoniX commented 2 months ago

This just amounts to changing the feature from "use python and it's integrated with doc-gen" to "use python and it's less integrated with doc-gen". That's not acceptable, the solution needs to be properly integrated with doc-gen and should either use only Lean or if someone has a compelling argument for why it has to be pybtex Python.

acmepjz commented 2 months ago

This just amounts to changing the feature from "use python and it's integrated with doc-gen" to "use python and it's less integrated with doc-gen". That's not acceptable, the solution needs to be properly integrated with doc-gen and should either use only Lean or if someone has a compelling argument for why it has to be pybtex Python.

No, the user is not needed to use python, if they can generate references.json by other means. (In fact pybtex itself cannot generate that file either, it's obtained from postprocessing the output HTML of pybtex.) In fact the format of this file is very straightforward (see https://github.com/acmepjz/doc-gen4/blob/ef68e7450a414b38a3de50f9b3c0a5f3249cd0b9/DocGen4/Output/Base.lean#L18), it is an array of objects which consists of 4 fields:

/-- The structure representing a processed bibitem. -/
structure BibItem where
  /-- The cite key as in the bib file. -/
  citekey : String
  /-- The tag generated by bib processor, e.g. `[Doe12]`.
  Should be plain text and should not be escaped. -/
  tag : String
  /-- The HTML generated by bib processor, e.g.
  `John Doe. <i>Test</i>. 2012.`-/
  html : String
  /-- The plain text form of `html` field. Should not be escaped. -/
  plaintext : String
  deriving FromJson, ToJson
hargoniX commented 2 months ago

Yes that is the precise issue, the whole feature will be less integrated. There will be no trivial way to generate the .json file apart from what you described. Hence you are effectively forcing people to use the solution that I am trying to avoid here.

eric-wieser commented 2 months ago

Users could write the json file by hand if wanted. Agreed that's less convenient than doing it in .bib, but I think mathematical users would much prefer to be able to include references in their documentation today at the expense of either running pybtex once or adding to CI, than the status quo since lean 3 of not being able to have them at all. The compromise I suggested was intended to avoid having doc-gen itself take the brunt of the dependency management.

Either way, I think there's a lot of good stuff in this PR that can be cleaned up and reused whichever way this python thing goes, so maybe discussion here should focus on the other technical aspects and leave the Python discussion to an issue / zulip?

acmepjz commented 2 months ago

... the only thing that's left is to replace pybtex with the Lean Bibtex parser. Note that I did not take a look at Pybtex.lean for that reason.

After the bibtex parser has removed the batteries dependency we can just pull it in here as a dependency already and implement the feature, then merge this PR. Once the bibtex parser gets strong enough to understand mathlib's references.bib we should be able to simply upgrade it and mathlib gets the feature.

I should say that we need a bibtex renderer, not only parser. There are several things need to be done, for example (1) generate abbreviations from author names and year, which is not trivial (how to generate abbreviations of last names? there are a lot of corner cases); (2) generate the output from bibitem, again there are a lot of corner cases, for example if there are no authors, but only editors, then use editors instead, etc. etc...

acmepjz commented 2 months ago

Oops. What's wrong with the build script? I think it has not changed since the last successful build.

hargoniX commented 2 months ago

Yes but batteries changed, I fixed this on main. If you rebase this branch on top of it it should work again.

hargoniX commented 2 months ago

You merged the branch, please do a rebase instead, I prefer to keep a linear history in doc-gen if possible (which it is here)

acmepjz commented 2 months ago

You merged the branch, please do a rebase instead, I prefer to keep a linear history in doc-gen if possible (which it is here)

Sorry for that. I think it's better for me to re-open a new PR when all of this are done; now some of the codes are still WIP.

hargoniX commented 2 months ago

Just to make this clear: I will not merge code that uses pybtex for the reasons outlined in previous discussions, putting effort into documenting it or exploring optimizations such as stdin/stdout does not seem to be a productive use of time for me.

hargoniX commented 2 months ago

This is not to say that this PR has generally no chance of being merged, as previously explained: All of the infrastructure that you set up here is good after the rounds of review now. However we have the capability to do BibTex things ourselves with Bibtex.lean so I will not allow in a Python dependency or workarounds that don't explicitly say but effectively force a Python dependency. If you make this work with Bibtex.lean I am going to merge the PR.

acmepjz commented 2 months ago

I'll re-open a PR without Python alternatives when the Bibtex.lean is done.

acmepjz commented 1 month ago

I think I'll close this PR. The branch will be kept as a historical reference.