leanprover / doc-gen4

Document Generator for Lean 4
Apache License 2.0
62 stars 41 forks source link

feat: change the markdown parser from `cmark` to `md4c` #193

Closed acmepjz closed 2 months ago

acmepjz commented 2 months ago

The md4c markdown parser (https://github.com/mity/md4c) has built-in feature for LaTeX formula and tables, etc. A Lean wrapper for md4c (https://github.com/acmepjz/md4lean) is created. Testing shows that it fixes LaTeX formula render problems, see https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Contribution.20to.20doc-gen4.3F/near/443883080.

Since this is a big infrastructure change, more extensive tests are needed.

Closes #178.

hargoniX commented 2 months ago

This is some great work! I'll look into some more intensive testing later today but a few more general things I'd like to see apart from that:

  1. Please stick to our commit convention: https://lean-lang.org/lean4/doc/dev/commit_convention.html
  2. Please do get rid of the membuf thing in favor of a lean_string directly. It just makes the FFI a bit more complicated than it needs to be.
  3. I'm guessing we'll wait for https://github.com/acmepjz/md4lean/pull/1 before merging this?
  4. Can I have commit access to your repository? The reason that I'm asking is, that if we do a Lean version upgrade and something in your repo breaks you end up becoming the critical path on that upgrade. If you are not available to fix that right away (which I can understand) I would just fork your repo and fix it in my own fork. The principle of having push access to CMark.lean instead of having a fork worked quite well so I would like to keep it.
  5. I see that you have a little bit of diff compared to upstream md4c. Do you plan on upstreaming this? Or will we just forever have this mini fork laying around.
eric-wieser commented 2 months ago

2. Please do get rid of the membuf thing in favor of a lean_string directly. It just makes the FFI a bit more complicated than it needs to be.

The lean_string API does not allow appending a byte array to a a string object without:

Using Lean's buffer.h would be a better translation, but then this C code would be C++ and I don't know if that's harder to build with lake.

acmepjz commented 2 months ago
  1. Please stick to our commit convention: https://lean-lang.org/lean4/doc/dev/commit_convention.html

I have made a slight change to the description of this PR. Is that correct? I think this is not quite a feat, but I don't know what exactly it is.

  1. Please do get rid of the membuf thing in favor of a lean_string directly. It just makes the FFI a bit more complicated than it needs to be.

I'm still investigating.

  1. I'm guessing we'll wait for use an Option String return value acmepjz/md4lean#1 before merging this?

Done.

  1. Can I have commit access to your repository? The reason that I'm asking is, that if we do a Lean version upgrade and something in your repo breaks you end up becoming the critical path on that upgrade. If you are not available to fix that right away (which I can understand) I would just fork your repo and fix it in my own fork. The principle of having push access to CMark.lean instead of having a fork worked quite well so I would like to keep it.

Sure. Do you think it's better if I transfer the ownership of the repository to leanprover?

  1. I see that you have a little bit of diff compared to upstream md4c. Do you plan on upstreaming this? Or will we just forever have this mini fork laying around.

I think it's a good feature for the upstream. But let me ask them if they want this feature. Meanwhile, it's suggested here https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Contribution.20to.20doc-gen4.3F/near/443948345 that it's better to just use the original behavior <x-equation> of md4c. What do you think?

acmepjz commented 2 months ago

The lean_string API does not allow appending a byte array to a a string object without:

  • Making a new heap allocation for the byte array
  • appending a new \0 terminator every time

We have a lean_mk_string_from_bytes. Let me check if it can be used.

Using Lean's buffer.h would be a better translation, but then this C code would be C++ and I don't know if that's harder to build with lake.

Are these exported publicly? If not, then I'm afraid that they cannot be called outside.

hargoniX commented 2 months ago

Sure. Do you think it's better if I transfer the ownership of the repository to leanprover?

I don't thing we should do that, this would make the repository somewhat official and people might start using it, expecting that it gets proper maintenance etc. Given that this is only a temporary solution with Verso at the horizon I would like to avoid that.

that it's better to just use the original behavior of md4c. What do you think?

I don't have a strong opinion on this as long as it produces results that people are content with.

The lean_string API does not allow appending a byte array to a a string object without:

In that case you can use the Lean byte array API and later call the functions to make a string from it I guess?

acmepjz commented 2 months ago
  1. Please do get rid of the membuf thing in favor of a lean_string directly. It just makes the FFI a bit more complicated than it needs to be.

Done in https://github.com/acmepjz/md4lean/blob/main/wrapper/wrapper.c but please review the code to see if it's correct.

  1. Can I have commit access to your repository? The reason that I'm asking is, that if we do a Lean version upgrade and something in your repo breaks you end up becoming the critical path on that upgrade. If you are not available to fix that right away (which I can understand) I would just fork your repo and fix it in my own fork. The principle of having push access to CMark.lean instead of having a fork worked quite well so I would like to keep it.

Invitation sent.

hargoniX commented 2 months ago

Okay all of this looks good so far. Does someone have a list of files in mathlib at hand so we can check if they render better/fine now?

acmepjz commented 2 months ago

https://leanprover-community.github.io/mathlib4_docs/Mathlib/Algebra/Module/Submodule/Map.html#Submodule.compatibleMaps mentioned in #178

https://leanprover-community.github.io/mathlib4_docs/Mathlib/AlgebraicGeometry/EllipticCurve/DivisionPolynomial/Basic.html mentioned in https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Contribution.20to.20doc-gen4.3F/near/443833254

hargoniX commented 2 months ago

Built the docs locally, it looks good to me now :+1:. I'm going to merge this unless anyone has additional concerns.

eric-wieser commented 2 months ago

In that case you can use the Lean byte array API and later call the functions to make a string from it I guess?

This sounds like a good idea to me, but certainly the payoff of doing the work is pretty marginal.