siddhartha-gadgil / LeanAide

Tools based on AI for helping with Lean 4
Apache License 2.0
65 stars 5 forks source link

LaTeX converted partially to unicode #6

Closed siddhartha-gadgil closed 1 year ago

siddhartha-gadgil commented 2 years ago

This is a step for #2

siddhartha-gadgil commented 2 years ago

Instead of a bunch of replacements, it is better to use Lean's substrings and look for sequences of characters beginning with a single backslash. We can split into such sequences and the regions in between (as is done for identifiers), replace and rebuild.

0art0 commented 2 years ago

When dealing with strings, LaTeX commands like \mathbb and \frac give invalid escape sequence errors since they begin with a backslash (changing to double-backslash for LaTeX formulas in strings - such as \\mathbb and \\frac - fixes this).

No such issue arises in Lean comments.

siddhartha-gadgil commented 2 years ago

@0art0 Good point. We should have special rules for these.

0art0 commented 2 years ago

Instead of a bunch of replacements, it is better to use Lean's substrings and look for sequences of characters beginning with a single backslash. We can split into such sequences and the regions in between (as is done for identifiers), replace and rebuild.

Building on this, we can extract the portions of the docstring enclosed in $s, translate them partially or completely to Lean syntax with Unicode using a combination of rule-based and Codex translation, and then place them back with backticks instead of dollars.

0art0 commented 2 years ago

I have started making a list of LaTeX syntax that can be mapped to Lean code programmatically. Others are welcome to edit this message to expand this list.

Rule-based translations

0art0 commented 2 years ago

These are a few common cases I could think of where rule-based mapping of syntax may be difficult. They can be handled using Codex and appropriate input-dependent prompting.

Input-dependent prompting

siddhartha-gadgil commented 2 years ago

For most of these, we should let Codex do the work and map minimally. You can see in comments in #2 that Codex handled finite sequences fine. At least in this round, we will not build elaborate rule-based translations but only have rules where a substitution will lead to the wrong meaning.

siddhartha-gadgil commented 2 years ago

Also note that we have a large file of transformations. This includes inequalities, subsets and a lot more.

0art0 commented 2 years ago

For inequalities and subsets, we must also handle iterated versions, i.e., $a \leq b \leq c$ or $A \subset B \subset C \subset D$.

siddhartha-gadgil commented 2 years ago

For inequalities and subsets, we must also handle iterated versions, i.e., a≤b≤c or A⊂B⊂C⊂D.

For this it is best to add syntax in Lean. I tried this for a few minutes and it did not work, but with help from Lean 4 Zulip it should be possible.

siddhartha-gadgil commented 1 year ago

Looks like Chat-GPT does this fine. So at most it looks like two steps for GPT-3.5-turbo or GPT-4. It even handles stuff like $\frac{x}{y}$ and $\mathbb{R}$.