Closed siddhartha-gadgil closed 1 year ago
Instead of a bunch of replacements, it is better to use Lean's substrings and look for sequences of characters beginning with a single backslash. We can split into such sequences and the regions in between (as is done for identifiers), replace and rebuild.
When dealing with strings, LaTeX commands like \mathbb
and \frac
give invalid escape sequence
errors since they begin with a backslash (changing to double-backslash for LaTeX formulas in strings - such as \\mathbb
and \\frac
- fixes this).
No such issue arises in Lean comments.
@0art0 Good point. We should have special rules for these.
Instead of a bunch of replacements, it is better to use Lean's substrings and look for sequences of characters beginning with a single backslash. We can split into such sequences and the regions in between (as is done for identifiers), replace and rebuild.
Building on this, we can extract the portions of the docstring enclosed in $
s, translate them partially or completely to Lean syntax with Unicode using a combination of rule-based and Codex translation, and then place them back with backticks instead of dollars.
I have started making a list of LaTeX syntax that can be mapped to Lean code programmatically. Others are welcome to edit this message to expand this list.
These are a few common cases I could think of where rule-based mapping of syntax may be difficult. They can be handled using Codex and appropriate input-dependent prompting.
For most of these, we should let Codex do the work and map minimally. You can see in comments in #2 that Codex handled finite sequences fine. At least in this round, we will not build elaborate rule-based translations but only have rules where a substitution will lead to the wrong meaning.
Also note that we have a large file of transformations. This includes inequalities, subsets and a lot more.
For inequalities and subsets, we must also handle iterated versions, i.e., $a \leq b \leq c$ or $A \subset B \subset C \subset D$.
For inequalities and subsets, we must also handle iterated versions, i.e., a≤b≤c or A⊂B⊂C⊂D.
For this it is best to add syntax in Lean. I tried this for a few minutes and it did not work, but with help from Lean 4 Zulip it should be possible.
Looks like Chat-GPT does this fine. So at most it looks like two steps for GPT-3.5-turbo or GPT-4. It even handles stuff like $\frac{x}{y}$
and $\mathbb{R}$.
$$
and$
with backticksThis is a step for #2