siddhartha-gadgil commented 2 years ago

We only translate at symbol level, leaving underscores etc untouched.
We also replace $$ and $ with backticks
A useful database is https://github.com/leanprover/vscode-lean4/blob/master/vscode-lean4/src/abbreviation/abbreviations.json

This is a step for #2

siddhartha-gadgil commented 2 years ago

Instead of a bunch of replacements, it is better to use Lean's substrings and look for sequences of characters beginning with a single backslash. We can split into such sequences and the regions in between (as is done for identifiers), replace and rebuild.

0art0 commented 2 years ago

When dealing with strings, LaTeX commands like \mathbb and \frac give invalid escape sequence errors since they begin with a backslash (changing to double-backslash for LaTeX formulas in strings - such as \\mathbb and \\frac - fixes this).

No such issue arises in Lean comments.

siddhartha-gadgil commented 2 years ago

@0art0 Good point. We should have special rules for these.

0art0 commented 2 years ago

Instead of a bunch of replacements, it is better to use Lean's substrings and look for sequences of characters beginning with a single backslash. We can split into such sequences and the regions in between (as is done for identifiers), replace and rebuild.

Building on this, we can extract the portions of the docstring enclosed in $s, translate them partially or completely to Lean syntax with Unicode using a combination of rule-based and Codex translation, and then place them back with backticks instead of dollars.

0art0 commented 2 years ago

I have started making a list of LaTeX syntax that can be mapped to Lean code programmatically. Others are welcome to edit this message to expand this list.

Rule-based translations

Inequalities ($\le$, $\leq$, $a \leq b \leq c$)
Subsets ($A \subset B$, $A \subset B \subset C$)
Binomial coefficients ($\binom{n}{k}$, $n \choose k$)
Factorials
Fractions ($\frac{1}{2}$)
Roots ($\sqrt{4}$, $\sqrt[3]{5}$)
Powers, exponents
Floor, ceiling, absolute value functions
Intervals ($[0, 1]$, $(2, 3]$)
Unions and intersections (both finite and infinite)
Embellished names ($\mathbb{R}, $\Hom$, $\operatorname{Int}$)

0art0 commented 2 years ago

These are a few common cases I could think of where rule-based mapping of syntax may be difficult. They can be handled using Codex and appropriate input-dependent prompting.

Input-dependent prompting

Finite sums and products ($\sum{n = 1}^{10} n^2$, $x{1} \cdot x{2} \cdot \cdots \cdot x{n}$)
Infinite sums and products ($\sum_{n=1}^{\infty} \frac{1}{n^2}$)
Limits ($\lim_{n \to \infty} \frac{1}{n}$)
Multiplication without dot ($ab = ba$) (Comment: Since this is based on the lack of a pattern, there is no clear way to do input-dependent prompting. We can just include a few fix ed prompts to handle this kind of syntax)
Finite sequences ($a{1}, a{2}, \ldots, a_{n}$)
Big O notation
Integrals ($\int_{0}^{5} 2x dx$)
Subscripts (for $t \in [0, 1]$, $f_{t}$ is a continuous map from $X$ to $Y$)
Modular arithmetic/Congruence ($2 \equiv 7 \pmod 5$)
Exists unique ($\exists!$)

siddhartha-gadgil commented 2 years ago

For most of these, we should let Codex do the work and map minimally. You can see in comments in #2 that Codex handled finite sequences fine. At least in this round, we will not build elaborate rule-based translations but only have rules where a substitution will lead to the wrong meaning.

siddhartha-gadgil commented 2 years ago

Also note that we have a large file of transformations. This includes inequalities, subsets and a lot more.

0art0 commented 2 years ago

For inequalities and subsets, we must also handle iterated versions, i.e., $a \leq b \leq c$ or $A \subset B \subset C \subset D$.

siddhartha-gadgil commented 2 years ago

For inequalities and subsets, we must also handle iterated versions, i.e., a≤b≤c or A⊂B⊂C⊂D.

For this it is best to add syntax in Lean. I tried this for a few minutes and it did not work, but with help from Lean 4 Zulip it should be possible.

siddhartha-gadgil commented 1 year ago

Looks like Chat-GPT does this fine. So at most it looks like two steps for GPT-3.5-turbo or GPT-4. It even handles stuff like $\frac{x}{y}$ and $\mathbb{R}$.

siddhartha-gadgil / LeanAide

LaTeX converted partially to unicode #6

Rule-based translations

Input-dependent prompting