lukas-blecher / LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.
https://lukas-blecher.github.io/LaTeX-OCR/
MIT License

How to create a new dataset for testing? #67

Closed · aspnetcs closed this issue 2 years ago

aspnetcs commented 2 years ago

How do I create a new dataset for testing?

lukas-blecher commented 2 years ago

Please elaborate. From what data? In principle you need a text file with normalized latex equations and corresponding images that are named in such a way that each image file name matches a line in the text file. Example: 0000.png - first line, 0001.png - second line, ...

If you need to scrape more data you can look into the methods I wrote in dataset/scraping.py.
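For concreteness, the layout described above would look something like this (the file names here are just an illustration):

data/
  math.txt        <- one normalized LaTeX equation per line
  images/
    0000.png      <- rendered from line 1 of math.txt
    0001.png      <- rendered from line 2
    ...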

aspnetcs commented 2 years ago

How do I convert the data at this URL into your pkl format?

https://github.com/LinXueyuanStdio/Data-for-LaTeX_OCR/tree/d8dd211270746a86caf85cbe5aab93f2a4bee0df

lukas-blecher commented 2 years ago

As far as I can see it is straightforward. For the small dataset, the training pkl file would be:

python dataset/dataset.py --equations data/small/formulas/train.formulas.norm.txt --images data/small/images_train --tokenizer dataset/tokenizer.json --out data/small/train.pkl

This only holds true while the matching is trivial.

Note: the images are differently sized. I've opted to pad each dimension to a multiple of 32; they chose 20 for the height and 80 for the width. There is the option to set pad to true in the config file, but padding in real time is much slower than preprocessing the images. Use https://github.com/lukas-blecher/LaTeX-OCR/blob/ba1b7285799f0ee3b78925029e7e521444974a71/utils/utils.py#L73-L104
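For illustration, a minimal stand-alone sketch of that preprocessing idea (a hypothetical helper using PIL, not the linked utils function, which handles more cases):

from PIL import Image

def pad_to_multiple(path, divisor=32):
    # pad an equation image with white so both dimensions become multiples of `divisor`
    img = Image.open(path).convert('L')
    w, h = img.size
    new_w = -(-w // divisor) * divisor  # ceiling to the next multiple
    new_h = -(-h // divisor) * divisor
    canvas = Image.new('L', (new_w, new_h), 255)  # white background
    canvas.paste(img, (0, 0))
    return canvas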

aspnetcs commented 2 years ago

python dataset/dataset.py --equations dataset/data/preprocessx/Data-for-LaTeX_OCR/full/formulas/train.formulas.norm.txt --images dataset/data/preprocessx/Data-for-LaTeX_OCR/full/images_train --tokenizer dataset/tokenizer.json --out dataset/data/preprocessx/Data-for-LaTeX_OCR/full/train_full.pkl

python dataset/dataset.py --equations dataset/data/preprocessx/Data-for-LaTeX_OCR/small/formulas/train.formulas.norm.txt --images dataset/data/preprocessx/Data-for-LaTeX_OCR/small/images_train --tokenizer dataset/tokenizer.json --out dataset/data/preprocessx/Data-for-LaTeX_OCR/small/train_small.pkl

These two commands generate train_full.pkl and train_small.pkl respectively, and their sizes are both 27576 bytes. Are the results wrong?

Screenshot of the result is as follows:

(tf_1.12) @.***:/home/code/LaTeX-OCR/dataset/data/preprocessx/Data-for-LaTeX_OCR/small# ll
total 60
drwxr-xr-x 5 root root  4096 Dec 29 09:36 ./
drwxr-xr-x 6 root root  4096 Aug 27  2019 ../
-rw-r--r-- 1 root root   592 Aug 27  2019 README.md
-rw-r--r-- 1 root root  1114 Aug 27  2019 data.json
drwxr-xr-x 2 root root  4096 Aug 27  2019 formulas/
drwxr-xr-x 5 root root  4096 Aug 27  2019 images/
drwxr-xr-x 2 root root  4096 Aug 27  2019 matching/
-rw-r--r-- 1 root root 27576 Dec 29 09:36 train_small.pkl
-rw-r--r-- 1 root root   174 Aug 27  2019 vocab.json
(tf_1.12) @.***:/home/code/LaTeX-OCR/dataset/data/preprocessx/Data-for-LaTeX_OCR/small# cd ..
(tf_1.12) @.***:/home/code/LaTeX-OCR/dataset/data/preprocessx/Data-for-LaTeX_OCR# cd full
(tf_1.12) @.***:/home/code/LaTeX-OCR/dataset/data/preprocessx/Data-for-LaTeX_OCR/full# ll
total 68
drwxr-xr-x 5 root root  4096 Dec 29 13:31 ./
drwxr-xr-x 6 root root  4096 Aug 27  2019 ../
-rw-r--r-- 1 root root  6148 Aug 27  2019 .DS_Store
-rw-r--r-- 1 root root   613 Aug 27  2019 README.md
-rw-r--r-- 1 root root  1077 Aug 27  2019 data.json
drwxr-xr-x 2 root root  4096 Aug 27  2019 formulas/
drwxr-xr-x 5 root root  4096 Aug 27  2019 images/
drwxr-xr-x 2 root root  4096 Aug 27  2019 matching/
-rw-r--r-- 1 root root 27576 Dec 29 13:31 train_full.pkl
-rw-r--r-- 1 root root   173 Aug 27  2019 vocab.json
(tf_1.12) @.***:/home/code/LaTeX-OCR/dataset/data/preprocessx/Data-for-LaTeX_OCR/full#

lukas-blecher commented 2 years ago

They should be differently sized if the amount of data is different. It looks like the path to the images is wrong:

python dataset/dataset.py --equations dataset/data/preprocessx/Data-for-LaTeX_OCR/full/formulas/train.formulas.norm.txt --images dataset/data/preprocessx/Data-for-LaTeX_OCR/full/images/images_train --tokenizer dataset/tokenizer.json --out dataset/data/preprocessx/Data-for-LaTeX_OCR/full/train_full.pkl

aspnetcs commented 2 years ago

How to use pytorch's LBFGS algorithm in your LaTeX-OCR project?

lukas-blecher commented 2 years ago

Change https://github.com/lukas-blecher/LaTeX-OCR/blob/97c9644e776862234f71c0913b9b2a4e8da2fc15/train.py#L38 to

 opt = optim.LBFGS(model.parameters())
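One caveat, as a generic PyTorch note rather than something specific to this repo: LBFGS re-evaluates the loss multiple times per step, so opt.step must be called with a closure. A runnable toy sketch of the pattern (the linear model and dummy batch stand in for the real training step):

import torch
import torch.optim as optim
import torch.nn.functional as F

model = torch.nn.Linear(4, 1)                # stand-in for the real model
opt = optim.LBFGS(model.parameters())
x, y = torch.randn(8, 4), torch.randn(8, 1)  # dummy batch

def closure():
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)           # stands in for the loss train.py computes
    loss.backward()
    return loss

opt.step(closure)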
aspnetcs commented 2 years ago

Thank you, you are a great man.

"The model consist of a ViT [1] encoder with a ResNet backbone and a Transformer [2] decoder." In your latex-ocr project (https://github.com/lukas-blecher/LaTeX-OCR), would you like to give me the reference articles (ViT [1] and Transformer [2])?

lukas-blecher commented 2 years ago

Haha thanks.

[1] and [2] are listed at the end of the readme: https://github.com/lukas-blecher/LaTeX-OCR#references

aspnetcs commented 2 years ago

I want to convert this fullhand dataset (https://github.com/LinXueyuanStdio/Data-for-LaTeX_OCR/tree/d8dd211270746a86caf85cbe5aab93f2a4bee0df/fullhand) into a pkl file, but the following error occurred. How do I fix it?

C:\Users\demo\Desktop\im2latex\LaTeX-OCR-main>python dataset/dataset.py --equations C:\Users\demo\Desktop\im2latex\latex-ocr-datasets\Data-for-LaTeX_OCR\fullhand\formulas\formulas.norm.txt --images C:\Users\demo\Desktop\im2latex\latex-ocr-datasets\Data-for-LaTeX_OCR\fullhand\images --tokenizer dataset/tokenizer.json --out fullhand.pkl
Generate dataset
0%| | 5/99552 [00:00<3:58:57, 6.94it/s]
Traceback (most recent call last):
  File "dataset/dataset.py", line 247, in <module>
    Im2LatexDataset(args.equations, args.images, args.tokenizer).save(args.out)
  File "dataset/dataset.py", line 101, in __init__
    self.data[(width, height)].append((eqs[self.indices[i]], im))
IndexError: list index out of range

C:\Users\demo\Desktop\im2latex\LaTeX-OCR-main>

lukas-blecher commented 2 years ago

This is because the matching is not trivial. You would need to create a lookup table, like so:

def read_matches(line):
    # each line in the matching file looks like "0001.png 17": image file name, equation index
    img, ind = line.split(' ')
    img = int(img.split('.')[0])   # numeric part of the image file name
    ind = int(ind)                 # index of the matching equation
    return img, ind

with open('training.matching.txt', 'r') as f:
    imgs, inds = [], []
    for line in f.readlines():
        img, ind = read_matches(line)
        imgs.append(img)
        inds.append(ind)

and in the dataset file you need to use that information:

ind = inds[imgs.index(self.indices[i])]   # map the current image back to its equation index
self.data[(width, height)].append((eqs[ind], im))

I've not tested it, so there may be mistakes, but that's the direction you have to go.

aspnetcs commented 2 years ago

Would you like to add tensorboard to your project (lukas-blecher/LaTeX-OCR), in order to visualize the training progress and results?

lukas-blecher commented 2 years ago

I already have a Weights & Biases integration in place, which you can also host locally.

aspnetcs commented 2 years ago

How should I run this project in order to visualize the deep network structure used?

lukas-blecher commented 2 years ago

I don't know. There is nothing in place for that in this project.

aspnetcs commented 2 years ago

@.***:~/LaTeX-OCR-main# python gui.py

qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.

This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, webgl, xcb.

Aborted (core dumped)

@.***:~/LaTeX-OCR-main#

aspnetcs commented 2 years ago

https://www.cnblogs.com/keng333/p/14328144.html

apt-get install libxcb-xinerama0
export QTWEBENGINE_DISABLE_SANDBOX=1
export XDG_RUNTIME_DIR=/usr/lib/
export RUNLEVEL=3

aspnetcs commented 2 years ago

The formula in the attachment cannot be recognized as LaTeX code. How do I add new formula data to the original dataset?

lukas-blecher commented 2 years ago

With 464e4fc you can combine multiple datasets or generate one combined pkl file:

python dataset/dataset.py --equations dataset1/formulas.txt dataset2/formulas.txt --images dataset1/images dataset2/images --tokenizer dataset/tokenizer.json --out combined.pkl

Also, the attachment does not transfer over to GitHub.

aspnetcs commented 2 years ago

On the same dataset (formulas), using different training algorithms in pytorch to train one after another, why is the final BLEU result always 0?

........
BLEU: 0.000, ED: 2.80e+00: 21%|████████████▌ | 80/389 [3:49:17<14:45:37, 171.97s/it]
BLEU: 0.000, ED: 3.26e+00: 21%|████████████▌ | 80/389 [3:31:05<13:35:22, 158.32s/it]
BLEU: 0.000, ED: 3.13e+00: 21%|████████████▌ | 80/389 [3:40:22<14:11:13, 165.29s/it]
BLEU: 0.000, ED: 2.84e+00: 21%|████████████▌ | 80/389 [4:10:24<16:07:12, 187.81s/it]
BLEU: 0.000, ED: 3.19e+00: 21%|
.........

lukas-blecher commented 2 years ago

What are you showing me? That looks like the eval output. And why does it take so long for one iteration? What's your batch size? For large-scale training you basically need a GPU. I don't know what you are doing, but the model is not learning anything.

aspnetcs commented 2 years ago

If there is already a voc.txt, how do I turn it into tokenizer.json?

! " & ' ( ) * + ,

-- . / 0 1 2 3 4 5 6 7 8 9 : ; <

? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ! # \, \/ \: \Big \Bigg \Biggl \Biggr \Bigl \Bigr \Delta \Gamma \Im \L \Lambda \Large \Leftrightarrow \Longleftrightarrow \Longrightarrow \O \Omega \P \Phi \Pi \Psi \Re \Rightarrow \S \Sigma \Theta \Upsilon \Vert \Xi \ _ \acute \aleph \alpha \approx \arccos \arcsin \arctan \arg \ast \atop \b \backslash \bar \begin{array} \begin{cases} \begin{matrix} \begin{picture} \beta \bf \big \bigcap \bigcup \bigg \biggl \biggr \bigl \bigoplus \bigotimes \bigr \bigtriangledown \bigtriangleup \bigwedge \binom \bmod \boldmath \bot \breve \buildrel \bullet \cal \cap \cdot \cdotp \cdots \check \chi \circ \circle \colon \cong \cos \cosh \cot \coth \cup \d \dag \dagger \ddot \ddots \deg \delta \det \diamond \diamondsuit \dim \displaystyle \dot \doteq \dots \downarrow \ell \emptyset \end{array} \end{cases} \end{matrix} \end{picture} \enskip \enspace \epsilon \equiv \eta \exp \fbox \flat \footnotesize \forall \frac \gamma \ge \geq \gg \hat \hbar \hfill \hline \hookrightarrow \hspace \i \imath \in \infty \int \iota \it \jmath \kappa \kern \l \label \lambda \land \langle \large \lbrace \lbrack \ldots \le \left( \left. \left< \left[ \left\langle \left\lbrack \left\vert \left{ \left\ \leftarrow \leftrightarrow \left \leq \lfloor \lim \line \ll \llap \ln \log \longleftrightarrow \longmapsto \longrightarrow \makebox \mapsto \mathbf \mathcal \mathit \mathop \mathrm \mathsf \max \mid \min \mit \mp \mu \nabla \natural \ne \neq \ni \noalign \nonumber \not \nu \o \odot \oint \omega \ominus \oplus \otimes \overbrace \overleftarrow \overline \overrightarrow \parallel \partial \perp \phantom \phi \pi \pm \pounds \prime \prod \propto \protect \psi \put \qquad \quad \raise \raisebox \rangle \rbrace \rbrack \ref \rfloor \rho \right) \right. \right> \right\rangle \right\rbrack \right\vert \right\ \right} \right] \rightarrow \rightharpoonup \right \rlap \sb \scriptscriptstyle \scriptsize \scriptstyle \sec \setlength \sf \sharp \sigma \sim \simeq \sin \sinh \sl \slash \small \smallskip \sp \space \sqrt \stackrel \star \strut \subset \subseteq \sum \sup \supset \tan \tanh \tau \textbf \textrm \textstyle \textup \theta \thinspace \tilde \times \tiny \to \triangle \tt \underbrace \underline \unitlength \uparrow \upsilon \varepsilon \varphi \varpi \varrho \varsigma \vartheta \vdots \vec \vee \vert \vline \vphantom \vspace \wedge \widehat \widetilde \wp \xi \zeta { \ } ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z {

} ~

aspnetcs commented 2 years ago

What is the content of the pkl file? I created a pkl file on one machine and then put it on another machine for training. The training machine only has the pkl file; there are no corresponding images or latex formula texts. Is this correct?

aspnetcs commented 2 years ago

Different tokenizer.json files are generated from different latex formula texts. How do I combine different tokenizer.json files into one tokenizer.json?

lukas-blecher commented 2 years ago

What is the content of the pkl file? I created a pkl file on one machine and then put it on another machine for training. The training machine only has the pkl file; there are no corresponding images or latex formula texts. Is this correct?

The pkl file only contains the relative paths to the images, but it does save the equations. So you will need to recompile the pkl file on each machine, and you need the images. To the other questions I can't give answers; I'm using huggingface tokenizers, so you'll need to look there for more information.
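To see for yourself what a generated file stores, you can load it from the repo root; a short sketch, assuming Im2LatexDataset exposes a load counterpart to the save call used above:

from dataset.dataset import Im2LatexDataset

data = Im2LatexDataset().load('train.pkl')
# entries are grouped into (width, height) buckets, each holding (equation, relative image path) pairs
print(list(data.data.keys())[:5])

As for combining tokenizers: rather than merging json files, the usual route with huggingface tokenizers is to retrain a single tokenizer on all formula corpora at once. A sketch (the file names, vocab size, and special tokens below are assumptions, not this repo's exact settings):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=8000, special_tokens=["[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(["formulas_a.txt", "formulas_b.txt"], trainer)  # train on all corpora together
tokenizer.save("tokenizer.json")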

aspnetcs commented 2 years ago

In your latex-ocr project, how do I use lr_scheduler.CosineAnnealingWarmRestarts in pytorch for learning rate adjustment (https://pytorch.org/docs/stable/optim.html#optimizer-step-closure)? And how do I use lr_scheduler.ChainedScheduler (https://pytorch.org/docs/stable/optim.html#optimizer-step-closure)?
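For reference, wiring such a scheduler into a PyTorch training loop is generic and independent of train.py's internals; a runnable toy sketch (the model and loop here are stand-ins):

import torch
import torch.optim as optim
from torch.optim import lr_scheduler

model = torch.nn.Linear(4, 1)        # toy stand-in for the real model
opt = optim.Adam(model.parameters(), lr=1e-3)
sched = lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10)  # restart every 10 epochs

for epoch in range(30):
    # ... run one training epoch here, calling opt.step() per batch ...
    sched.step()                     # advance the schedule once per epoch
    print(epoch, sched.get_last_lr())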

aspnetcs commented 2 years ago

Can your project provide an API for easy calling?

Like the following: https://github.com/aspnetcs/image-to-latex-main

lukas-blecher commented 2 years ago

I've added a similar API now

aspnetcs commented 2 years ago

A similar API, which includes what, exactly?

Would you like to describe it in detail? Can these interfaces be exposed and then called, similar to mathpix?

lukas-blecher commented 2 years ago

I don't know about the mathpix API. There is an API running, and you can connect to it via a streamlit demo, like in https://github.com/kingyiusuen/image-to-latex. You can find more info in the readme.
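For reference, talking to such a self-hosted API boils down to a single HTTP upload; a hypothetical call (the port and route here are assumptions, check the readme for the actual ones):

curl -X POST -F 'file=@equation.png' http://127.0.0.1:8502/predict/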

aspnetcs commented 2 years ago

It's a miracle and it's so well done!!! Would you like to make a front end, similar to mathpix, where a region of the screen can be captured for formula recognition (see attachments)? How do I extract formulas from a latex file to augment a dataset?

aspnetcs commented 2 years ago

Sorry, kingyiusuen/image-to-latex: Convert images of LaTex math equations into LaTex code. (github.com)

lukas-blecher/LaTeX-OCR: pix2tex: Using a ViT to convert images of equations into LaTeX code. (github.com)

Can you put the above two projects together?

How do I extract formulas from latex to augment a dataset?

lukas-blecher commented 2 years ago

I don't know what you mean by combining the projects. You can extract equations from the latex source by using the script arxiv.py in pix2tex.dataset.arxiv.
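The core idea of such extraction can be sketched with a plain regex over the .tex source (a simplification; the actual arxiv.py script does much more normalization, and paper.tex below is a hypothetical input file):

import re

# grab display math: \begin{equation}...\end{equation} blocks (starred or not) and $$...$$ pairs
EQ_RE = re.compile(r'\\begin\{equation\*?\}(.+?)\\end\{equation\*?\}|\$\$(.+?)\$\$', re.DOTALL)

def extract_equations(tex):
    for m in EQ_RE.finditer(tex):
        eq = (m.group(1) or m.group(2)).strip()
        if eq:
            yield eq

with open('paper.tex') as f:
    for eq in extract_equations(f.read()):
        print(eq)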

aspnetcs commented 2 years ago

Combining the projects:

What I mean is that it would be best for your project (https://github.com/lukas-blecher/LaTeX-OCR) to also provide an API interface, just like this project (https://github.com/kingyiusuen/image-to-latex), which is easy to call.

I just found out that your project has already implemented this function.

You are so great!

Can you make a function that uses a key to call the interface, similar to mathpix or kedaxunfei (https://www.xfyun.cn/doc/words/formula-discern/API.html#%E6%8E%A5%E5%8F%A3%E8%B0%83%E7%94%A8%E6%B5%81%E7%A8%8B)? Like the following:

appid     xxxxxx
apisecret xxxxxxxxxxxxxxxxxxxx
apikey    xxxxxxxxxxxxx

lukas-blecher commented 2 years ago

I don't see a reason to implement this functionality. I am not planning to deploy the API; it is meant as a local, self-hosted interaction point.

If you need this, you will have to implement it yourself.
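For anyone who wants to try: a minimal sketch of key checking with FastAPI (purely illustrative; the route, header name, and key handling are not part of this repo):

from fastapi import FastAPI, File, Header, HTTPException, UploadFile

app = FastAPI()
API_KEY = "change-me"  # in practice, load from an environment variable

@app.post("/predict/")
async def predict(file: UploadFile = File(...), x_api_key: str = Header(None)):
    # reject requests that don't carry the expected X-Api-Key header
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="invalid API key")
    # ... run the model on the uploaded image and return the LaTeX string ...
    return {"latex": "..."}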

aspnetcs commented 2 years ago

It's hard for me to complete this function myself, because I have neither the ability nor the time, alas.
