lukas-blecher / LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.
https://lukas-blecher.github.io/LaTeX-OCR/
MIT License
12.95k stars 1.05k forks

How can I generate training data? #218

Open Rahul99887trt opened 2 years ago

Rahul99887trt commented 2 years ago

As you've mentioned, the command to generate a dataset looks like this: "python -m pix2tex.dataset.dataset --equations path_to_textfile --images path_to_images --out dataset.pkl". As I understand it, the dataset class splits each image filename at "." and takes the first part as the line number in the equations file, then uses that line as the ground truth for the image (correct me if I'm wrong). My question is how I can generate my own dataset to train the model. Is it necessary to have leading zeros (the "000" prefix) in the image filenames? I ask because your Google Drive data is a bit confusing. It contains:

  1. Train Images -> 158480 files
  2. Valid Images -> 6780 files
  3. Test Images -> 30637 files

That is 195882 files in total, but your ground truth data contains more than that: I found 234484 lines in the math.txt file. Can you please explain how I can align my own data to generate the dataset.pkl file? In other words, how should I name the image files, and how should I write my math.txt file?

KoushikMSD commented 2 years ago

Align your data so that LineNumber = imageName + 1. For example, if the equation is on line 3 of the equations file, the image should be named 2.
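
To make the naming rule concrete, here is a minimal Python sketch. It is not taken from the pix2tex source: the helper names `image_stem_for_line` and `equation_for_image` are hypothetical, and it assumes (as described in the question) that the loader takes the part of the filename before the first "." and uses it as a zero-based index into the lines of the equations file. If that part is parsed as an integer, leading zeros should not change the result, and lines with no matching image would simply never be looked up, which may be why math.txt can have more lines than there are images.

```python
from pathlib import Path
from typing import Sequence


def image_stem_for_line(line_number: int, pad: int = 0) -> str:
    """Return the image file stem for a 1-based line number in the equations file.

    Hypothetical helper: implements LineNumber = imageName + 1.
    `pad` optionally left-pads the stem with zeros (e.g. 7 -> "0000002").
    """
    if line_number < 1:
        raise ValueError("line numbers start at 1")
    return str(line_number - 1).zfill(pad)


def equation_for_image(image_name: str, equations: Sequence[str]) -> str:
    """Look up the ground-truth equation for an image file name.

    Assumption: the text before the first "." is an integer that is a
    zero-based index into the equations list.
    """
    index = int(Path(image_name).name.split(".")[0])
    return equations[index]


# Toy data standing in for the lines of the equations file.
equations = ["a+b", "x^2", "\\frac{1}{2}"]

stem = image_stem_for_line(3)                   # equation on line 3
padded_stem = image_stem_for_line(3, pad=7)     # same image, zero-padded
match = equation_for_image(stem + ".png", equations)
padded_match = equation_for_image(padded_stem + ".png", equations)
```

Here the equation on line 3 maps to an image named 2 (or 0000002 with padding), and both names resolve back to the third equation. The ".png" extension is only for illustration; use whatever image format your data actually has.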