harvardnlp / im2markup

Neural model for converting Image-to-Markup (by Yuntian Deng yuntiandeng.com)
https://im2markup.yuntiandeng.com
MIT License

Found some data label inconsistencies #23

Open Zhang-O opened 5 years ago

Zhang-O commented 5 years ago

In im2latex_train.lst, the entry "51238 1a00a76d4e basic" points to line 51238 of im2latex_formulas.lst, but the LaTeX around that line is not the content of image 1a00a76d4e. 1a00a76d4e should point to line 51729 in im2latex_formulas.lst. I have found several cases like this, but I am not sure how many there are. I downloaded the data from https://zenodo.org/record/56198#.XZ7yK_n_yHt. Is something wrong?

Miffyli commented 5 years ago

Hey, did you open the files correctly? See this quote from the Zenodo webpage:

Newlines used in formulas_im2latex.lst are UNIX-style newlines (\n). Reading the file with another type of newline results in a slightly wrong number of lines (104563 instead of 103558), and thus breaks the structure used by this dataset. Python 3.x reads files using newlines of the running system by default, and to avoid this the file must be opened with newline="\n" (e.g. open("formulas_im2latex.lst", newline="\n")).
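To make the quoted advice concrete, here is a minimal sketch of how the line count can change with the `newline` argument of `open()`. It uses a tiny synthetic file rather than the real dataset, and it assumes (this is not confirmed in the thread) that the extra lines come from stray `\r` characters inside some formulas. The helper name `count_lines` is made up for illustration.

```python
import os
import tempfile


def count_lines(path, newline):
    """Count the lines of `path` as seen by open() with the given newline mode."""
    with open(path, encoding="ISO-8859-1", newline=newline) as f:
        return sum(1 for _ in f)


# Synthetic stand-in for the formulas file: three records separated by "\n",
# with a stray "\r" inside the second record (an assumption about the real data).
sample = b"a + b\nx \r y\n\\frac{1}{2}\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".lst") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# newline=None (the default) enables universal newlines: "\r" also ends a line.
universal_count = count_lines(sample_path, newline=None)
# newline="\n" ends a line only at "\n", so the stray "\r" stays inside its record.
unix_count = count_lines(sample_path, newline="\n")
os.remove(sample_path)

print(universal_count, unix_count)
```

On the synthetic file the default mode sees one more line than `newline="\n"`, which is the kind of off-by-some shift that would make later formula indices point at the wrong record.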

Zhang-O commented 5 years ago

Sorry to take up your time. I read the web page again and checked what you said. I found that formulas_im2latex.lst has 104564 lines. I opened it in Notepad++ with the line ending set to \n. What is wrong? Thanks very much.

Zhang-O commented 5 years ago

f = open("./im2latex_formulas.lst", encoding="ISO-8859-1", newline="\n")
len(f.readlines()) = 103359

When I open the file with Notepad++, changing the encoding does not change the number of lines. It took me almost an hour to work this out. Thanks again.
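When an editor and Python disagree about the line count, one way to investigate is to count the different line-break byte sequences directly. The following is only a sketch on synthetic bytes; the function name `count_line_breaks` is hypothetical, and nothing here is measured on the real im2latex_formulas.lst.

```python
def count_line_breaks(raw):
    """Count CRLF pairs, lone CR bytes, and lone LF bytes in a bytes object."""
    crlf = raw.count(b"\r\n")
    lone_cr = raw.count(b"\r") - crlf  # CR bytes not followed by LF
    lone_lf = raw.count(b"\n") - crlf  # LF bytes not preceded by CR
    return {"crlf": crlf, "lone_cr": lone_cr, "lone_lf": lone_lf}


# Synthetic example mixing all three kinds of line break.
sample = b"a + b\nx \r y\r\nE = mc^2\n"
breaks = count_line_breaks(sample)
print(breaks)
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`. A non-zero `lone_cr` or `crlf` count would suggest that tools which treat `\r` as a line break will report more lines than a reader that splits on `\n` only.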

Miffyli commented 5 years ago

Hmm, that is peculiar: I downloaded im2latex_formulas.lst from Zenodo and ran the following (Windows 10, Python 3.6):

f = open("./im2latex_formulas.lst", newline="\n")
len(f.readlines())
Out[11]: 103559

f = open("./im2latex_formulas.lst", encoding="ISO-8859-1",newline="\n")
len(f.readlines())
Out[13]: 103559

I do not think changing the encoding helps; the issue is that newlines are handled differently on different OSes.
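For anyone checking individual labels, here is a rough sketch of resolving a list entry such as "51238 1a00a76d4e basic" to its formula. It assumes the three fields are formula index, image name, and render type, and that the index is a 0-based position among records separated by `\n`; both are assumptions to verify against the dataset description. The helper names and the sample data are made up.

```python
def parse_entry(line):
    """Split an entry like '51238 1a00a76d4e basic' into (index, image, render type)."""
    formula_idx, image_name, render_type = line.split()
    return int(formula_idx), image_name, render_type


def load_formulas(raw_bytes, encoding="ISO-8859-1"):
    """Split on '\n' only, so any stray '\r' stays inside its record."""
    return raw_bytes.decode(encoding).split("\n")


# Tiny synthetic stand-ins for the formulas file and one list entry.
formulas_raw = b"a + b\nx \r y\nE = mc^2\n"
entries = ["2 deadbeef00 basic"]

formulas = load_formulas(formulas_raw)
idx, image_name, _ = parse_entry(entries[0])
lookup = {image_name: formulas[idx]}  # assumes a 0-based formula index
print(lookup)
```

Splitting the decoded text on `"\n"` explicitly avoids `str.splitlines()`, which also breaks on `\r` and several other characters and could therefore shift the indices.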

kim-yhow commented 5 years ago

In im2latex_train.lst, the entry "51238 1a00a76d4e basic" points to line 51238 of im2latex_formulas.lst, but the LaTeX around that line is not the content of image 1a00a76d4e. 1a00a76d4e should point to line 51729 in im2latex_formulas.lst. I have found several cases like this, but I am not sure how many there are. I downloaded the data from https://zenodo.org/record/56198#.XZ7yK_n_yHt. Is something wrong?

Excuse me, I am also interested in this project. Are you still working on formula recognition? Have you successfully reproduced the EM results reported in the paper?