jgm / pandoc

Universal markup converter
https://pandoc.org

Support for mathpix markdown #9067

Open 6801318d8d opened 1 year ago

6801318d8d commented 1 year ago

Can we support [mathpix-markdown](https://github.com/Mathpix/mathpix-markdown-it)?

It is the Markdown dialect used by Nougat.

L-960 commented 9 months ago

> Can we support [mathpix-markdown](https://github.com/Mathpix/mathpix-markdown-it)?
>
> It is the Markdown dialect used by Nougat.

I have the same question: can we support Mathpix Markdown?

tarleb commented 6 months ago

Could someone describe the main differences from pandoc-flavored Markdown and/or CommonMark?

utensil commented 6 months ago

> Could someone describe the main differences from pandoc-flavored Markdown and/or CommonMark?

Essentially, it combines some LaTeX syntax with variants of Markdown syntax to provide better support for equation numbering and referencing, tables, figure referencing, abstracts, author lists, linkable sections, theorems, proofs, and so on.

The motivation is OCR: to convert images of academic papers faithfully, the output format has to be able to represent everything that appears in a paper. Plain Markdown lacks many of the elements and the precise control needed for that purpose.

This format was originally used by Mathpix. See the Mathpix Markdown Syntax Reference for more info.
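To make the mix of LaTeX and Markdown a bit more concrete, here is a rough Python sketch that flags LaTeX-style constructs inside Markdown-like text. The sample document and the pattern list are made up for illustration and are not taken from the Mathpix reference; which commands Mathpix Markdown actually accepts should be checked against its syntax reference.

```python
import re

# Illustrative patterns only; not an exhaustive or official list of
# what Mathpix Markdown accepts.
LATEX_PATTERNS = {
    "environment": re.compile(r"\\begin\{([A-Za-z*]+)\}"),
    "label": re.compile(r"\\label\{([^}]+)\}"),
    "reference": re.compile(r"\\(?:eq)?ref\{([^}]+)\}"),
}


def find_latex_constructs(text):
    """Return {kind: [captured names]} for LaTeX-style constructs in text."""
    return {kind: pattern.findall(text) for kind, pattern in LATEX_PATTERNS.items()}


# Hypothetical sample: a Markdown heading followed by a numbered,
# labelled equation and a reference to it.
sample = r"""
# Result

\begin{equation}
\label{eq:energy}
E = mc^2
\end{equation}

See Equation \eqref{eq:energy}.
"""

constructs = find_latex_constructs(sample)
print(constructs)
```

A CommonMark parser has no notion of the label and the reference here, so they would not survive as a cross-reference; that is the kind of gap described above.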

utensil commented 6 months ago

The workaround is to use the mpx CLI to convert mmd to tex and then run pandoc on the result. The reverse direction has no such path, short of going through tex -> pdf -OCR-> mmd, which is lossy.

But I personally believe that mmd qualifies as one of the document formats that pandoc could support natively.
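For anyone who wants to script the mmd -> tex -> pandoc workaround, a minimal sketch follows. The `mpx convert` prefix is an assumption about the mpx CLI and should be checked against its documentation; the file names are hypothetical. The pandoc call uses only `-f` and `-o`.

```python
import subprocess


def build_commands(mmd_path, tex_path, out_path, mmd_to_tex_cmd):
    """Build the two commands of the mmd -> tex -> pandoc workaround.

    mmd_to_tex_cmd is a caller-supplied command prefix (an assumption,
    not a verified CLI signature); input and output paths are appended.
    """
    first = list(mmd_to_tex_cmd) + [mmd_path, tex_path]
    # pandoc infers the output format from the extension of out_path.
    second = ["pandoc", tex_path, "-f", "latex", "-o", out_path]
    return [first, second]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical file names; "mpx convert" is assumed, see the note above.
commands = build_commands("paper.mmd", "paper.tex", "paper.docx", ["mpx", "convert"])
for cmd in commands:
    print(" ".join(cmd))
# Call run_pipeline(commands) once both tools are installed.
```

Keeping command construction separate from execution makes it easy to swap in whatever mmd-to-tex converter is actually available.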

LymanY commented 2 weeks ago

same question

utensil commented 2 weeks ago

Relevant news: a dataset called arxiver, containing 138,830 arXiv papers, has been created via Nougat in this format.