opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction
https://pdf-extract-kit.readthedocs.io/zh-cn/latest/index.html
GNU Affero General Public License v3.0
5.39k stars 360 forks

Use latest UniMERNet model for Formula Recognition #130

Open ChenZiHong-Gavin opened 1 month ago

ChenZiHong-Gavin commented 1 month ago

I noticed that the UniMERNet model weights provided here are not the latest version. The latest weights are larger but produce better formula recognition results. This seems relevant because I have also seen many complaints about inaccurate formula recognition in the issues.

For example:

Latest UniMERNet version:
Hugging Face: https://huggingface.co/wanderkid/unimernet/tree/main
ModelScope: https://www.modelscope.cn/models/wanderkid/UniMERNet/files

Test image: (image)

The former model produces a confusing result: (image)

The latest model outputs the correct answer: (image)

ChenZiHong-Gavin commented 1 month ago

I noticed you have updated unimernet_base | unimernet_small | unimernet_tiny. However, the UniMERNet version is still not the latest.

Your pytorch_model.bin is 3.75 GB: (image)

In https://huggingface.co/wanderkid/unimernet/tree/main, the pytorch_model.bin is 4.91 GB: (image)
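For anyone who wants to check which checkpoint they have locally, here is a minimal sketch of the size comparison described above. The helper names (`file_size_gb`, `looks_like_size`, `sha256_of`) and the tolerance are hypothetical, and it assumes the sizes quoted in this thread are decimal gigabytes (1 GB = 10^9 bytes). File size is only a rough hint; comparing a checksum against one published by the model host, if available, is more reliable.

```python
import hashlib
import os


def file_size_gb(path: str) -> float:
    """Return the file size in decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return os.path.getsize(path) / 1e9


def looks_like_size(path: str, expected_gb: float, tolerance_gb: float = 0.05) -> bool:
    """Rough heuristic: is the file within tolerance_gb of the expected size?

    The default tolerance is an arbitrary choice for illustration.
    """
    return abs(file_size_gb(path) - expected_gb) <= tolerance_gb


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With a local checkpoint path of your own (hypothetical here), `looks_like_size("pytorch_model.bin", 4.91)` would suggest the larger weights reported in this thread, while `looks_like_size("pytorch_model.bin", 3.75)` would suggest the smaller ones.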