opendatalab / MinerU

A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.72k stars 1.33k forks

Detection of Umlaut / vowel mutation in German OCR #1073

Open myjob opened 3 days ago

myjob commented 3 days ago

Description of the bug

As reported in issue #708, detection of umlauts / vowel mutations (äöüÄÜÖ) in German OCR isn't working well. French accents (éèÀ) are also not identified reliably. See the attachments: miner-u-lang-euro-ocr-test_origin.pdf, miner-u-lang-euro-ocr-test.md

How to reproduce the bug

magic-pdf -p miner-u-lang-euro-test.pdf -o ./out -m ocr -l german

or

magic-pdf -p miner-u-lang-euro-test.pdf -o ./out -m ocr -l french
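To put a number on how many accented characters get lost, something like the following could be run on the generated Markdown against a hand-typed reference text. This is only a minimal sketch: the helper names `diacritic_counts` and `missing_diacritics` are made up for illustration and are not part of magic-pdf, and the character set is just the one listed in this report.

```python
import unicodedata
from collections import Counter

# Characters named in this report (German umlauts and French accents).
# Assumption: extend this set as needed for other languages.
DIACRITICS = "äöüÄÖÜéèÀ"


def diacritic_counts(text: str) -> Counter:
    """Count the tracked accented characters in text."""
    # Normalize to NFC so "a" + combining diaeresis is counted as "ä".
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if ch in DIACRITICS)


def missing_diacritics(expected: str, ocr_output: str) -> dict:
    """Return, per character, how many occurrences the OCR output lacks."""
    want = diacritic_counts(expected)
    got = diacritic_counts(ocr_output)
    return {ch: want[ch] - got[ch] for ch in want if want[ch] > got[ch]}


if __name__ == "__main__":
    reference = "Über die Größe der Bäume"
    ocr_text = "Uber die Große der Baume"  # hypothetical OCR result
    print(missing_diacritics(reference, ocr_text))
```

This only compares character counts, so it will not notice an accent that moved to the wrong word, but it is enough to show whether `-l german` or `-l french` changes anything.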

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cpu
