opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
12.25k stars, 911 forks

Arabic support: fix Arabic text extracted by OCR from PDF #677

Open abdelkrimkr opened 4 days ago

abdelkrimkr commented 4 days ago

Description of the bug

Support Arabic.

How to reproduce the bug

The letters are always extracted as English characters, or the Arabic text is not recognized and is cut out as an image. The Arabic writing is also reversed. The best solution would be to add Arabic OCR support and fix these errors.

Operating system

Linux

Python version

3.9

Software version (magic-pdf --version)

0.8.x

Device mode

cpu

myhloli commented 4 days ago

We will support this in the next version.

AlhathloulMaha commented 1 day ago

How long will this take?