Efficient processing OPs for scanned images and PDFs

yxdyc commented 1 month ago

Search before continuing

[X] I have searched the Data-Juicer issues and found no similar feature requests.

Description

A large amount of valuable data exists only as scanned images and PDFs. Let's use this thread to discuss and list the related processing operators (OPs) to add to Data-Juicer (DJ). Pioneering work in this area includes MAP-NEO and PDF-Extract-Kit.

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

[X] Yes, I'd like to help by submitting a PR!

github-actions[bot] commented 3 weeks ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] commented 3 weeks ago

Close this stale issue.

yxdyc commented 1 week ago

WIP by @Qirui-jiao @HYLcool @yxdyc
