opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.23k stars 1.31k forks

Error related to script #987

Closed Akshaybhure111 closed 6 days ago

Akshaybhure111 commented 6 days ago

Description of the bug

As per your suggestion, I made changes, but I am now getting a new error. The updated script is below:

    import os

    from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe

    # Arguments
    model_list = []
    pdf_file_name = "/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

    # Prepare environment
    output_image_path, output_path = "output/images", "output"
    os.makedirs(output_image_path, exist_ok=True)

    # Initialize data readers and writers
    image_writer = DiskReaderWriter(output_image_path)
    md_writer = DiskReaderWriter(output_path)

    # Read PDF content
    reader_writer = DiskReaderWriter("")
    pdf_bytes = reader_writer.read(pdf_file_name)

    # Initialize and process the OCR pipeline
    pipe = OCRPipe(pdf_bytes, model_list, image_writer)

    pipe.pipe_classify()
    pipe.pipe_analyze()

    # The pipe_parse stage
    pipe.pipe_parse()

    # Extract parsed information
    pdf_info = pipe.pdf_mid_data["pdf_info"]

    # Generate markdown content
    md_content = pipe.pipe_mk_markdown(
        output_image_path, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
    )

    # Write markdown content to file
    if isinstance(md_content, list):
        md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
    else:
        md_writer.write_string(f"{pdf_file_name}.md", md_content)

My original script, which follows the documentation provided for MinerU, is below:

    import os

    from magic_pdf.data.data_reader_writer.filebase import FileBasedDataReader, FileBasedDataWriter
    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe

    # Arguments
    model_list = []
    pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

    # Prepare environment
    local_image_dir, local_md_dir = "output/images", "output"
    os.makedirs(local_image_dir, exist_ok=True)

    # Initialize data writers
    image_writer = FileBasedDataWriter(local_image_dir)
    md_writer = FileBasedDataWriter(local_md_dir)

    # Read PDF content
    reader1 = FileBasedDataReader("")
    pdf_bytes = reader1.read(pdf_file_name)

    # Initialize and process the OCR pipeline
    pipe = OCRPipe(pdf_bytes, model_list, image_writer)

    pipe.pipe_classify()
    pipe.pipe_analyze()

    # The pipe_parse stage, now fixed
    pipe.pipe_parse()

    # Extract parsed information
    pdf_info = pipe.pdf_mid_data["pdf_info"]

    # Generate markdown content
    md_content = pipe.pipe_mk_markdown(
        image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
    )

    # Write markdown content to file
    if isinstance(md_content, list):
        md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
    else:
        md_writer.write_string(f"{pdf_file_name}.md", md_content)

But, as I mentioned, it still produces the error from my previous bug report. Could you please tell me what changes are needed in the code above, and provide a complete script that I can run successfully?

How to reproduce the bug

(Same as the description above.)

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli commented 6 days ago

Why not use this script?

https://github.com/opendatalab/MinerU/blob/master/demo/magic_pdf_parse_main.py

Akshaybhure111 commented 6 days ago

Actually, I want to use it for a different purpose in my own code. Since this script is given in the official MinerU documentation, it should work, so I need a resolution. The demo you mention requires keeping a lot of files, and I don't want that much. I just want to install the modules and run this script:

    import os

    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe

    # args
    model_list = []
    pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # replace with the real pdf path

    # prepare env
    local_image_dir, local_md_dir = "output/images", "output"
    os.makedirs(local_image_dir, exist_ok=True)

    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
        local_md_dir
    )  # create 00
    image_dir = str(os.path.basename(local_image_dir))

    reader1 = FileBasedDataReader("")
    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

    pipe = OCRPipe(pdf_bytes, model_list, image_writer)

    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()

    pdf_info = pipe.pdf_mid_data["pdf_info"]

    md_content = pipe.pipe_mk_markdown(
        image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
    )

    if isinstance(md_content, list):
        md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
    else:
        md_writer.write_string(f"{pdf_file_name}.md", md_content)

Please resolve my issue with the above script. It fails with:

    TypeError                                 Traceback (most recent call last)
    in <cell line: 31>()
         29 pipe.pipe_classify()
         30 pipe.pipe_analyze()
    ---> 31 pipe.pipe_parse()
         32
         33 pdf_info = pipe.pdf_mid_data["pdf_info"]

    6 frames
    /usr/local/lib/python3.10/dist-packages/magic_pdf/libs/pdf_image_tools.py in cut_image(bbox, page_num, page, return_path, imageWriter)
         29     byte_data = pix.tobytes(output='jpeg', jpg_quality=95)
         30
    ---> 31     imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
         32
         33     return img_hash256_path

    TypeError: FileBasedDataWriter.write() takes 3 positional arguments but 4 were given

Once you run this script, you will get the idea.
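
Note on the traceback: the failing call passes three arguments (content, path, mode) to a write() method that accepts only two besides self. The mismatch can be reproduced without MinerU installed. The sketch below is only an illustration and is not MinerU code: TwoArgWriter, MODE_BIN, and legacy_cut_image are hypothetical stand-ins, and the (path, data) parameter order is an assumption inferred from the error message.

```python
class TwoArgWriter:
    """Hypothetical stand-in for a writer whose write() takes (path, data)."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        # Two arguments besides self, i.e. three positional arguments in total.
        self.files[path] = data


MODE_BIN = "binary"  # hypothetical stand-in for AbsReaderWriter.MODE_BIN


def legacy_cut_image(image_writer):
    """Mimic the call shape shown in the traceback: (content, path, mode)."""
    image_writer.write(b"jpeg-bytes", "abc123.jpg", MODE_BIN)


def reproduce():
    """Return the TypeError message raised by the mismatched call, if any."""
    try:
        legacy_cut_image(TwoArgWriter())
    except TypeError as exc:
        return str(exc)
    return None


message = reproduce()
print(message)
```

The printed message ends with "takes 3 positional arguments but 4 were given", the same wording as in the report. This suggests the writer class and the caller in pdf_image_tools.py follow two different interfaces.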

myhloli commented 6 days ago

Please read the code in the demo directory and modify it. You can control which files are output.

Akshaybhure111 commented 6 days ago

I have followed all the instructions, but I am facing the same issue. As you can see at https://mineru.readthedocs.io/en/latest/user_guide/quick_start/to_markdown.html, I am running the same script as in the documentation. I read the code and the documentation carefully, but I get the same error:

    TypeError                                 Traceback (most recent call last)
    in <cell line: 31>()
         29 pipe.pipe_classify()
         30 pipe.pipe_analyze()
    ---> 31 pipe.pipe_parse()
         32
         33 pdf_info = pipe.pdf_mid_data["pdf_info"]

    6 frames
    /usr/local/lib/python3.10/dist-packages/magic_pdf/libs/pdf_image_tools.py in cut_image(bbox, page_num, page, return_path, imageWriter)
         29     byte_data = pix.tobytes(output='jpeg', jpg_quality=95)
         30
    ---> 31     imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
         32
         33     return img_hash256_path

    TypeError: FileBasedDataWriter.write() takes 3 positional arguments but 4 were given

I don't want control over the output. My code is failing because of the above error, and I want to resolve it. You can check this error in your repository. It is related to image writing, and that is what I want resolved.
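
Note: one generic way to bridge this kind of interface mismatch is a small adapter that accepts the legacy (content, path, mode) call and forwards it as (path, data). The sketch below is a hypothetical illustration, not a fix confirmed by the maintainers. LegacyCallAdapter and InMemoryWriter are invented names, the (path, data) order of the wrapped writer is an assumption inferred from the error message, and whether OCRPipe works with such a wrapper has not been verified here.

```python
class InMemoryWriter:
    """Hypothetical stand-in for a writer exposing write(path, data)."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data


class LegacyCallAdapter:
    """Accept the legacy (content, path, mode) call and forward (path, data)."""

    def __init__(self, writer):
        self._writer = writer

    def write(self, content, path, mode=None):
        # The mode argument is accepted only for call compatibility and is
        # ignored; text content is encoded so the wrapped writer gets bytes.
        if isinstance(content, str):
            content = content.encode("utf-8")
        self._writer.write(path, content)


sink = InMemoryWriter()
adapter = LegacyCallAdapter(sink)

# Same call shape as in the traceback: content first, then path, then mode.
adapter.write(b"jpeg-bytes", "abc123.jpg", "binary")
print(sink.files)
```

The adapter only reorders arguments. It does not add any other methods the pipeline might call on the writer, so it should be read as a description of the mismatch rather than a drop-in replacement.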

myhloli commented 6 days ago

Don't rely on the docs at https://mineru.readthedocs.io/; they are not ready yet.

Akshaybhure111 commented 6 days ago

OK, thanks. So can I take it that the tool is not yet ready to be run the way I am trying to run it? If so, please consider fixing this issue in the future.

myhloli commented 6 days ago

> OK, thanks. So can I take it that the tool is not yet ready to be run the way I am trying to run it? If so, please consider fixing this issue in the future.

We plan to launch a dedicated documentation website called “Next-docs” in the future, but the new documentation is not yet ready. For now, we recommend reading the README on GitHub for a quick overview of the project and trying out the example code in the demo directory to implement advanced features. Of course, using the command-line tools or the online demo are also convenient options. If you are familiar with notebooks, you can also explore our project further in Colab.

icecraft commented 5 days ago

@Akshaybhure111 Sorry for the trouble; this issue will be fixed in the next release.

Akshaybhure111 commented 4 days ago

Thank you. I really appreciate your work and effort.