Akshaybhure111 commented 1 week ago

Description of the bug | 错误描述

Main Code

import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# args
model_list = []
pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # replace with the real pdf path

# prepare env
local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)

image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
    local_md_dir
)  # create 00
image_dir = str(os.path.basename(local_image_dir))

reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()

pdf_info = pipe.pdf_mid_data["pdf_info"]

md_content = pipe.pipe_mk_markdown(
    image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)

if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)

When I run the script above, I get the following error:

6 frames
/usr/local/lib/python3.10/dist-packages/magic_pdf/libs/pdf_image_tools.py in cut_image(bbox, page_num, page, return_path, imageWriter)
     29     byte_data = pix.tobytes(output='jpeg', jpg_quality=95)
     30
---> 31     imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
     32
     33     return img_hash256_path

TypeError: FileBasedDataWriter.write() takes 3 positional arguments but 4 were given

I checked the method: FileBasedDataWriter.write() accepts only two arguments, but line 31 calls imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN), so AbsReaderWriter.MODE_BIN is reported as an extra argument. Please check and resolve this.
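The mismatch can be reproduced without magic_pdf installed. The sketch below is only an illustration: `TwoArgWriter`, `cut_image_like`, and the `MODE_BIN` placeholder are hypothetical stand-ins, not the library's real classes. It assumes the writer's `write()` takes `(path, data)` while the caller passes `(content, path, mode)`, as the traceback suggests.

```python
# Hypothetical stand-ins; neither class is the real magic_pdf implementation.
class TwoArgWriter:
    """Mimics a writer whose write() accepts only (path, data)."""

    def write(self, path, data):
        return (path, data)


MODE_BIN = "binary"  # placeholder for AbsReaderWriter.MODE_BIN


def cut_image_like(writer):
    """Calls write() the way line 31 of the traceback does: (content, path, mode)."""
    return writer.write(b"jpeg-bytes", "images/abc.jpg", MODE_BIN)


try:
    cut_image_like(TwoArgWriter())
    error_message = None
except TypeError as exc:
    # self + 3 arguments = 4 given, but write() only declares self + 2 = 3
    error_message = str(exc)

print(error_message)
```

The count in the message includes `self`, which is why a two-parameter method reports "3 positional arguments" and a three-argument call reports "4 were given".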

How to reproduce the bug | 如何复现

Run the script from the bug description above as-is; it fails with the same TypeError.

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.9.x

Device mode | 设备模式

cuda

myhloli commented 1 week ago

You should use

from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
image_writer, md_writer = DiskReaderWriter(output_image_path), DiskReaderWriter(output_path)

to init image_writer and md_writer
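As a rough sketch of why a writer with the older interface avoids the error: the call in the traceback passes `(content, path, mode)`, so any writer exposing that signature accepts it. `LegacyStyleWriter` below is a hypothetical stand-in built on the standard library only; it is not the real `DiskReaderWriter`, whose exact behavior is not shown in this thread.

```python
import os
import tempfile


class LegacyStyleWriter:
    """Hypothetical stand-in with a (content, path, mode) write signature."""

    MODE_TXT = "text"
    MODE_BIN = "binary"

    def __init__(self, parent_dir):
        self.parent_dir = parent_dir

    def write(self, content, path, mode=MODE_TXT):
        # Resolve the target path relative to the writer's base directory.
        full_path = os.path.join(self.parent_dir, path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        if mode == self.MODE_BIN:
            with open(full_path, "wb") as f:
                f.write(content)
        else:
            with open(full_path, "w", encoding="utf-8") as f:
                f.write(content)
        return full_path


out_dir = tempfile.mkdtemp()
writer = LegacyStyleWriter(out_dir)

# Same argument order as the call on line 31 of the traceback.
saved = writer.write(b"jpeg-bytes", "images/abc.jpg", LegacyStyleWriter.MODE_BIN)

with open(saved, "rb") as f:
    roundtrip = f.read()
```

Here the three-argument call succeeds because the method declares a `mode` parameter; the bytes read back from `saved` match what was written.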

Akshaybhure111 commented 1 week ago

As per your suggestion, I made the changes, but now I get a new error.

The updated script is below:

import os

from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# Arguments
model_list = []
pdf_file_name = "/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

# Prepare environment
output_image_path, output_path = "output/images", "output"
os.makedirs(output_image_path, exist_ok=True)

# Initialize data readers and writers
image_writer = DiskReaderWriter(output_image_path)
md_writer = DiskReaderWriter(output_path)

# Read PDF content
reader_writer = DiskReaderWriter("")
pdf_bytes = reader_writer.read(pdf_file_name)

# Initialize and process the OCR pipeline
pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()

# The pipe_parse stage
pipe.pipe_parse()

# Extract parsed information
pdf_info = pipe.pdf_mid_data["pdf_info"]

# Generate markdown content
md_content = pipe.pipe_mk_markdown(
    output_image_path, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)

# Write markdown content to file
if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)

My original script, written following your MinerU documentation, is below:

import os

from magic_pdf.data.data_reader_writer.filebase import FileBasedDataReader, FileBasedDataWriter
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# Arguments
model_list = []
pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

# Prepare environment
local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)

# Initialize data writers
image_writer = FileBasedDataWriter(local_image_dir)
md_writer = FileBasedDataWriter(local_md_dir)

# Read PDF content
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)

# Initialize and process the OCR pipeline
pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()

# The pipe_parse stage, now fixed
pipe.pipe_parse()

# Extract parsed information
pdf_info = pipe.pdf_mid_data["pdf_info"]

# Generate markdown content
md_content = pipe.pipe_mk_markdown(
    image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)

# Write markdown content to file
if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)

But with this script I still get the error I reported earlier.

Could you please tell me what changes are needed in the code above, and provide the complete script so that I can run it successfully?

opendatalab / MinerU

argument expect 3 but 4 given #980
