opendatalab / MinerU

A high-quality, one-stop open-source data extraction tool for converting PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.63k stars, 1.33k forks

argument expect 3 but 4 given #980

Closed Akshaybhure111 closed 1 week ago

Akshaybhure111 commented 1 week ago

Description of the bug

Main Code

import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# args
model_list = []
pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # replace with the real pdf path

# prepare env
local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)

image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)  # create 00
image_dir = str(os.path.basename(local_image_dir))

reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()

pdf_info = pipe.pdf_mid_data["pdf_info"]

md_content = pipe.pipe_mk_markdown(image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD)

if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)

When I run the script above, I get the following error:

6 frames
/usr/local/lib/python3.10/dist-packages/magic_pdf/libs/pdf_image_tools.py in cut_image(bbox, page_num, page, return_path, imageWriter)
     29     byte_data = pix.tobytes(output='jpeg', jpg_quality=95)
     30
---> 31     imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
     32
     33     return img_hash256_path

TypeError: FileBasedDataWriter.write() takes 3 positional arguments but 4 were given

I checked your code: FileBasedDataWriter.write() accepts two arguments, but line 31 of pdf_image_tools.py calls imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN), so AbsReaderWriter.MODE_BIN is reported as an extra argument. Please check and resolve this.
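The same TypeError can be reproduced without MinerU, which may help show where the "3 but 4" count comes from. The sketch below is only an illustration: `TwoArgWriter` and `call_like_cut_image` are hypothetical stand-ins that mimic the call shape visible in the traceback, not code from magic_pdf. Python counts `self` as the first positional argument, so a method declared as `write(self, path, data)` "takes 3", and a call that passes three values "gives 4".

```python
class TwoArgWriter:
    """Hypothetical stand-in for a writer whose write() takes (path, data)."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        # Two explicit parameters plus the implicit self: three positional arguments.
        self.files[path] = data


def call_like_cut_image(writer):
    """Mimic the call shape in the traceback: three values passed after self."""
    writer.write(b"jpeg-bytes", "image.jpg", "binary")


def reproduce():
    """Return the TypeError message raised by the mismatched call, or None."""
    try:
        call_like_cut_image(TwoArgWriter())
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce())
```

The message returned by `reproduce()` ends with "takes 3 positional arguments but 4 were given", the same wording as in the report: the caller and the writer object disagree about the `write()` interface.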

How to reproduce the bug

Run the script given in the bug description above; it fails with the same traceback and TypeError.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli commented 1 week ago

You should use

from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
image_writer, md_writer = DiskReaderWriter(output_image_path), DiskReaderWriter(output_path)

to init image_writer and md_writer
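One way to tell in advance whether a writer object matches the call made on line 31 of the traceback is to bind the arguments against the method's signature without calling it. The following is a sketch under stated assumptions: `LegacyStyleWriter` and `NewStyleWriter` are hypothetical stand-ins whose parameter lists are modeled on the traceback and the error message, not copied from magic_pdf, and `accepts_call` is an illustrative helper name.

```python
import inspect


class LegacyStyleWriter:
    """Hypothetical: write(content, path, mode), the shape used by the traceback's call."""

    def write(self, content, path, mode="text"):
        return (path, mode, len(content))


class NewStyleWriter:
    """Hypothetical: write(path, data), two arguments after self."""

    def write(self, path, data):
        return (path, len(data))


def accepts_call(func, *args, **kwargs):
    """Return True if func could be called with these arguments, without calling it."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# The three-argument call shape from line 31 of the traceback.
call_args = (b"jpeg-bytes", "image.jpg", "binary")

legacy_ok = accepts_call(LegacyStyleWriter().write, *call_args)
new_ok = accepts_call(NewStyleWriter().write, *call_args)
print(legacy_ok, new_ok)
```

The same check can be pointed at a real writer instance (for example `accepts_call(image_writer.write, b"", "x.jpg", "mode")`) before starting a long OCR run, so that an interface mismatch shows up immediately rather than several frames deep in the pipeline.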

Akshaybhure111 commented 1 week ago

As per your suggestion, I made the changes, but now I get a new error.

Below is the updated script:

import os
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# Arguments
model_list = []
pdf_file_name = "/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

# Prepare environment
output_image_path, output_path = "output/images", "output"
os.makedirs(output_image_path, exist_ok=True)

# Initialize data readers and writers
image_writer = DiskReaderWriter(output_image_path)
md_writer = DiskReaderWriter(output_path)

# Read PDF content
reader_writer = DiskReaderWriter("")
pdf_bytes = reader_writer.read(pdf_file_name)

# Initialize and process the OCR pipeline
pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()

# The pipe_parse stage
pipe.pipe_parse()

# Extract parsed information
pdf_info = pipe.pdf_mid_data["pdf_info"]

# Generate markdown content
md_content = pipe.pipe_mk_markdown(output_image_path, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD)

# Write markdown content to file
if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)

My original script, written following the MinerU documentation, is below:

import os
from magic_pdf.data.data_reader_writer.filebase import FileBasedDataReader, FileBasedDataWriter
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# Arguments
model_list = []
pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

# Prepare environment
local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)

# Initialize data writers
image_writer = FileBasedDataWriter(local_image_dir)
md_writer = FileBasedDataWriter(local_md_dir)

# Read PDF content
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)

# Initialize and process the OCR pipeline
pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()

# The pipe_parse stage, now fixed
pipe.pipe_parse()

# Extract parsed information
pdf_info = pipe.pdf_mid_data["pdf_info"]

# Generate markdown content
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD)

# Write markdown content to file
if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)

But, as I mentioned, this script still fails with the error from my original report.

Could you please tell me what changes my code above needs, and provide a complete script that I can run successfully?