Closed Akshaybhure111 closed 6 days ago
Why not use this script?
https://github.com/opendatalab/MinerU/blob/master/demo/magic_pdf_parse_main.py
Actually, I want to use it for a different purpose in my code. Since that script is in the official MinerU repository, it should work, so I need a resolution. As you mentioned, there are a lot of files to keep track of, and I don't want all of that; I just want to install the modules and run this script:
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

model_list = []
pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # replace with the real pdf path

local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)

image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
    local_md_dir
)  # create 00
image_dir = str(os.path.basename(local_image_dir))

reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()

pdf_info = pipe.pdf_mid_data["pdf_info"]

md_content = pipe.pipe_mk_markdown(
    image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)

if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)
Please resolve the issue in the above script:
TypeError                                 Traceback (most recent call last)
6 frames
/usr/local/lib/python3.10/dist-packages/magic_pdf/libs/pdf_image_tools.py in cut_image(bbox, page_num, page, return_path, imageWriter)
     29     byte_data = pix.tobytes(output='jpeg', jpg_quality=95)
     30
---> 31     imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
     32
     33     return img_hash256_path

TypeError: FileBasedDataWriter.write() takes 3 positional arguments but 4 were given

Once you run this script, you will see the problem.
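The traceback suggests that `cut_image` calls the writer with three positional arguments (content, path, mode), while `FileBasedDataWriter.write` accepts only two besides `self`. The following is a minimal, self-contained sketch of that kind of mismatch. It does not use MinerU at all: `NewStyleWriter`, `cut_image_like`, and `MODE_BIN` are hypothetical stand-ins, and the `(path, data)` parameter order is an assumption that the thread does not confirm.

```python
import inspect

MODE_BIN = "binary"  # stand-in for AbsReaderWriter.MODE_BIN (assumed value)


class NewStyleWriter:
    """Hypothetical stand-in for a writer whose write() takes two arguments."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data


def cut_image_like(image_writer, byte_data, path):
    """Mimics the failing call in the traceback: three positional arguments."""
    image_writer.write(byte_data, path, MODE_BIN)


def reproduce():
    """Return the TypeError message raised by the mismatched call, or None."""
    try:
        cut_image_like(NewStyleWriter(), b"\xff\xd8", "abc.jpg")
    except TypeError as exc:
        return str(exc)
    return None


message = reproduce()
print(message)
# Inspecting the signature shows how many arguments the writer really accepts.
print(inspect.signature(NewStyleWriter.write))
```

Running `inspect.signature` on the real `FileBasedDataWriter.write` in the installed package would show, in the same way, which arguments that release expects.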
Please read the code in the demo directory and modify it. You can control which files are written as output.
I have followed all the instructions, but I am facing the same issue.
https://mineru.readthedocs.io/en/latest/user_guide/quick_start/to_markdown.html
As you can see at the link above, I am running the same script. I read the code and the documentation carefully, but I still get the error below:
TypeError                                 Traceback (most recent call last)
6 frames
/usr/local/lib/python3.10/dist-packages/magic_pdf/libs/pdf_image_tools.py in cut_image(bbox, page_num, page, return_path, imageWriter)
     29     byte_data = pix.tobytes(output='jpeg', jpg_quality=95)
     30
---> 31     imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
     32
     33     return img_hash256_path

TypeError: FileBasedDataWriter.write() takes 3 positional arguments but 4 were given

I don't want control over the output; my code is failing because of the above error. I want this error resolved, and you can check it in your repository. It is related to image handling, so that is what I want fixed.
Don't read the doc in https://mineru.readthedocs.io/, it's not ready.
OK, thanks. So can I take it that the tool is not yet ready to be used the way I am trying to run it? If so, please consider fixing this issue in a future release.
We plan to launch a dedicated documentation website called “Next-docs” in the future, but the new documentation is not yet ready. For now, we recommend reading the README on GitHub for a quick overview of the project and trying out the example code in the demo directory to implement advanced features. Of course, using the command-line tools or the online demo are also convenient options. If you are familiar with notebooks, you can also explore our project further in Colab.
@Akshaybhure111 Sorry to bother you; this issue will be fixed in the next release.
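Until such a release is available, one possible stopgap is a thin adapter that accepts the three-argument call seen in the traceback and forwards it to a two-argument writer. The sketch below is not part of MinerU: `LegacyWriteAdapter` and `InMemoryWriter` are hypothetical names, the `(path, data)` order of the wrapped writer is an assumption, and whether `OCRPipe` works with such a wrapper has not been verified.

```python
class LegacyWriteAdapter:
    """Hypothetical shim: accepts write(content, path, mode) and forwards
    to a wrapped writer that exposes write(path, data)."""

    def __init__(self, inner):
        self._inner = inner

    def write(self, content, path, mode=None):
        # The mode flag is ignored; text is encoded so the inner writer
        # always receives bytes.
        data = content.encode("utf-8") if isinstance(content, str) else content
        self._inner.write(path, data)

    def __getattr__(self, name):
        # Delegate everything else (for example write_string) to the wrapped
        # writer. The guard avoids recursion if _inner is not set yet.
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)


class InMemoryWriter:
    """Stand-in for a real two-argument writer, used only for this demo."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data


inner = InMemoryWriter()
adapter = LegacyWriteAdapter(inner)
adapter.write(b"\xff\xd8", "abc.jpg", "binary")  # legacy three-argument call
adapter.write("hello", "note.txt")               # mode omitted
print(sorted(inner.files))
```

In the script from this issue, that would mean passing something like `LegacyWriteAdapter(FileBasedDataWriter(local_image_dir))` to `OCRPipe` as the image writer, again only as an untested idea.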
Thank you. I really appreciate your work and effort.
Description of the bug
As per your suggestion, I made changes, but now I get a new error. The updated script is below:

import os

from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# Arguments
model_list = []
pdf_file_name = "/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

# Prepare environment
output_image_path, output_path = "output/images", "output"
os.makedirs(output_image_path, exist_ok=True)

# Initialize data readers and writers
image_writer = DiskReaderWriter(output_image_path)
md_writer = DiskReaderWriter(output_path)

# Read PDF content
reader_writer = DiskReaderWriter("")
pdf_bytes = reader_writer.read(pdf_file_name)

# Initialize and process the OCR pipeline
pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()

# The pipe_parse stage
pipe.pipe_parse()

# Extract parsed information
pdf_info = pipe.pdf_mid_data["pdf_info"]

# Generate markdown content
md_content = pipe.pipe_mk_markdown(
    output_image_path, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)

# Write markdown content to file
if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)
My original script, written as per the documentation provided by MinerU, is below:

import os

from magic_pdf.data.data_reader_writer.filebase import FileBasedDataReader, FileBasedDataWriter
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe

# Arguments
model_list = []
pdf_file_name = r"/content/sample_data/insulin_pump_pdf.pdf"  # Replace with the actual PDF path

# Prepare environment
local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)

# Initialize data writers
image_writer = FileBasedDataWriter(local_image_dir)
md_writer = FileBasedDataWriter(local_md_dir)

# Read PDF content
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)

# Initialize and process the OCR pipeline
pipe = OCRPipe(pdf_bytes, model_list, image_writer)

pipe.pipe_classify()
pipe.pipe_analyze()

# The pipe_parse stage, now fixed
pipe.pipe_parse()

# Extract parsed information
pdf_info = pipe.pdf_mid_data["pdf_info"]

# Generate markdown content
md_content = pipe.pipe_mk_markdown(
    image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)

# Write markdown content to file
if isinstance(md_content, list):
    md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
    md_writer.write_string(f"{pdf_file_name}.md", md_content)
But, as I mentioned, it still hits the previous error. Can you please tell me what changes are needed in the code above, and provide the whole script so that I can run it successfully?
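The two scripts import their reader/writer classes from different module paths (`magic_pdf.rw.DiskReaderWriter` and `magic_pdf.data.data_reader_writer`), which suggests they target different releases of the package. A small diagnostic sketch can show which of the two paths exists in the installed environment. The module names are taken from the scripts above; the helper `first_importable` is hypothetical and uses only the standard library.

```python
import importlib.util


def first_importable(candidates):
    """Return the first module name in candidates that can be located, else None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # Raised when a parent package is missing or the name is malformed.
            continue
    return None


# Module paths as they appear in the two scripts from this issue.
READER_WRITER_MODULES = [
    "magic_pdf.data.data_reader_writer",
    "magic_pdf.rw.DiskReaderWriter",
]

found = first_importable(READER_WRITER_MODULES)
print(found or "neither module is available in this environment")
```

Knowing which path is present makes it easier to pick the matching demo script instead of mixing the two styles.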
How to reproduce the bug
Run either script from the description above.
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.9.x
Device mode
cuda