opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.8k stars 893 forks

Issues with Equation Handling, Header/Footer Extraction, and Bounding Box Conversion for Tables #679

Open Akhilesh-pandey1 opened 2 days ago

Akhilesh-pandey1 commented 2 days ago
  1. Equation Parsing Issue: The extraction of equations within the "interline_equation" section is incorrect. I would like to know if there is a way to avoid extracting these equations as text and instead generate an image of the equation.

  2. Header and Footer Extraction: Is there a configuration option available to extract the text and images from the headers and footers of the PDF?

  3. Configuration Documentation: Could you please tell me which settings can be changed through the config, and how? I know of some, such as table, model directory, and device mode, but I would like to know all of them.

  4. Bounding Box Conversion for Camelot: I’ve been working on converting bounding boxes from MinerU to Camelot in order to extract table data. Do you have any idea how we can do this in a better way? While the method works for some PDFs, it fails with others. Here’s the code I am using:

"""
def convert_mineru_to_camelot_bbox(input_bbox, pdf_path):
    # Page size as seen by Camelot (helper defined elsewhere)
    width, height = get_camelot_page_width_height(pdf_path)
    x1, y1, x2, y2 = input_bbox
    # Flip the y axis by subtracting each y value from the page height
    new_y1 = height - y1
    new_y2 = height - y2
    new_bbox = [x1, new_y1, x2, new_y2]
    logger.info(f"Converted {input_bbox} to {new_bbox}")
    return new_bbox
"""

Thanks for developing such a fantastic tool for parsing and OCR, and for making it open source! Your efforts in creating such detailed and user-friendly software are truly appreciated! 🙌✨

myhloli commented 2 days ago
  1. You can find them in xxxx_middle.json.
  2. Many configuration options are not defined in magic-pdf.json; you need to read the source code and pass some arguments through the API.
  3. The DPI may differ between the two tools; MinerU uses 72 DPI.
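
To illustrate the DPI point in the last answer, here is a minimal sketch of a conversion that rescales before flipping the y axis. It is not taken from MinerU or Camelot. It assumes the input bbox is `[x0, y0, x1, y1]` with a top-left origin, that `page_height_pt` is the page height in PDF points (72 per inch), and that the target is Camelot's `table_areas` string form (left, top, right, bottom in PDF coordinate space). The function name and the `source_dpi` parameter are hypothetical and exist only for this example.

```python
def mineru_bbox_to_camelot_area(bbox, page_height_pt, source_dpi=72.0):
    """Convert a top-left-origin bbox to a 'left,top,right,bottom' string.

    Assumptions (not confirmed by either project in this thread):
      - bbox is [x0, y0, x1, y1], with y growing downward from the page top
      - page_height_pt is the page height in PDF points (72 per inch)
      - source_dpi is the resolution the bbox was measured at
    """
    # Rescale to PDF points in case the bbox was measured at another DPI.
    scale = 72.0 / source_dpi
    x0, y0, x1, y1 = (value * scale for value in bbox)

    # Normalise the corner order so left <= right and top is above bottom.
    left, right = min(x0, x1), max(x0, x1)

    # Flip the y axis: PDF coordinate space measures y from the page bottom.
    top = page_height_pt - min(y0, y1)
    bottom = page_height_pt - max(y0, y1)

    return f"{left:.2f},{top:.2f},{right:.2f},{bottom:.2f}"


if __name__ == "__main__":
    # A bbox on a US Letter page (792 pt tall), already at 72 DPI.
    print(mineru_bbox_to_camelot_area([100, 50, 300, 200], 792))
```

The resulting string could then be passed in a list as the `table_areas` argument of `camelot.read_pdf`. If conversions still fail on some PDFs, it may be worth checking that the page height comes from the same page as the bbox, and whether that page is rotated or has a crop box offset, since the sketch above handles neither case.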