Equation Parsing Issue: The extraction of equations within the "interline_equation" section is incorrect. I would like to know if there is a way to avoid extracting these equations as text and instead generate an image of the equation.
Header and Footer Extraction: Is there a configuration option available to extract the text and images from the headers and footers of the PDF?
Configuration Documentation: Could you please tell me which settings can be changed through the config, and how? I know of a few, such as table, model directory, and device mode, but I would like to see the full list.
Bounding Box Conversion for Camelot: I’ve been working on converting bounding boxes from MinerU to Camelot in order to extract table data. Do you have any suggestions for a better way to do this? The method works for some PDFs but fails with others. Here’s the code I am using:
"""
def convert_mineru_to_camelot_bbox(input_bbox, pdf_path):
    width, height = get_camelot_page_width_height(pdf_path)
    x1, y1, x2, y2 = input_bbox
    new_y1 = height - y1
    new_y2 = height - y2
    new_bbox = [x1, new_y1, x2, new_y2]
    logger.info(f"Converted {input_bbox} to {new_bbox}")
    return new_bbox
"""
Thanks for developing such a fantastic tool for parsing and OCR, and for making it open source! Your efforts in creating such detailed and user-friendly software are truly appreciated! 🙌✨