pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.17k stars, 495 forks

Remove_rotation() feature #3551

Closed Nageswarachand closed 3 months ago

Nageswarachand commented 3 months ago

Description of the bug

The set_rotation() feature rotates the page, but extracting details from the rotated PDF page does not give proper results. PyMuPDF version 1.24.3 provides a remove_rotation() feature. After changing the orientation with remove_rotation(), my logic that extracts line coordinates based on line width stops working: the width becomes zero or 1.0 for all lines on the PDF page.

Can you explain this feature and this bug in remove_rotation()?

How to reproduce the bug

doc = fitz.open(pdf_path)
page_no = 0
page = doc[page_no]
page.remove_rotation()
mediabox = page.mediabox
width = mediabox.width
height = mediabox.height
orientation = page.rotation

print('width - ', width, '\n', 'height - ', height, '\n', 'orientation - ', orientation)

For filtering the lines in the PDF pages:

import math
import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
beam_lines = {}  # page number -> list of (p0, p1) line segments
width_threshold = 0.10
length_threshold = 60

for page_num in range(doc.page_count):
    page = doc[page_num]
    for item in page.get_drawings():
        print(item)
        width = item['width']
        # Skip paths with no usable stroke width (None or 0) or no items.
        if not width or width <= width_threshold or not item['items']:
            continue
        # Only consider paths whose first item is a line.
        if item['items'][0][0] != 'l':
            continue
        for subitem in item['items']:
            if subitem[0] != 'l':
                continue
            try:
                line_type, p0, p1 = subitem
                lx0, ly0 = p0
                lx1, ly1 = p1
            except ValueError as e:
                print(f"Unexpected data format: {subitem}, error: {e}")
                continue
            line_length = math.hypot(lx1 - lx0, ly1 - ly0)
            # Keep long, thick lines (original note: column sides are not
            # more than 12 and the line width is higher than 0.6).
            if line_length > length_threshold:
                beam_lines.setdefault(page_num, []).append((p0, p1))

When I filter the lines by the width of the line segment, the output is empty because the width of every line is zero. Before remove_rotation(), this code returned the proper width for all lines.

PyMuPDF version

1.24.3

Operating system

Windows

Python version

3.10

JorjMcKie commented 3 months ago

I don't understand what the problem is that you are reporting here. Please re-assess the situation as follows:

  1. Load the page, page = doc[page_no]
  2. Execute page.remove_rotation()
  3. Now extract stuff from the page, for instance text or drawings, or insert text, drawings or images.

If the behavior in point 3 is not as expected, only then we have a bug.

Nageswarachand commented 3 months ago

Clear explanation about the problem

The code below loads the input PDF and uses get_drawings() to extract the width of each item. (The orientation of this input is 90 degrees.)

doc = fitz.open(pdf_path)

for page_num in range(doc.page_count):
    page = doc[page_num]
    print("orientation ---", page.rotation)
    drawing_list = page.get_drawings()

    # Print the width of the first 15 drawing items on this page.
    for item in drawing_list[:15]:
        print(item['width'])

output: image

The code below loads the same input PDF, calls remove_rotation() to change the orientation of the input from 90 to 0, and then uses get_drawings() to extract the width of each item.

doc = fitz.open(pdf_path)

for page_num in range(doc.page_count):
    page = doc[page_num]
    page.remove_rotation()
    print("orientation ---", page.rotation)
    drawing_list = page.get_drawings()

    # Print the width of the first 15 drawing items on this page.
    for item in drawing_list[:15]:
        print(item['width'])

output: image

The problem is this: the same input PDF is used for both snippets, but after remove_rotation() the item width is None or 0 for every item. I printed only the first 15 items in drawing_list, but the width is 0 for all of them when I use remove_rotation().

Queries

  1. Why is the item width None for everything? (I performed some operations using these item widths, but got None for all items.)
  2. I used PyMuPDF version 1.24.3. Is this a version problem?

Otherwise:

  1. Can you suggest any other feature in PyMuPDF? (I don't want to rotate the page. I want to change the orientation of the page and have the page coordinates change accordingly. Consider the valid orientations to be 0, 90, 180 and 270.)

JorjMcKie commented 3 months ago

Please let me have the PDF. It is required for following up.

Nageswarachand commented 3 months ago

Here is the input PDF:

Grace manor-mid rise floor and columnschedule.pdf

Please suggest another way, or fix the issue in this feature. remove_rotation() itself works fine, but afterwards the item width is None or 0, and I need the width for my later operations on the PDF.

JorjMcKie commented 3 months ago

I checked the results of page.remove_rotation(), and it does behave as expected. I can't say I really understood your problem. But please be aware that the results of page.get_drawings() for the rotated versus the derotated page may differ significantly. For example, the first path of the original page 0 (rotated by 90°) is this:

{'closePath': None,
 'color': None,
 'dashes': None,
 'even_odd': False,
 'fill': (1.0, 1.0, 1.0),
 'fill_opacity': 1.0,
 'items': [('re', Rect(938.3999633789062, 1509.5999755859375, 948.0, 1519.4400634765625), 1)],
 'layer': '',
 'lineCap': None,
 'lineJoin': None,
 'rect': Rect(938.3999633789062, 1509.5999755859375, 948.0, 1519.4400634765625),
 'seqno': 0,
 'stroke_opacity': None,
 'type': 'f',
 'width': None}

The same path after derotation of the page looks like this:

{'closePath': False,
  'color': None,
  'dashes': None,
  'even_odd': False,
  'fill': (1.0, 1.0, 1.0),
  'fill_opacity': 1.0,
  'items': [('l', Point(1504.5599365234375, 938.3999633789062), Point(1504.5599365234375, 948.0)),
            ('l', Point(1504.5599365234375, 948.0), Point(1514.4000244140625, 948.0)),
            ('l', Point(1514.4000244140625, 948.0), Point(1514.4000244140625, 938.3999633789062)),
            ('l', Point(1514.4000244140625, 938.3999633789062), Point(1504.5599365234375, 938.3999633789062))],
  'layer': '',
  'lineCap': None,
  'lineJoin': None,
  'rect': Rect(1504.5599365234375, 938.3999633789062, 1514.4000244140625, 948.0),
  'seqno': 0,
  'stroke_opacity': None,
  'type': 'f',
  'width': None}
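
Note that the same rectangle shows up as a single 're' item before derotation and as four 'l' items afterwards, and that 'width' is None in both dumps, presumably because this is a fill-only path (type 'f'). A minimal sketch of how code could treat both representations alike is shown below. The helper names iter_segments and stroke_width are hypothetical, plain tuples stand in for PyMuPDF's Point and Rect objects, and the coordinates are rounded from the dumps above.

```python
from typing import Iterator, Tuple

Point = Tuple[float, float]


def iter_segments(path: dict) -> Iterator[Tuple[Point, Point]]:
    """Yield line segments from the 'l' and 're' items of a path dict."""
    for item in path["items"]:
        kind = item[0]
        if kind == "l":
            _, p0, p1 = item
            yield (tuple(p0), tuple(p1))
        elif kind == "re":
            # A rectangle item contributes its four sides.
            x0, y0, x1, y1 = item[1]
            corners = [(x0, y0), (x0, y1), (x1, y1), (x1, y0)]
            for i in range(4):
                yield (corners[i], corners[(i + 1) % 4])


def stroke_width(path: dict) -> float:
    """Return the stroke width, treating a missing width (None) as 0."""
    width = path.get("width")
    return 0.0 if width is None else float(width)


# Simplified stand-ins for the two dumps above (values rounded).
rotated_path = {
    "type": "f",
    "width": None,
    "items": [("re", (938.4, 1509.6, 948.0, 1519.44), 1)],
}
derotated_path = {
    "type": "f",
    "width": None,
    "items": [
        ("l", (1504.56, 938.4), (1504.56, 948.0)),
        ("l", (1504.56, 948.0), (1514.4, 948.0)),
        ("l", (1514.4, 948.0), (1514.4, 938.4)),
        ("l", (1514.4, 938.4), (1504.56, 938.4)),
    ],
}

segments_rotated = list(iter_segments(rotated_path))
segments_derotated = list(iter_segments(derotated_path))
print(len(segments_rotated), len(segments_derotated))
print(stroke_width(rotated_path), stroke_width(derotated_path))
```

A filter that only accepts paths whose first item is 'l' would see these two dumps differently, so normalizing the items first keeps the comparison fair.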

Yet both dictionaries describe the same path, which you can see by multiplying the original paths[0]["rect"] by page.rotation_matrix. This gives the coordinates as if the page were not rotated:

paths[0]["rect"] * page.rotation_matrix
Rect(1504.5599365234375, 938.3999633789062, 1514.4000244140625, 948.0)

This is visibly the same rectangle of the first path after de-rotation.

So maybe this is a way for you to circumvent the problem.
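
To make the arithmetic behind that multiplication concrete, here is a small pure-Python sketch. It assumes PyMuPDF's convention that a point (x, y) multiplied by Matrix(a, b, c, d, e, f) gives (a*x + c*y + e, b*x + d*y + f), and that a transformed rectangle is the bounding box of its transformed corners. The matrix below is an assumption: a 90° rotation with an offset of 3024, a value inferred from the coordinates in this thread and not read from the PDF. In real code, page.rotation_matrix would supply the matrix.

```python
import math
from typing import List, Tuple

Matrix = Tuple[float, float, float, float, float, float]
Rect = Tuple[float, float, float, float]


def transform_point(x: float, y: float, m: Matrix) -> Tuple[float, float]:
    """Apply the affine matrix (a, b, c, d, e, f) to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rect(rect: Rect, m: Matrix) -> Rect:
    """Transform all four corners and return their bounding box."""
    x0, y0, x1, y1 = rect
    corners: List[Tuple[float, float]] = [
        transform_point(x, y, m) for x in (x0, x1) for y in (y0, y1)
    ]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Assumed offset, inferred from the numbers in this thread.
ASSUMED_OFFSET = 3024.0
# Assumed 90-degree rotation: (x, y) -> (ASSUMED_OFFSET - y, x).
ROT90: Matrix = (0.0, 1.0, -1.0, 0.0, ASSUMED_OFFSET, 0.0)

# 'rect' of the first path on the rotated page, as posted above.
original: Rect = (938.3999633789062, 1509.5999755859375, 948.0, 1519.4400634765625)
# 'rect' of the same path after derotation, as posted above.
reported: Rect = (1504.5599365234375, 938.3999633789062, 1514.4000244140625, 948.0)

derotated = transform_rect(original, ROT90)
matches = all(math.isclose(u, v) for u, v in zip(derotated, reported))
print(derotated)
print("matches the derotated rect from the thread:", matches)
```

The same helper could be applied to the points of 'l' items, which would let the rotated page's drawings be compared with the derotated ones without calling remove_rotation() at all.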