qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

IndexError: list index out of range (slopes[region_idx]) #45

Closed andbue closed 3 years ago

andbue commented 3 years ago

Hi, when running Eynollah on this image, using

en = Eynollah("models_eynollah", imgfile, dir_out=path.split(imgfile)[0], curved_line=True, full_layout=True)
pcgts = en.run()

it fails with

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "venv/lib/python3.6/site-packages/qurator/eynollah/eynollah.py", line 2074, in run
    pcgts = self.writer.build_pagexml_full_layout(contours_only_text_parent, contours_only_text_parent_h, page_coord, order_text_new, id_of_texts_tot, all_found_texline_polygons, all_found_texline_polygons_h, all_box_coord, all_box_coord_h, polygons_of_images, polygons_of_tabels, polygons_of_drop_capitals, polygons_of_marginals, all_found_texline_polygons_marginals, all_box_coord_marginals, slopes, slopes_marginals, cont_page, polygons_lines_xml)
  File "venv/lib/python3.6/site-packages/qurator/eynollah/writer.py", line 221, in build_pagexml_full_layout
    self.serialize_lines_in_region(textregion, all_found_texline_polygons_h, mm, page_coord, all_box_coord_h, slopes, counter)
  File "venv/lib/python3.6/site-packages/qurator/eynollah/writer.py", line 117, in serialize_lines_in_region
    if self.curved_line and np.abs(slopes[region_idx]) <= 45:
IndexError: list index out of range
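
The traceback shows heading data (`all_found_texline_polygons_h`, `all_box_coord_h`) being passed together with the plain `slopes` list. One possible cause, not confirmed in this thread, is that the heading index runs past the end of a list sized for the main text regions. Below is a minimal sketch of that failure mode with made-up values; `slope_for` and `find_unmatched` are hypothetical helpers and not part of Eynollah.

```python
# Hypothetical stand-ins for the lists named in the traceback; values are made up.
slopes = [0.5, -1.2]                     # assumed: one slope per main text region
header_lines = [["h0"], ["h1"], ["h2"]]  # assumed: one entry per heading region


def slope_for(slopes, region_idx):
    """Return the slope for region_idx, or None if the list has no such entry."""
    if 0 <= region_idx < len(slopes):
        return slopes[region_idx]
    return None


def find_unmatched(regions, slopes):
    """Return the indices of regions that have no corresponding slope entry."""
    return [idx for idx in range(len(regions)) if slope_for(slopes, idx) is None]


# Unguarded indexing raises the same error as in the traceback.
try:
    slopes[2]
except IndexError as exc:
    print("IndexError:", exc)

# The guarded lookup reports which regions are missing a slope instead.
unmatched = find_unmatched(header_lines, slopes)
print(unmatched)
```

A check like this only localizes a length mismatch; it does not say which list is the wrong one.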

Some other pages from the book seem to work, the results are looking really good (except for drop caps, but they are not that easy to identify and put in the correct order for humans as well). I'm segmenting the rest of the book now and will see if there are more errors like that one.

cneud commented 3 years ago

Thanks for reporting and providing a test image, we will look into this asap.

Regarding drop caps, I assume you are already aware that recognition of drop caps can be enabled by the -fl flag?

vahidrezanezhad commented 3 years ago

Dear @andbue, please check that you are using the latest version of eynollah. I have already checked your test document and it goes through.

andbue commented 3 years ago

Hi @vahidrezanezhad, thanks for having a look into this! I set up a new venv, downloaded the models with make models, made sure to check out the latest master, and tested from the CLI this time.

The image I linked to earlier actually ran through. Then I realized that at first I had downloaded the same image not as JPG, but as PNG from https://api.digitale-sammlungen.de/iiif/image/v2/bsb00052981_00019/full/full/0/default.png, hoping for better quality without compression – and this image reliably fails for me when I run eynollah -i default.png -o . -m eynollah/models_eynollah -fl -cl. Sorry for posting the wrong link, I would have never thought this could make a difference here!
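
For anyone trying to reproduce this with both formats: the PNG URL above follows the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, so the two variants differ only in the file extension. A small sketch; `iiif_image_url` is a hypothetical helper, and the JPG URL it builds is an assumption, since the originally linked JPG address is not shown here.

```python
def iiif_image_url(base, identifier, fmt,
                   region="full", size="full", rotation="0", quality="default"):
    """Build an IIIF Image API URL from its path components."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


BASE = "https://api.digitale-sammlungen.de/iiif/image/v2"
IDENT = "bsb00052981_00019"

png_url = iiif_image_url(BASE, IDENT, "png")  # the URL given in this comment
jpg_url = iiif_image_url(BASE, IDENT, "jpg")  # assumed JPG counterpart

print(png_url)
print(jpg_url)
```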

Regarding the drop caps, I've set the full_layout flag, but it doesn't really help, unfortunately: half of the printed and afterwards manually painted initials here are labeled as "ImageRegion", the others are integrated into the next paragraph.

vahidrezanezhad commented 3 years ago

Hi @andbue :) I have already fixed the issue and the PR will be merged soon. @kba

About the other points you have mentioned:

1- "Initials for the linked image are labeled as ImageRegion": In our GT, handwritten text is labeled as ImageRegion, and the current models keep that labeling. Your initials are not handwritten, but they look very much like handwriting. That is why the model gets confused and segments them as ImageRegion.

2- The other question is why they are not segmented as drop capitals, or why drop capitals in general are not detected well. The answer is that the variety of drop capitals in our GT is low.

To resolve these issues, we need to boost the GT and relabel the handwritten regions.

andbue commented 3 years ago

Amazing, thank you!

Maybe it would help with the initials if users could suppress the recognition of ImageRegions for books without images? Collecting more GT data is, of course, always the best option!

vahidrezanezhad commented 3 years ago

[Image: text_but_image2]

vahidrezanezhad commented 3 years ago

[Image: text_but_image_or_graphic]

andbue commented 3 years ago

I understand, this is really too similar to that kind of drop capitals!

vahidrezanezhad commented 3 years ago

> I understand, this is really too similar to that kind of drop capitals!

Yes, it is. But we will boost our models soon :)