huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

"triple clicking" a PDF text gives one selection feedback and different selection rectangles #4971

Open RafaPolit opened 2 years ago

RafaPolit commented 2 years ago

Describe the bug Triple-clicking a line in the PDF visually selects only one row (a title, for example), but when a text reference is created, the selection rectangles that are sent cover the entire page.

To Reproduce Steps to reproduce the behavior:

  1. Go to a document and triple-click some text to select the line
  2. Confirm that the "blueish" selection looks like a single line
  3. Create a reference to another entity
  4. The highlighted area is the entire page, and the network call confirms that the client sent a large number of selection rectangles from all over the page

Expected behavior The sent rectangles should match the selected area.

This affects the "click-to-fill" flow as well.
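
To make the expected behavior checkable (for example in an E2E test against the fixture PDFs), something like the following could be used. This is only a sketch: the `Rect` shape and the `partitionBySelection` helper are hypothetical and not part of Uwazi, and it assumes the bounds of the visually highlighted line are already known by some other means.

```typescript
// Hypothetical rectangle shape for illustration; not Uwazi's actual type.
interface Rect {
  top: number;
  left: number;
  width: number;
  height: number;
}

// True when `inner` lies inside `outer`, allowing `tolerance` pixels of slack.
function isInside(inner: Rect, outer: Rect, tolerance = 1): boolean {
  return (
    inner.top >= outer.top - tolerance &&
    inner.left >= outer.left - tolerance &&
    inner.top + inner.height <= outer.top + outer.height + tolerance &&
    inner.left + inner.width <= outer.left + outer.width + tolerance
  );
}

// Splits the rectangles about to be sent into those that match the visible
// selection and the strays that fall outside it.
function partitionBySelection(
  rects: Rect[],
  visible: Rect,
  tolerance = 1
): { matching: Rect[]; stray: Rect[] } {
  const matching = rects.filter(r => isInside(r, visible, tolerance));
  const stray = rects.filter(r => !isInside(r, visible, tolerance));
  return { matching, stray };
}
```

With the bug described here, `stray` would be non-empty after a triple click; with the expected behavior it would be empty.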

Additional context Both of the PDFs from the E2E Puppeteer fixtures exhibit this behavior, in case it is somehow PDF-dependent. This may simply be a bug in our boundary-rectangle detection library and therefore out of our control. If so, please file a bug report with the library's developers.

LaszloKecskes commented 2 years ago

Additional context as to why click-to-fill matters: since we use the click-to-fill rectangles as input to the information extraction service, this bug could cause the machine learning algorithm to receive incorrect training examples.
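
Until the root cause is found, one possible stopgap would be to flag click-to-fill selections whose rectangles span an implausibly large share of the page before they are used as training examples. The sketch below is an assumption, not existing Uwazi code: `SelectionRect`, `PageSize`, `coversTooMuchOfPage`, and the 0.5 threshold are all hypothetical and would need tuning against real data.

```typescript
// Hypothetical shapes for illustration; not Uwazi's actual types.
interface SelectionRect {
  top: number;
  left: number;
  width: number;
  height: number;
}

interface PageSize {
  width: number;
  height: number;
}

// Smallest rectangle that contains every rectangle in the list.
function boundingBox(rects: SelectionRect[]): SelectionRect | null {
  if (rects.length === 0) return null;
  const top = Math.min(...rects.map(r => r.top));
  const left = Math.min(...rects.map(r => r.left));
  const bottom = Math.max(...rects.map(r => r.top + r.height));
  const right = Math.max(...rects.map(r => r.left + r.width));
  return { top, left, width: right - left, height: bottom - top };
}

// Heuristic: a selection whose bounding box covers more than `maxRatio`
// of the page area is treated as suspicious. The default is arbitrary.
function coversTooMuchOfPage(
  rects: SelectionRect[],
  page: PageSize,
  maxRatio = 0.5
): boolean {
  const box = boundingBox(rects);
  if (!box || page.width <= 0 || page.height <= 0) return false;
  return (box.width * box.height) / (page.width * page.height) > maxRatio;
}
```

A heuristic like this would also reject legitimate large selections, so it would only make sense as a temporary guard for the training data, not as a fix for the selection itself.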