[Document Text Content] Support including text content character data

kariudo commented 3 weeks ago

Adds optional params to allow for returning character data when getting text content, as well as retention of spaces by force. This allows for consumers to calculate the size and location of individual words.

calixteman commented 3 weeks ago

What's the the problem you're trying to solve ? Please remember that the main goal of pdf.js is to be the Firefox pdf viewer and we don't want to have any code which will be a burden for us. So without a good reason to have such a feature in Firefox, there is no chance that we want to increase the memory use, decrease the performance when cloning the data from the worker to the main thread.

kariudo commented 3 weeks ago

What's the the problem you're trying to solve ? Please remember that the main goal of pdf.js is to be the Firefox pdf viewer and we don't want to have any code which will be a burden for us. So without a good reason to have such a feature in Firefox, there is no chance that we want to increase the memory use, decrease the performance when cloning the data from the worker to the main thread.

The problem being solved is having access to the character data (which has already been created), when invoking the API for getDocumentContent, the param as implemented does not do extra work unless requested by the API.

The primary reason we needed this was in order to support locating the position of words in the document, in the context of the document, rather than just the full strings of text. Doing it this way was the least impactful to pdf.js, by just optionally including values that were already computed.

No additional work or memory should be really used outside of someone opting in to the extra data via the API from the worker.

calixteman commented 3 weeks ago

Ok you're correct but as said we don't want to have to manage that stuff except if it's useful for Firefox.

kariudo commented 3 weeks ago

There are no shortage of issues that come up when searching for a means of accessing this data from pdf.js, not counting plenty of "sorry no way to do that" deadends across stackoverflow etc. However, here are a few quickly spotted issues related to why providing these optional paramaters is a "low cost" benefit:

In my specific use case, I of course have the code of the PR here already in use via the fork; however, it became clear when searching for a means to do it without code changes that ther was no good way to accomplish this right now and that other people have often gone looking for a means to do so. Hence the desire to expose the optional paramaters for other consumers of pdf.js.

The decision here to include it or not, is not mine, but I certainly see no harm in it as it adds no burden to existing consumers (Firefox itself), and solves several types of character/glyph/word/text extraction related issues by being able to opt in to this data for getDocumentContent.

Snuffleupagus commented 3 weeks ago

Sorry, but I agree with everything Calixte has already said above.

We (obviously) understand that this might be helpful in certain third-party use-cases, however we have to consider the overall code-base as well (since everything adds maintenance burden). Please note that we want to avoid adding (from our perspective) unused and untested code, since that not only increases the code-size and overhead (here by creating empty Arrays) but also "forces" us to spend time maintaining and supporting this (potentially for years) all for a feature we don't use.

MikeVensel commented 3 weeks ago

I have also struggled heavily to work around this issue and have resorted to using a fork and I am frustrated by this response.

It is a little funny to me that the README says, "PDF.js is community-driven and supported by Mozilla. Our goal is to create a general-purpose, web standards-based platform for parsing and rendering PDFs." No where in here does it say, " the main goal of pdf.js is to be the Firefox pdf viewer". It also doesn't seem very, "community" driven to reject a small PR that adds functionality that the community has clearly had a desire for here.

marco-c commented 2 weeks ago

@MikeVensel "community driven" doesn't mean the "community of users" drives pdf.js development, it means the "community of contributors" drives pdf.js development.

We can't just accept any PR that comes our way for small functionalities we don't care about, as the small functionalities accrue and maintenance becomes more costly, and we have a small team of core contributors.

If this PR was from an already known contributor, we would have accepted it. If you or @kariudo contribute other things we care about, we might accept such a PR more easily because we get to know you, we start to trust you, and by having help with maintenance the additional cost becomes bearable.

Also, the fact that the PR adds a new functionality without tests doesn't help: 1) it makes it easier for us to break it without noticing; 2) it makes us trust the contribution less, the absence of tests makes it look like just a quick code drop, leaving the maintenance burden to us.

mozilla / pdf.js

[Document Text Content] Support including text content character data #18239