mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Text selection is not remembered when scrolling #18221

Open kekkc opened 5 months ago

kekkc commented 5 months ago

https://mozilla.github.io/pdf.js/web/viewer.html

Steps to reproduce the problem:

  1. Select some text (e.g. "Dynamic languages such as JavaScript are more difficult to com-")
  2. Scroll to the last page of the PDF and then scroll to the top again

What is the expected behavior? Selection is still marked in blue (so that I can see where I stopped reading)

What went wrong? Selection is gone, no text is selected any more

Adarsh01208 commented 5 months ago

I have also faced this issue

Legus-Yeung commented 5 months ago

When I reproduced this issue, I highlighted text on page 1, and it stayed highlighted as long as I didn't scroll past page 8. If I highlight text on the first page and scroll to page 9 or beyond, the highlight disappears.

https://mozilla.github.io/pdf.js/web/viewer.html

This makes me think that the PDF viewer "forgets/unloads" the highlighted text as it renders more pages, in order to conserve memory. So the threshold may not be a fixed 8 pages: if the pages in between have a lot of content, such as HD pictures, the highlighted text might disappear sooner.

Suggested solution: Create a variable or a sidebar that stores the highlighted text. This could be very useful for people doing research on a PDF.

borontion commented 3 months ago

There is a buffer for page views, and views that exceed the buffer size are destroyed

https://github.com/mozilla/pdf.js/blob/7a8aceef20925dc082a8f1f863134958ceb46015/web/pdf_viewer.js#L147-L149

The buffer size will be 10 for this document

https://github.com/mozilla/pdf.js/blob/7a8aceef20925dc082a8f1f863134958ceb46015/web/pdf_viewer.js#L1634
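To illustrate the behaviour described above, here is a minimal sketch of a bounded page-view buffer. It is a simplified model and not the actual pdf.js code: the class name `PageViewBufferSketch` and the helper `makeView` are hypothetical, and the size of 10 is simply the value reported for this document.

```javascript
// Simplified, hypothetical sketch of a bounded page-view buffer.
// This is NOT the real pdf.js implementation; all names are illustrative.
class PageViewBufferSketch {
  #views = new Set(); // a Set iterates in insertion order: oldest first
  #size;

  constructor(size) {
    this.#size = size;
  }

  // Mark a view as most recently used; evict the oldest view on overflow.
  push(view) {
    this.#views.delete(view); // re-adding moves the view to the end
    this.#views.add(view);
    if (this.#views.size > this.#size) {
      const oldest = this.#views.values().next().value;
      this.#views.delete(oldest);
      oldest.destroy(); // the rendered page, and any selection in it, is gone
    }
  }

  has(view) {
    return this.#views.has(view);
  }
}

// Stand-in for a rendered page; destroy() models tearing down its DOM.
function makeView(pageNumber) {
  return {
    pageNumber,
    destroyed: false,
    destroy() {
      this.destroyed = true;
    },
  };
}

const buffer = new PageViewBufferSketch(10); // assumed size, as reported above
const views = [];
for (let page = 1; page <= 11; page++) {
  const view = makeView(page);
  views.push(view);
  buffer.push(view);
}
console.log(views[0].destroyed); // true: page 1 was evicted by page 11
```

In this model, a selection made on page 1 has nothing left to attach to once the eleventh distinct page has been rendered, which would be consistent with the selection disappearing after scrolling several pages away.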

kekkc commented 3 months ago

Nice catch, but there seems to be no easy way to disable this buffer.

BTW: I just tested with Chrome and it preserves the selection and also the pages. Even with a 500-page PDF, Chrome seems to be using no cache or buffer at all (a usual website with 500 pages also has no buffer).

QuantumAbraham commented 2 months ago

> BTW: I just tested with Chrome and it preserves the selection and also the pages. Even with a 500-page PDF, Chrome seems to be using no cache or buffer at all (a usual website with 500 pages also has no buffer).

Yes, the issue is not present in Chrome, but they do use buffers; it's only that the text selection is preserved.

kekkc commented 2 months ago

> Yes, the issue is not present in Chrome, but they do use buffers; it's only that the text selection is preserved.

There's no delay in Chrome for 500 pages, so I assumed that there's also no buffer. In contrast, FF always shows the loading icon, and I cannot see the content while fast scrolling through the 500 pages in 2 seconds.

kekkc commented 2 months ago

Update: it seems Chrome's PDFium is using caching (https://groups.google.com/g/pdfium/c/_zi2mMDiyjo), but only within C++. JavaScript already has its own garbage collection within FF. I really see no need here to stick to this 15+ year old, performance-impacting buffer.

calixteman commented 2 months ago

> Update: it seems Chrome's PDFium is using caching (https://groups.google.com/g/pdfium/c/_zi2mMDiyjo), but only within C++. JavaScript already has its own garbage collection within FF. I really see no need here to stick to this 15+ year old, performance-impacting buffer.

@kekkc, could you elaborate on how the buffer impacts performance?

Why would removing the PDFPageViewBuffer fix the text selection issue?

Snuffleupagus commented 2 months ago

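Please note that PDFPageViewBuffer exists for good reason, and it cannot just be removed since that'd cause significant problems in general:

  1. It would affect memory usage very badly in long PDF documents, i.e. ones consisting of hundreds or thousands of pages.
  2. It would affect overall performance really badly, particularly in long PDF documents, since it'd allow an arbitrary number of pages to compete for parsing and rendering resources which could easily cause the viewer to become unusable.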

kekkc commented 2 months ago

@calixteman Scrolling performance is bad. If I'm at page 500 and scroll up quickly to the top, I only see the loading icon on the pages. With PDFium, for example, I still see the content and images while scrolling and can stop as soon as I find what I was looking for. If there were no buffer, no pages would be deleted from RAM and no selection would be lost.

@Snuffleupagus

> It would affect memory usage very badly in long PDF documents, i.e. ones consisting of hundreds or thousands of pages.

I would love to test this and set the buffer to a trillion pages, but unfortunately the build environment documentation doesn't list all dependencies, such as Python, which I cannot install directly here. It would be awesome if someone had a https://mozilla.github.io/pdf.js/web/viewer.html link to test this in the real world.

> It would affect overall performance really badly, particularly in long PDF documents, since it'd allow an arbitrary number of pages to compete for parsing and rendering resources which could easily cause the viewer to become unusable.

I don't see why this is different from any other usual webpage in Firefox. If a website has 1 trillion pages and the user has a thousand tabs open, FF and the website, including pdf.js, might become unresponsive. But FF wouldn't just start to "flush" the first pages of the active website because it is big. And even if it did, it should be up to the browser to decide how to distribute the available parsing and rendering resources for the active website across tabs, CPU threads and garbage collection.

calixteman commented 2 months ago

Sorry, but you didn't explain to me why removing the buffer would improve performance. PDFium is usually faster because it's written in C++, whereas pdf.js is pure JavaScript and we use almost exclusively what JS/CSS/HTML provide. That said, in terms of security, pdf.js is much better than PDFium because it's written in JS while PDFium is written in C++.

Even if removing the buffer fixes the selection issue, it would introduce the problems mentioned by @Snuffleupagus, which are worse than the selection one, so we won't do that. The solution to the original problem (the selection one) is likely something along the lines of caching the selection when the page is destroyed and restoring it when the page is shown again.
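As a rough illustration of the "cache and restore" idea, the sketch below stores a selection as character offsets into a page's text, so that it can be mapped back onto the text layer after the page has been rendered again. Everything here is hypothetical and not part of pdf.js: the class `SelectionCacheSketch`, the helper functions, and the simplification that a text layer is just an array of span strings.

```javascript
// Hypothetical sketch: remember a selection as character offsets so it can be
// re-applied after a page's text layer is rebuilt. None of this is pdf.js API.

// Convert a (span index, offset in span) position to an offset in the page text.
function toAbsoluteOffset(spans, spanIndex, offset) {
  let total = 0;
  for (let i = 0; i < spanIndex; i++) {
    total += spans[i].length;
  }
  return total + offset;
}

// Inverse of toAbsoluteOffset; clamps to the end of the last span.
function fromAbsoluteOffset(spans, absolute) {
  let remaining = absolute;
  for (let i = 0; i < spans.length; i++) {
    if (remaining <= spans[i].length) {
      return { spanIndex: i, offset: remaining };
    }
    remaining -= spans[i].length;
  }
  const last = spans.length - 1;
  return { spanIndex: last, offset: spans[last].length };
}

class SelectionCacheSketch {
  #byPage = new Map(); // pageNumber -> { start, end } as absolute offsets

  // Intended to be called just before a page view is destroyed.
  save(pageNumber, spans, selection) {
    this.#byPage.set(pageNumber, {
      start: toAbsoluteOffset(spans, selection.startSpan, selection.startOffset),
      end: toAbsoluteOffset(spans, selection.endSpan, selection.endOffset),
    });
  }

  // Intended to be called after the page's text layer is rendered again.
  restore(pageNumber, spans) {
    const saved = this.#byPage.get(pageNumber);
    if (!saved || spans.length === 0) {
      return null; // nothing cached, or nothing to attach the selection to
    }
    return {
      start: fromAbsoluteOffset(spans, saved.start),
      end: fromAbsoluteOffset(spans, saved.end),
    };
  }
}

const cache = new SelectionCacheSketch();
const before = ["Dynamic ", "languages"]; // spans before the page is destroyed
cache.save(1, before, { startSpan: 0, startOffset: 3, endSpan: 1, endOffset: 4 });

const after = ["Dyn", "amic languages"]; // same text, different span split
const restored = cache.restore(1, after);
console.log(restored.start, restored.end);
```

Storing offsets into the page text, rather than references to DOM nodes, is a design choice made for this sketch: the nodes no longer exist once the page is destroyed, and the spans may be split differently on the next render. In a browser, the restored positions could then be applied with the standard `Range` and `Selection` DOM APIs.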

I'm not sure I understand exactly what you mean by "other usual webpage"... For example, if you open X (aka Twitter), there is no real concept of a "page": there is only one "giant" page you can scroll forever. Of course this "page" isn't fully rendered, otherwise it would consume all the resources of your computer; it's up to the web devs to make this possible by showing only what the user is seeing.

That said, feel free to use PDFium if you think it's better for your use case. Or, if you want to use pdf.js, it's free and open-source software you can contribute to, so feel free to open a PR to improve either performance or text selection.

kekkc commented 2 months ago

> That said, in terms of security, pdf.js is much better than PDFium because it's written in JS while PDFium is written in C++.

Indeed, that's the main reason why I replaced every "native" local PDF reader with pdf.js in FF (possible since forms support was added), which is the only one that uses FF's sharp sub-pixel rendering on HiDPI ;) Subjectively I feel no real performance penalty for the first render using JS compared to C++ / Rust.

> I'm not sure I understand exactly what you mean by "other usual webpage"...

Yes, for infinite scrolling the web dev is responsible for making sure the infinite scroll always works. But PDFs and static websites are not infinite; their end and size are known. Even for the current record holder, the longest static website worldwide, no browser would simply delete the first pages when you scroll to the end, and there's no performance penalty either (any JS operation can be executed at least 15,000 times every millisecond, even more on current ARM chips!): http://www.recordholders.org/de/laengste/lang-a.htm

However, I did see unresponsiveness with PDFs that have only a few pages but hundreds of MB of images. Both the browser and pdf.js were unresponsive, but of course the cache buffer doesn't prevent that, and even "native" PDF readers became unresponsive there.

> The solution to the original problem (the selection one) is likely something along the lines of caching the selection when the page is destroyed and restoring it when the page is shown again.

Or increasing the buffer to 10,000 pages ;) Seriously, are there tests showing that this buffer is needed, or why this buffer size is reasonable? The build environment is failing for me because of the same Mac OS node-pre-gyp errors on the Windows & Linux VMs here, but if someone could provide a pdf.js/web/viewer.html link with an adjusted cache size, I could do some testing with the world's most famous books for CPU time, threads & RAM:

If all else fails and there is a real-world reason for the buffer, caching the selection would be cool, since I feel it's the only disadvantage left compared to native PDF readers.

kekkc commented 2 months ago

Was able to test in a separate VM: the buffer is still used for loading new pages, but "destroying" the old pages is disabled. Performance is similar to native PDF readers, and scrolling to the top shows the actual content. No problem with the big PDFs above (if someone retests, at least 2 pages should be visible on the screen while scrolling). I can do some extensive comparisons if there's interest in going this route eventually. The only strange thing I noticed while fast scrolling was that the top 20% of the screen is always grey, I guess because FF is repainting.

The selection is of course preserved. Removed the following lines:

https://github.com/mozilla/pdf.js/blob/7a8aceef20925dc082a8f1f863134958ceb46015/web/pdf_viewer.js#L147-L149

https://github.com/mozilla/pdf.js/blob/7a8aceef20925dc082a8f1f863134958ceb46015/web/pdf_viewer.js#L177-L179

https://github.com/mozilla/pdf.js/blob/7a8aceef20925dc082a8f1f863134958ceb46015/web/pdf_viewer.js#L190-L195

kekkc commented 2 months ago

Did a PR. Bottom line, IMO: there's no need to cache selections; pages with selections just shouldn't be destroyed. Or, easier still, no page should be destroyed unless there's a specific reason, as with thumbnail pages, or there could be a setting in FF.

QuantumAbraham commented 2 months ago

@calixteman I agree with you on the solution you suggested for this issue, I've been thinking about it myself. Either we:

  1. Persist text selections: cache text selections separately from the page buffer. This would involve saving the selection data when a page is destroyed and restoring it when the page is rendered again, OR
  2. Change the page destruction strategy: modify the logic to prevent pages with selections from being destroyed, and add a setting to adjust how pdf.js handles page buffering and destruction.
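The second option could look roughly like the sketch below: a bounded buffer whose eviction step skips "pinned" views, here meaning pages that currently hold a selection. This is an illustration under stated assumptions, not pdf.js code: `PinnedPageBufferSketch`, the `isPinned` callback and the `hasSelection` flag are all hypothetical names.

```javascript
// Hypothetical sketch of option 2: a bounded buffer that never evicts pinned
// views (e.g. pages holding a selection). Names are illustrative, not pdf.js API.
class PinnedPageBufferSketch {
  #views = new Set(); // insertion order: oldest first
  #size;
  #isPinned;

  constructor(size, isPinned) {
    this.#size = size;
    this.#isPinned = isPinned; // callback: should this view be kept alive?
  }

  push(view) {
    this.#views.delete(view); // re-adding moves the view to the end
    this.#views.add(view);
    if (this.#views.size <= this.#size) {
      return;
    }
    // Evict the oldest view that is neither pinned nor the one just pushed.
    // If every other view is pinned, the buffer is allowed to grow past its size.
    for (const candidate of this.#views) {
      if (candidate !== view && !this.#isPinned(candidate)) {
        this.#views.delete(candidate);
        candidate.destroy();
        break;
      }
    }
  }

  get count() {
    return this.#views.size;
  }
}

// Stand-in for a rendered page view.
function makePage(pageNumber, hasSelection = false) {
  return {
    pageNumber,
    hasSelection,
    destroyed: false,
    destroy() {
      this.destroyed = true;
    },
  };
}

const pinnedBuffer = new PinnedPageBufferSketch(2, (v) => v.hasSelection);
const page1 = makePage(1, true); // the user selected text on page 1
const page2 = makePage(2);
const page3 = makePage(3);
[page1, page2, page3].forEach((p) => pinnedBuffer.push(p));

console.log(page1.destroyed, page2.destroyed, page3.destroyed); // false true false
```

One trade-off visible in this sketch is that the buffer can exceed its nominal size while pages stay pinned, so memory use is no longer strictly bounded; in practice a selection usually spans only a few pages.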

@Snuffleupagus and @kekkc what do y'all think about this?