mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Is it possible to load local PDF file using chunks? #5304

Closed bh213 closed 3 years ago

bh213 commented 10 years ago

I'd like to iterate over (but not render) all pages of a PDF file that is accessible through the HTML5 File API but is large (from 500 MB up to a few GB).

Loading the whole file with FileReader usually crashes the browser.

From my testing, it seems that combining DocumentInitParameters.data with pdfDataRangeTransport doesn't work, as pdfDataRangeTransport seems to work only with DocumentInitParameters.url.

Is it possible to enable chunked loading for local files at all?

Thanks.

yurydelendik commented 10 years ago

Currently, even with chunked loading you are filling a contiguous buffer on the worker side. Our Stream implementation needs major refactoring so that it can evict unused data and read from a non-contiguous data structure. However, even in its current state that might help you save memory, since the huge file will be stored once instead of at least twice (the additional copy being on the main thread before it is sent to the worker).
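To illustrate what "evict unused data and read from a non-contiguous data structure" could mean, here is a minimal sketch. It is not how pdf.js stores data: `SparseChunkStore`, its methods, and the least-recently-used eviction policy are all assumptions made for this example.

```javascript
// Hypothetical sparse chunk store (not part of pdf.js).
// Chunks live in a Map keyed by chunk index instead of one contiguous buffer,
// and the least recently used chunk is evicted once maxChunks is exceeded.
class SparseChunkStore {
  constructor(chunkSize, maxChunks) {
    this.chunkSize = chunkSize;
    this.maxChunks = maxChunks;
    // A Map iterates in insertion order, so the first key is the oldest entry.
    this.chunks = new Map();
  }

  // Store (or refresh) a chunk, evicting the oldest chunks if over capacity.
  put(index, bytes) {
    this.chunks.delete(index);
    this.chunks.set(index, bytes);
    while (this.chunks.size > this.maxChunks) {
      const oldest = this.chunks.keys().next().value;
      this.chunks.delete(oldest);
    }
  }

  has(index) {
    return this.chunks.has(index);
  }

  // Return the byte at an absolute file offset, or undefined when the chunk
  // holding it is not loaded (a caller would then request that range again).
  getByte(offset) {
    const index = Math.floor(offset / this.chunkSize);
    const chunk = this.chunks.get(index);
    if (chunk === undefined) {
      return undefined;
    }
    // Re-insert the chunk to mark it as most recently used.
    this.chunks.delete(index);
    this.chunks.set(index, chunk);
    return chunk[offset % this.chunkSize];
  }
}
```

With such a structure, memory use is bounded by `chunkSize * maxChunks` rather than by the file size, at the cost of re-requesting evicted ranges.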

From my testing, it seems that combining DocumentInitParameters.data with pdfDataRangeTransport doesn't work, as pdfDataRangeTransport seems to work only with DocumentInitParameters.url.

It should work, since it's used by the extension. In the Firefox extension, the worker cannot request data directly, only via pdfDataRangeTransport. Could you prepare a short example that we can use for testing?

Personally, I would like to see the pdfDataRangeTransport interface redesigned (in addition to #5277) to make the File API and any other custom chunk transfers (e.g. over WebRTC) easier. I can mentor and review a PR related to that.

bh213 commented 10 years ago

I managed to get this to work using createObjectURL and a custom PdfDataRangeTransport that reads from a Blob. I'll try to put together a short example.

Regarding allocating the full PDF size: should that be fixed in ChunkedStream, or are there other places where a fix would be needed? Thanks.
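The approach described above, serving byte ranges from a Blob on request, can be sketched without pdf.js. The following is not the sample from this thread: `readBlobRange` and `BlobRangeTransport` are hypothetical names, the class only mirrors the general shape of a range transport (a `length`, a `requestDataRange(begin, end)` method, and listeners that receive each chunk), and it assumes a runtime where `Blob.prototype.arrayBuffer` is available.

```javascript
// Read bytes [begin, end) of a Blob or File without loading the rest of it.
async function readBlobRange(blob, begin, end) {
  const buffer = await blob.slice(begin, end).arrayBuffer();
  return new Uint8Array(buffer);
}

// Hypothetical stand-in for a custom range transport (not the pdf.js class).
class BlobRangeTransport {
  constructor(blob) {
    this.blob = blob;
    this.length = blob.size; // total size in bytes
    this.listeners = [];
  }

  // Register a callback that receives (begin, chunk) for every range served.
  addRangeListener(listener) {
    this.listeners.push(listener);
  }

  // Read the requested range from the Blob and hand it to all listeners.
  async requestDataRange(begin, end) {
    const chunk = await readBlobRange(this.blob, begin, end);
    for (const listener of this.listeners) {
      listener(begin, chunk);
    }
    return chunk;
  }
}
```

In a real integration, the class would presumably extend the transport class shipped with pdf.js and forward each chunk to it instead of to its own listeners. That wiring depends on the pdf.js version and is left out here.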

yurydelendik commented 10 years ago

Regarding allocating the full PDF size: should that be fixed in ChunkedStream, or are there other places where a fix would be needed?

Yeah, especially its makeSubStream and getBytes methods. Also, the internal 'bytes' property is sometimes used directly (not sure whether those places still exist, but we should double-check).

bh213 commented 10 years ago

Here is a sample that reads chunks from a local file (as a link, because I cannot attach a zip file): https://dl.dropboxusercontent.com/u/12269315/pdf.js/sample.zip

Note that the DocumentInitParameters passed to getDocument seem to need a 'length' parameter, or it doesn't work (this is not in the docs).
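A small sketch of the parameter object this implies, assuming the `url`, `range` and `length` fields of DocumentInitParameters are the relevant ones. `buildInitParameters` is a hypothetical helper written for this example and not part of pdf.js, and whether `length` is still required may depend on the pdf.js version.

```javascript
// Hypothetical helper (not part of pdf.js): assembles the init parameters
// described in this thread for a local File plus a custom range transport.
function buildInitParameters(file, objectUrl, rangeTransport) {
  if (!Number.isInteger(file.size) || file.size <= 0) {
    throw new RangeError("file.size must be a positive integer");
  }
  return {
    url: objectUrl,        // e.g. the result of URL.createObjectURL(file)
    range: rangeTransport, // custom transport that serves byte ranges
    length: file.size,     // total size in bytes; reported above as required
  };
}

// In a page that has loaded a recent pdf.js build exposing `pdfjsLib`,
// the result would be handed to getDocument, roughly like this:
//   const params = buildInitParameters(file, URL.createObjectURL(file), transport);
//   pdfjsLib.getDocument(params).promise.then((pdf) => console.log(pdf.numPages));
```

Taking `length` from `file.size` avoids reading the file just to learn its size.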

I'll take a look at ChunkedStream now.

yurydelendik commented 10 years ago

@bh213 your example seems to be working. Is there a particular issue we should look for?

Note that the DocumentInitParameters passed to getDocument seem to need a 'length' parameter, or it doesn't work (this is not in the docs).

So you are saying that, to close this issue, we just need to document the 'length' parameter?

bh213 commented 10 years ago

Yes, I think just adding 'length' to the docs should do. I'll open another issue for the memory usage of ChunkedStream. Thanks.

kimar commented 9 years ago

Hey @bh213, could you reupload your working sample? Would be really interested in taking a look at this :+1:

bh213 commented 9 years ago

Sorry, I referenced the wrong PR. Funnily enough, I am just updating the code that accesses the local file; I will let you know when it works.

bh213 commented 9 years ago

A bit late, but here is an example of how to load a local file:

https://jsfiddle.net/6wxnd9uu/6/

Snuffleupagus commented 3 years ago

Given, first of all, the somewhat specialized use case outlined here; secondly, that this issue hasn't seen a lot of activity over the years; and, most importantly, the sheer complexity of the patch attempted in PR #5332, I'm really not at all convinced that we should attempt/accept a patch which fundamentally rewrites the ChunkedStream/ChunkedStreamManager implementations, because of the regression risks involved (it's crucial that this code works correctly).

Throughout the years, even a couple of small (and seemingly safe) patches to the ChunkedStream/ChunkedStreamManager code have caused breakage in real-world cases. Given the complexity of this code, it's very unlikely that e.g. unit tests would be able to capture all of its aspects well enough to ensure that it's sufficiently tested. A bigger refactoring/rewrite would thus put a lot of burden, first of all, on the reviewer and, secondly, on the regular PDF.js contributors who have to maintain this code. All in all, I'm thus suggesting that we WONTFIX this issue.

/cc @timvandermeij

timvandermeij commented 3 years ago

That seems fair to me indeed. Let's close this for now.