Closed: bh213 closed this issue 3 years ago.
Currently, even with chunked loading you are filling a contiguous buffer on the worker side. Our Stream implementation needs major refactoring to evict unused data and to read from a non-contiguous data structure. However, even in its current state it might help you save memory, since the huge file will be stored once instead of at least twice (with an additional copy on the main thread before it is sent to the worker).
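The kind of non-contiguous structure hinted at here could look roughly like this (an illustrative sketch, not pdf.js code; `SparseByteStore` and its method names are invented for this example):

```javascript
// Illustration: store a large file as a map of fixed-size chunks instead
// of one contiguous buffer, so only the loaded ranges occupy memory and
// unused chunks can be evicted and re-requested later.
class SparseByteStore {
  constructor(length, chunkSize = 65536) {
    this.length = length;
    this.chunkSize = chunkSize;
    this.chunks = new Map(); // chunkIndex -> Uint8Array
  }
  // Store one chunk's worth of data starting at `begin` (chunk-aligned).
  setChunk(begin, data) {
    this.chunks.set(Math.floor(begin / this.chunkSize), data);
  }
  // Read a single byte; throws if the covering chunk is not loaded.
  byteAt(pos) {
    const chunk = this.chunks.get(Math.floor(pos / this.chunkSize));
    if (!chunk) throw new Error(`chunk for offset ${pos} not loaded`);
    return chunk[pos % this.chunkSize];
  }
  // Drop a chunk to reclaim memory; it can be fetched again if needed.
  evict(begin) {
    this.chunks.delete(Math.floor(begin / this.chunkSize));
  }
}
```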
From my testing it seems that using DocumentInitParameters.data together with pdfDataRangeTransport doesn't work, since pdfDataRangeTransport appears to only work with DocumentInitParameters.url.
It should work, since it's used by the Firefox extension: there the worker cannot request data directly, only via pdfDataRangeTransport. Could you prepare a short example we can use for testing?
Personally I would like to see the pdfDataRangeTransport interface redesigned (in addition to #5277), to make the File API and any other custom chunk transfers easier (e.g. over WebRTC). I can mentor and review a PR related to that.
I managed to get this to work using createObjectURL and a custom PdfDataRangeTransport that reads from a Blob. I'll try to put together a short example.
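The approach can be sketched roughly as follows (a sketch only: `pdfjsLib.PDFDataRangeTransport`, its constructor arguments, and the `requestDataRange`/`onDataRange` hooks reflect my understanding of the pdf.js API; a minimal stub base class is included so the snippet runs standalone):

```javascript
// Sketch: feed byte ranges from a local File/Blob to pdf.js on demand.
// In a real page, pdfjsLib comes from pdfjs-dist; the stub below only
// exists so this sketch is self-contained.
const pdfjsLib = globalThis.pdfjsLib ?? {
  PDFDataRangeTransport: class {
    constructor(length, initialData) {
      this.length = length;
      this.initialData = initialData;
    }
    onDataRange(begin, chunk) {} // pdf.js consumes delivered ranges here
  },
};

// Read bytes [begin, end) of a Blob as a Uint8Array, without loading
// the whole file into memory.
function readChunk(blob, begin, end) {
  return blob.slice(begin, end).arrayBuffer().then((buf) => new Uint8Array(buf));
}

class BlobRangeTransport extends pdfjsLib.PDFDataRangeTransport {
  constructor(blob) {
    super(blob.size, /* initialData = */ null);
    this.blob = blob;
  }
  // Called by pdf.js whenever the worker needs a byte range.
  requestDataRange(begin, end) {
    readChunk(this.blob, begin, end).then((chunk) => this.onDataRange(begin, chunk));
  }
}

// Usage (hypothetical): pass the transport to getDocument together with
// an explicit 'length', which range loading requires:
//   const pdf = await pdfjsLib.getDocument({
//     range: new BlobRangeTransport(file),
//     length: file.size,
//   }).promise;
```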
Regarding allocating the full PDF size: should that be fixed in ChunkedStream, or are there other places where it would be needed? Thanks.
> Regarding allocating full PDF size - should that be fixed in ChunkedStream or are there other places where it would be needed.
Yeah, especially its makeSubStream and getBytes methods. Also, sometimes the internal 'bytes' property is used directly (not sure if those places still exist, but we should double-check).
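To illustrate the concern (an illustration, not pdf.js code): a sub-stream that copies its byte range forces that range to be materialized up front, while a view-based sub-stream only records offsets over the backing buffer:

```javascript
// Two ways a makeSubStream-style API can hand out a byte range.
// copyingSubStream allocates `length` new bytes immediately;
// viewSubStream creates a zero-copy view over the existing buffer.
function copyingSubStream(bytes, start, length) {
  return { bytes: bytes.slice(start, start + length), pos: 0 };
}
function viewSubStream(bytes, start, length) {
  return { bytes: bytes.subarray(start, start + length), pos: 0 };
}
```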
Here is a sample that reads chunks from a local file (linked because I cannot attach a zip file): https://dl.dropboxusercontent.com/u/12269315/pdf.js/sample.zip
Note that it seems that getDocument's DocumentInitParameters needs the 'length' parameter, or it doesn't work (that is not in the docs).
I'll take a look at ChunkedStream now.
@bh213 your example seems to be working. Is there a particular issue we should look for?
> Note that it seems that getDocument DocumentInitParameters needs 'length' parameter or it doesn't work (that is not in the docs).
So you're saying that to close this issue we just need to adjust the documentation to document 'length'?
Yes, I think just adding 'length' to the docs should do it. I'll open another issue for the memory usage of ChunkedStream. Thanks.
Hey @bh213, could you reupload your working sample? Would be really interested in taking a look at this :+1:
Sorry, I referenced the wrong PR. Funnily enough, I am just updating the code that accesses the local file; I'll let you know when it works.
A bit late, but here is an example of how to load a local file:
Given, first of all, the somewhat specialized use case outlined here; secondly, that this issue hasn't seen a lot of activity over the years; and, most importantly, the sheer complexity of the patch attempted in PR #5332, I'm really not at all convinced that we should attempt/accept a patch which fundamentally rewrites the ChunkedStream/ChunkedStreamManager implementations, because of the regression risks involved (it's crucial that this code works correctly).
Throughout the years, even a couple of small (and seemingly safe) patches to the ChunkedStream/ChunkedStreamManager code have caused breakage in real-world cases. Given the complexity of this code, it's very unlikely that e.g. unit tests would be able to capture all aspects of it well enough to ensure that it's sufficiently tested.
A bigger refactoring/rewrite would thus put a lot of burden, first of all, on the reviewer, and secondly on the regular PDF.js contributors who have to maintain this code; all in all, I'm thus suggesting that we WONTFIX this issue.
/cc @timvandermeij
That seems fair to me indeed. Let's close this for now.
I'd like to iterate over (but not render) all PDF pages of a file that is accessible via the HTML5 File API but is large (from 500 MB up to a few GB).
Loading the whole file using FileReader usually crashes the browser.
From my testing it seems that using DocumentInitParameters.data together with pdfDataRangeTransport doesn't work, since pdfDataRangeTransport appears to only work with DocumentInitParameters.url.
Is it possible to enable chunked loading for local files at all?
Thanks.
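For context, the iteration intended is roughly this pattern (a sketch; `pdf` stands in for the document proxy returned by `pdfjsLib.getDocument(...).promise`, stubbed here so the snippet is self-contained):

```javascript
// Iterate over (but do not render) every page of a loaded PDF.
// A stub document proxy is used so this sketch runs standalone; in real
// code `pdf` comes from pdfjsLib.getDocument(...).promise.
const pdf = {
  numPages: 3,
  getPage: (n) => Promise.resolve({ pageNumber: n, cleanup() {} }),
};

async function forEachPage(pdf, fn) {
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n); // page numbers are 1-based in pdf.js
    await fn(page);
    page.cleanup(); // release per-page resources; matters for huge files
  }
}
```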