sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.
Other
127 stars 13 forks source link

`cosDocGetPageNumbers` crashes when there is no `PageLabels` in PDF catalog #39

Closed gwierzchowski closed 5 years ago

gwierzchowski commented 5 years ago

Hi. Working on Outline support implementation I got following error:

        @test begin
            filename="files/1.pdf"
            DEBUG && println(filename)
            doc = pdDocOpen(filename)
            @assert length(pdDocGetPageRange(doc, "1")) >= 1
            pdDocClose(doc)
            length(utilPrintOpenFiles()) == 0
        end

Fails with: MethodError: no method matching get(::CosNullType, ::CosName)

I think pdDocGetPageRange should fall back to parsing label as number and return appropriate page if there are no labels dictionary in PDF or at least return empty vector.

I have ready fix so would like to submit PR.

sambitdash commented 5 years ago

@gwierzchowski for the outlines the most important element is the Destination object in PDF Spec. Section 12.3.2. Although the spec is not as clear but what I can on a cursory glance gather the page described in section 12.3.2.2 should be a global page number. Did you see PageLabels being used there. In such cases, it may be apt to use the pdDocGetPage or pdDocGetPageRange with range of int as params than converting the page number to String and then querying the String through a pdDocGetPageRange.

sambitdash commented 5 years ago

@gwierzchowski on second thoughts, while an empty range may be fine as a return value or a better error message like invalid page label can be thrown as well, I do not think interpreting integers to ranges should be implemented. It does not have any meaning from the PDF spec perspective.

gwierzchowski commented 5 years ago

@sambitdash, Hi. Ad question1: Yes pages associate with outline items are little tricky and generally go into 2 (main) branches:

Doing second part I met common structure referred as name tree in PDF spec. Before going into coding, I did search if You maybe already implemented reading such structures and I found out that You actually do. Then I reviewed the code and found issues reported here. So I probably will not use directly the function pdPageGetPage or pdPageGetPageRange but I will use underlying find_ntree to read and search Names catalog entry. I may however need PageLabels as I want to associate outline item with appropriate page label (if page labels are present in document) - not have details yet, but will relay on my fixes. BTW: Unfortunately it takes some time because I have to sometimes put this work down for a while, so not working on this everyday.

Ad question2: I think it is good and convenient if numeric page number given as string be properly interpreted as valid page label. This is what actually most PDF viewers does (e.g. when you use goto-page function). Before my changes it depended on weather PDF had (optional) PageLabels entry, after the change it does not - what I think is desired. I think that function pdDocGetPageRange(doc::PDDoc, label::AbstractString) should more-less behave as viewers goto-page option and should not depend on internal PDF structure (i.e. existence of PageLabels entry). Technically, in the patch I return range(pgnum, length=1) because the function generally return a range: see comment before method find_page_label. I'm not sure if this has to be continue object (aka range), or maybe even it could be more general page numbers vector (if one may shift back page counter by respective entry in PageLabels) - maybe we can consider this.

sambitdash commented 5 years ago

@sambitdash, Hi. Ad question1: Yes pages associate with outline items are little tricky and generally go into 2 (main) branches:

  • using Dest key - what is easier part and I already have this done
  • using A key - what is harder and leads to GoTo actions and Names entry in catalog

Even here also everything finally gets mapped to the explicit destination kind of structure. Hence, if you keep a mapping of a outline to explicit destination you are most likely covering everything.

Doing second part I met common structure referred as name tree in PDF spec. Before going into coding, I did search if You maybe already implemented reading such structures and I found out that You actually do.

Yes ntree is the right structure to use. It's parametric for name as well as number tree based on the input type.

Then I reviewed the code and found issues reported here. So I probably will not use directly the function pdPageGetPage or pdPageGetPageRange but I will use underlying find_ntree to read and search Names catalog entry.

You can always use ntree to find the page numbers and then use pdPageGetPage or pdPageGetPageRange to get to the actual page structures. That keeps the APIs granular and reusable for other purposes as well.

I may however need PageLabels as I want to associate outline item with appropriate page label (if page labels are present in document) - not have details yet, but will relay on my fixes.

I am not sure if you can really get it done without parsing the PDF page content. PageLabels are not unique. One PageLabels entry can be mapped to more than one page. There is also no guarantee that the PageLabel used is actually shown on the content stream as well. So outline does not necessarily match the document Table of Contents. They mostly match but PDF specification does not mandate it.

BTW: Unfortunately it takes some time because I have to sometimes put this work down for a while, so not working on this everyday.

No problems we all are in the same boat on this :-). I am also not pressed for time on this feature. So you can implement as time permits.

Ad question2: I think it is good and convenient if numeric page number given as string be properly interpreted as valid page label. This is what actually most PDF viewers does (e.g. when you use goto-page function). Before my changes it depended on weather PDF had (optional) PageLabels entry, after the change it does not - what I think is desired. I think that function pdDocGetPageRange(doc::PDDoc, label::AbstractString) should more-less behave as viewers goto-page option and should not depend on internal PDF structure (i.e. existence of PageLabels entry).

I think this is where we need to keep APIs and viewer behavior different. APIs are meant for granular control and absolutely predictive behavior. A consumer of API should use the APIs to achieve more usable behavior for his application. The low level APIs, as these are currently, should not be overloaded to introduce alternate interpretations. Consumers based on the return value or errors thrown can take the corrective actions and invoke the APIs properly to get the desired output. Throwing an error is absolutely acceptable API behavior as long as the thrown error is something clearly decipherable to take corrective action. The way the APIs are in PDFIO today, we expect the consumer to understand PDF specification to some extent. In future we may introduce more task oriented APIs, which keep PDF specification transparent to the end consumer. #38 is another case in the same direction. This approach ensures the lower level APIs become robust and practically can be used for any task without needing additional parameter overloading to enhance or reduce behavior.

Technically, in the patch I return range(pgnum, length=1) because the function generally return a range: see comment before method find_page_label. I'm not sure if this has to be continue object (aka range), or maybe even it could be more general page numbers vector (if one may shift back page counter by respective entry in PageLabels) - maybe we can consider this.

PageLabels to my knowledge are contiguous from the way it's defined in the catalog object. However, there is a possibility that the same PageLabel may represent multiple contiguous ranges. Particularly, if you are labeling a range of documents as hidden say. I have not interpreted the spec to this details though. So the output can be a dis-contiguous range (in case Julia supports this) or multiple contiguous ranges. I have kept the API to return AbstractRange for this reason.

gwierzchowski commented 5 years ago

Ad question1:

I am not sure if you can really get it done without parsing the PDF page content. PageLabels are not unique. One PageLabels entry can be mapped to more than one page. There is also no guarantee that the PageLabel used is actually shown on the content stream as well. So outline does not necessarily match the document "Table of Contents".

Yes, understand this. Object that i plan return from pdDocGetOutline will contain a doctrinaires associated with each outline position, and every such dictionary will actually have 3 keys related to the page: item[:PageRef], item[:PageNo], item[:PageLabel], which stands for page direct reference (object of type CosIndirectObjectRef), raw numeric page number and page label. First 2 keys will always be present, and the last key will be optional end will not exist if PDF does not have PageLabels entry. So API caller can use first 2 keys to get to the proper page, and may use last item (if exists) to display (like in TOC). PS: Just now I came to idea: maybe it would be worth to add new API pdDocGetPageLabel(doc, pgnum::Int)::Union{AbstractString, Nothing}.

Ad question2:

I think this is where we need to keep APIs and viewer behavior different. APIs are meant for granular control and absolutely predictive behavior. A consumer of API should use the APIs to achieve more usable behavior for his application. The low level APIs, as these are currently, should not be overloaded to introduce alternate interpretations.

My feeling was that: cos* functions and types is this kind of low level API which user use having PDF spec at hand, while pdDoc* is higher / functional level API for which user does not necessarily need to have a knowledge about PDF internals. But of course you are the author so have privilege to drive how we do the things :). But maybe following this cos/pd distinction move the tryparse code from cosDocGetPageNumbers to pdDocGetPageRange so the first function would signal error if labels are not there, and last higher level function will catch this and try to fallback to page if passed string can be converted to number (it will still raise error if requested page is beyond existing pages)?

Consumers based on the return value or errors thrown can take the corrective actions and invoke the APIs properly to get the desired output. Throwing an error is absolutely acceptable API behavior as long as the thrown error is something clearly decipherable to take corrective action.

Little bit disagree. I'm doing programming in C++ and Java for quite bit time, and was always educated to refrain from throwing exceptions to signal any kind of situation that can normally happen (in regular program flow) - like something is not present or not found etc. Exceptions should only be used when something is really breaking and abnormal - like input does not conform to declared specification, system error, required file is missing etc. This is because 2 main reasons:

In this case the functions cosDocGetPageNumbers or pdDocGetPageRange are find-like functions, and caller may not know if required label/page does exist in document. So one may consider turning those functions to return empty range or empty vector (second function) when page cannot be found. In general at minimum cosDocGetPageNumbersshould be changed a bit to not raise 'MethodError exception.

So in summary - please propose in which direction I should change the code. I agree that maybe cos* family function should not do too much tricks.

sambitdash commented 5 years ago
  1. On pdDocGetOutline, I will be able to comment when I see the complete design or a significant part of the implementation.
  2. If I were to look at a complete system as a Model-View-Controller (MVC) abstraction, PDFIO represents the model and not viewer. From that viewpoint, I do not see any reason global page numbers be mapped to a string and deciphered back to integer. While that behavior may be desirable for a viewer as go to page needs to be represented as a textedit conrol in most UI frameworks, it's not a requirement for an API of PDF model and not true for PDF specification. An additional method pdDocHasPageLevels as suggested in #41 can address possible scenarios in most cases.
  3. Performance penalty of exception handling of system programming language like C++ or SEH in Windows is very different from that of application programming. Exception handling on errors is a much cleaner representation in most cases than introducing additional control flows in normal programming logic. I am in favor of throwing an E_INVALID_PAGE_LABEL in case the label is not found similar to E_INVALID_PAGE_NUMBER as I do not necessarily see any difference between both.
gwierzchowski commented 5 years ago

Ok, So I'll propose following to move on:

I'd like to close this item then merge final code to my Outlines branch to move on with this staff

sambitdash commented 5 years ago

Ok, So I'll propose following to move on:

  • Add E_INVALID_PAGE_LABEL to errors.ji as you suggested
  • Replace E_INVALID_PAGE_NUMBER with E_INVALID_PAGE_LABEL in find_page_label as this function actually looks for labels

Not sure if that's true for all cases. But, wherever it's relevant to labels the error should be switched.

  • Remove tryparse from cosDocGetPageNumbers and stick with code:
    if ref === CosNull
        throw(ErrorException(XXX))
    end

ref === CosNull && throw(ErrorException(XXX)) may be a better syntax

Here is a question, I think more informative would be here adding new kind of error: E_LABELS_NOT_PRESENT. So let me know if add and use this new error in place of XXX or throw here E_INVALID_PAGE_LABEL.

Too many error messages of similar kind creates more confusion. Hence, if there is no significant loss in meaning E_INVALID_PAGE_LABEL is just fine.

  • No changes in 'pdDocGetPageRange' - i.e. no fallback to tryparse at all.
  • Implement pdDocHasPageLabels() as of #41

I'd like to close this item then merge final code to my Outlines branch to move on with this staff

gwierzchowski commented 5 years ago

Hi, I have committed changes to PR #40 - please review. Thanks.