Closed gwierzchowski closed 5 years ago
@gwierzchowski for the outlines the most important element is the Destination object in PDF Spec. Section 12.3.2. Although the spec is not as clear but what I can on a cursory glance gather the page
described in section 12.3.2.2 should be a global page number. Did you see PageLabels
being used there. In such cases, it may be apt to use the pdDocGetPage
or pdDocGetPageRange
with range of int as params than converting the page number to String
and then querying the String
through a pdDocGetPageRange
.
@gwierzchowski on second thoughts, while an empty range may be fine as a return value or a better error message like invalid page label
can be thrown as well, I do not think interpreting integers to ranges should be implemented. It does not have any meaning from the PDF spec perspective.
@sambitdash, Hi. Ad question1: Yes pages associate with outline items are little tricky and generally go into 2 (main) branches:
Dest
key - what is easier part and I already have this doneA
key - what is harder and leads to GoTo actions and Names
entry in catalogDoing second part I met common structure referred as name tree
in PDF spec. Before going into coding, I did search if You maybe already implemented reading such structures and I found out that You actually do.
Then I reviewed the code and found issues reported here.
So I probably will not use directly the function pdPageGetPage
or pdPageGetPageRange
but I will use underlying find_ntree
to read and search Names
catalog entry. I may however need PageLabels
as I want to associate outline item with appropriate page label (if page labels are present in document) - not have details yet, but will relay on my fixes.
BTW: Unfortunately it takes some time because I have to sometimes put this work down for a while, so not working on this everyday.
Ad question2:
I think it is good and convenient if numeric page number given as string be properly interpreted as valid page label. This is what actually most PDF viewers does (e.g. when you use goto-page function).
Before my changes it depended on weather PDF had (optional) PageLabels
entry, after the change it does not - what I think is desired. I think that function pdDocGetPageRange(doc::PDDoc, label::AbstractString)
should more-less behave as viewers goto-page option and should not depend on internal PDF structure (i.e. existence of PageLabels
entry).
Technically, in the patch I return range(pgnum, length=1)
because the function generally return a range: see comment before method find_page_label
. I'm not sure if this has to be continue object (aka range), or maybe even it could be more general page numbers vector (if one may shift back page counter by respective entry in PageLabels
) - maybe we can consider this.
@sambitdash, Hi. Ad question1: Yes pages associate with outline items are little tricky and generally go into 2 (main) branches:
- using
Dest
key - what is easier part and I already have this done- using
A
key - what is harder and leads to GoTo actions andNames
entry in catalog
Even here also everything finally gets mapped to the explicit destination kind of structure. Hence, if you keep a mapping of a outline
to explicit destination
you are most likely covering everything.
Doing second part I met common structure referred as
name tree
in PDF spec. Before going into coding, I did search if You maybe already implemented reading such structures and I found out that You actually do.
Yes ntree
is the right structure to use. It's parametric for name
as well as number tree
based on the input type.
Then I reviewed the code and found issues reported here. So I probably will not use directly the function
pdPageGetPage
orpdPageGetPageRange
but I will use underlyingfind_ntree
to read and searchNames
catalog entry.
You can always use ntree
to find the page numbers and then use pdPageGetPage
or pdPageGetPageRange
to get to the actual page structures. That keeps the APIs granular and reusable for other purposes as well.
I may however need
PageLabels
as I want to associate outline item with appropriate page label (if page labels are present in document) - not have details yet, but will relay on my fixes.
I am not sure if you can really get it done without parsing the PDF page content. PageLabels
are not unique. One PageLabels
entry can be mapped to more than one page. There is also no guarantee that the PageLabel
used is actually shown on the content stream as well. So outline
does not necessarily match the document Table of Contents
. They mostly match but PDF specification does not mandate it.
BTW: Unfortunately it takes some time because I have to sometimes put this work down for a while, so not working on this everyday.
No problems we all are in the same boat on this :-). I am also not pressed for time on this feature. So you can implement as time permits.
Ad question2: I think it is good and convenient if numeric page number given as string be properly interpreted as valid page label. This is what actually most PDF viewers does (e.g. when you use goto-page function). Before my changes it depended on weather PDF had (optional)
PageLabels
entry, after the change it does not - what I think is desired. I think that functionpdDocGetPageRange(doc::PDDoc, label::AbstractString)
should more-less behave as viewers goto-page option and should not depend on internal PDF structure (i.e. existence ofPageLabels
entry).
I think this is where we need to keep APIs and viewer behavior different. APIs are meant for granular control and absolutely predictive behavior. A consumer of API should use the APIs to achieve more usable behavior for his application. The low level APIs, as these are currently, should not be overloaded to introduce alternate interpretations. Consumers based on the return value or errors thrown can take the corrective actions and invoke the APIs properly to get the desired output. Throwing an error is absolutely acceptable API behavior as long as the thrown error is something clearly decipherable to take corrective action. The way the APIs are in PDFIO
today, we expect the consumer to understand PDF specification to some extent. In future we may introduce more task oriented APIs, which keep PDF specification transparent to the end consumer. #38 is another case in the same direction. This approach ensures the lower level APIs become robust and practically can be used for any task without needing additional parameter overloading to enhance or reduce behavior.
Technically, in the patch I
return range(pgnum, length=1)
because the function generally return a range: see comment before methodfind_page_label
. I'm not sure if this has to be continue object (aka range), or maybe even it could be more general page numbers vector (if one may shift back page counter by respective entry inPageLabels
) - maybe we can consider this.
PageLabels
to my knowledge are contiguous from the way it's defined in the catalog object. However, there is a possibility that the same PageLabel
may represent multiple contiguous ranges. Particularly, if you are labeling a range of documents as hidden
say. I have not interpreted the spec to this details though. So the output can be a dis-contiguous range (in case Julia supports this) or multiple contiguous ranges. I have kept the API to return AbstractRange
for this reason.
Ad question1:
I am not sure if you can really get it done without parsing the PDF page content.
PageLabels
are not unique. OnePageLabels
entry can be mapped to more than one page. There is also no guarantee that thePageLabel
used is actually shown on the content stream as well. So outline does not necessarily match the document "Table of Contents".
Yes, understand this. Object that i plan return from pdDocGetOutline
will contain a doctrinaires associated with each outline position, and every such dictionary will actually have 3 keys related to the page: item[:PageRef], item[:PageNo], item[:PageLabel]
, which stands for page direct reference (object of type CosIndirectObjectRef
), raw numeric page number and page label. First 2 keys will always be present, and the last key will be optional end will not exist if PDF does not have PageLabels
entry. So API caller can use first 2 keys to get to the proper page, and may use last item (if exists) to display (like in TOC).
PS: Just now I came to idea: maybe it would be worth to add new API pdDocGetPageLabel(doc, pgnum::Int)::Union{AbstractString, Nothing}
.
Ad question2:
I think this is where we need to keep APIs and viewer behavior different. APIs are meant for granular control and absolutely predictive behavior. A consumer of API should use the APIs to achieve more usable behavior for his application. The low level APIs, as these are currently, should not be overloaded to introduce alternate interpretations.
My feeling was that: cos*
functions and types is this kind of low level API which user use having PDF spec at hand, while pdDoc*
is higher / functional level API for which user does not necessarily need to have a knowledge about PDF internals. But of course you are the author so have privilege to drive how we do the things :). But maybe following this cos/pd distinction move the tryparse
code from cosDocGetPageNumbers
to pdDocGetPageRange
so the first function would signal error if labels are not there, and last higher level function will catch this and try to fallback to page if passed string can be converted to number (it will still raise error if requested page is beyond existing pages)?
Consumers based on the return value or errors thrown can take the corrective actions and invoke the APIs properly to get the desired output. Throwing an error is absolutely acceptable API behavior as long as the thrown error is something clearly decipherable to take corrective action.
Little bit disagree. I'm doing programming in C++ and Java for quite bit time, and was always educated to refrain from throwing exceptions to signal any kind of situation that can normally happen (in regular program flow) - like something is not present or not found etc. Exceptions should only be used when something is really breaking and abnormal - like input does not conform to declared specification, system error, required file is missing etc. This is because 2 main reasons:
In this case the functions cosDocGetPageNumbers
or pdDocGetPageRange
are find-like functions, and caller may not know if required label/page does exist in document. So one may consider turning those functions to return empty range or empty vector (second function) when page cannot be found. In general at minimum cosDocGetPageNumbers
should be changed a bit to not raise 'MethodError
exception.
So in summary - please propose in which direction I should change the code. I agree that maybe cos*
family function should not do too much tricks.
pdDocGetOutline
, I will be able to comment when I see the complete design or a significant part of the implementation. PDFIO
represents the model and not viewer. From that viewpoint, I do not see any reason global page numbers be mapped to a string and deciphered back to integer. While that behavior may be desirable for a viewer as go to page needs to be represented as a textedit conrol
in most UI frameworks, it's not a requirement for an API of PDF model and not true for PDF specification. An additional method pdDocHasPageLevels
as suggested in #41 can address possible scenarios in most cases.E_INVALID_PAGE_LABEL
in case the label
is not found similar to E_INVALID_PAGE_NUMBER
as I do not necessarily see any difference between both. Ok, So I'll propose following to move on:
E_INVALID_PAGE_LABEL
to errors.ji as you suggestedE_INVALID_PAGE_NUMBER
with E_INVALID_PAGE_LABEL
in find_page_label
as this function actually looks for labelstryparse
from cosDocGetPageNumbers
and stick with code:
if ref === CosNull
throw(ErrorException(XXX))
end
Here is a question, I think more informative would be here adding new kind of error: E_LABELS_NOT_PRESENT. So let me know if add and use this new error in place of XXX or throw here E_INVALID_PAGE_LABEL.
tryparse
at all.pdDocHasPageLabels()
as of #41I'd like to close this item then merge final code to my Outlines branch to move on with this staff
Ok, So I'll propose following to move on:
- Add
E_INVALID_PAGE_LABEL
to errors.ji as you suggested- Replace
E_INVALID_PAGE_NUMBER
withE_INVALID_PAGE_LABEL
infind_page_label
as this function actually looks for labels
Not sure if that's true for all cases. But, wherever it's relevant to labels the error should be switched.
- Remove
tryparse
fromcosDocGetPageNumbers
and stick with code:if ref === CosNull throw(ErrorException(XXX)) end
ref === CosNull && throw(ErrorException(XXX))
may be a better syntax
Here is a question, I think more informative would be here adding new kind of error: E_LABELS_NOT_PRESENT. So let me know if add and use this new error in place of XXX or throw here E_INVALID_PAGE_LABEL.
Too many error messages of similar kind creates more confusion. Hence, if there is no significant loss in meaning E_INVALID_PAGE_LABEL
is just fine.
- No changes in 'pdDocGetPageRange' - i.e. no fallback to
tryparse
at all.- Implement
pdDocHasPageLabels()
as of #41I'd like to close this item then merge final code to my Outlines branch to move on with this staff
Hi, I have committed changes to PR #40 - please review. Thanks.
Hi. Working on Outline support implementation I got following error:
Fails with:
MethodError: no method matching get(::CosNullType, ::CosName)
I think
pdDocGetPageRange
should fall back to parsing label as number and return appropriate page if there are no labels dictionary in PDF or at least return empty vector.I have ready fix so would like to submit PR.