pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Splitting a content stream in the middle of a string #208

Open pesco opened 1 year ago

pesco commented 1 year ago

When a content stream is split across multiple stream objects by specifying an array in a page's Contents entry, the description of Contents in Table 31 states:

If the value is an array, the effect shall be as if all of the streams in the array were concatenated with at least one white-space character added between the streams' data, in order, to form a single stream.

(Emphasis mine.) This sounds sensible. It ensures that the stream boundaries coincide with token boundaries. However, it leaves it up to the implementation to decide which whitespace character(s) to "imagine" between the streams. That would be fine if all whitespace was treated the same between tokens, but as far as I can tell from 7.2 "Lexical conventions" and 7.3.4 "String objects", a literal string such as (Hello World) is not considered a single token.

Therefore, a content stream as in the following appears to be not well-defined:

4 0 obj
    << /Type /Page /Contents [5 0 R 8 0 R] ... >>
endobj

5 0 obj
    << /Length 30 >>
stream
BT /F1 24 Tf 100 100 Td (Hello
endstream
endobj

8 0 obj
    << /Length 12 >>
stream
World) Tj ET
endstream
endobj

Is the string in question Hello World, Hello\nWorld, or Hello \n \n \n\n\nWorld? Or is the above illegal somehow?
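The ambiguity can be reproduced with a minimal sketch. It assumes a processor that joins the stream data at the byte level and then reads the literal string; the helper name `joined_string_operand` is made up for illustration, and the regex ignores escapes and nested parentheses, which this example does not use.

```python
import re

# Stream data copied from objects 5 and 8 in the example above.
STREAM_5 = b"BT /F1 24 Tf 100 100 Td (Hello"
STREAM_8 = b"World) Tj ET"

def joined_string_operand(separator: bytes) -> bytes:
    """Concatenate the two streams with `separator` and return the bytes
    between the first pair of parentheses (no escapes or nesting handled)."""
    content = STREAM_5 + separator + STREAM_8
    match = re.search(rb"\((.*?)\)", content, re.DOTALL)
    return match.group(1)

# Three separators that all satisfy "at least one white-space character".
candidates = [b" ", b"\n", b" \n \n \n\n\n"]
operands = [joined_string_operand(sep) for sep in candidates]
for sep, operand in zip(candidates, operands):
    print(repr(sep), "->", repr(operand))
```

Each separator is permitted by the wording of Table 31, yet each yields a different operand for Tj.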

Recommendation for correction

Clearly define what is to be inserted when joining the streams. E.g. replace the phrase "at least one white-space character" in Table 31 with "exactly one SPACE (20h) character".

Additional Context

Test file: hello2-fixed.pdf

lrosenthol commented 1 year ago

Is the string in question Hello World, Hello\nWorld, or Hello \n \n \n\n\nWorld?

All of those are perfectly acceptable choices.

That PDF appears to be invalid - it is killing multiple parsers that I have tried it on. Did you produce this by hand or using some proper PDF creation tool/library?

MatthiasValvekens commented 1 year ago

The offset for object 8 was off by two bytes, apparently. I think it should've been 747, not 749. hello2-fixed.pdf

lrosenthol commented 1 year ago

Thanks @MatthiasValvekens

Using the -fixed version, it opens and displays just fine in Acrobat but only displays a blank page in Apple Preview, PDF Expert and Foxit (Mac).

MatthiasValvekens commented 1 year ago

Actually, I think the current rule is even more problematic than it first appears. In general, there's no guarantee that bytes that would be considered white space in PDF data would actually map to white space glyphs.

Real-world example: https://www.public.asu.edu/~hdavulcu/siam02.pdf. This file contains text rendered with Type3 fonts with an encoding that starts its alphabet glyphs at codepoints 0x01, 0x02, .... In other words, white space and control characters would be rendered as actual letters. So, if this file were to contain a content stream split across a string (fortunately it doesn't), the resulting rendering would be totally undefined.

Regardless, allowing splitting strings down the middle seems like a bad idea because white space might not render as white space. If we can tighten that up, then I'm all for it: e.g., we could only allow a stream to terminate after an operator. Anyway, if that's a bridge too far, I think we should at least tell people exactly what white space to put in.

lrosenthol commented 1 year ago

If we can tighten that up, then I'm all for it: e.g., we could only allow a stream to terminate after an operator.

Perhaps a better statement might be something like:

Any individual content stream shall not contain an incomplete operator or operand. For example, a string object, serving as the operand, cannot be split across two streams.
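A checker for one part of such a rule might look like the sketch below. It only detects a stream that ends inside a literal string; the function name is made up, and hex strings, dictionaries, arrays and inline images are deliberately left out.

```python
def ends_inside_literal_string(data: bytes) -> bool:
    """Return True if `data` ends while a literal string (...) is still open.

    Handles backslash escapes, nested balanced parentheses and comments;
    hex strings, dictionaries and inline images are deliberately ignored.
    """
    depth = 0  # nesting depth of unescaped parentheses
    i = 0
    while i < len(data):
        byte = data[i]
        if depth > 0:
            if byte == 0x5C:      # backslash: skip the escaped byte
                i += 1
            elif byte == 0x28:    # '(' nested inside the string
                depth += 1
            elif byte == 0x29:    # ')' closes one nesting level
                depth -= 1
        elif byte == 0x28:        # '(' opens a literal string
            depth = 1
        elif byte == 0x25:        # '%': comment runs to the end of the line
            while i < len(data) and data[i] not in (0x0A, 0x0D):
                i += 1
        i += 1
    return depth > 0

print(ends_inside_literal_string(b"BT /F1 24 Tf 100 100 Td (Hello"))
print(ends_inside_literal_string(b"BT (Hello World) Tj ET"))
```

Applied to object 5 from the top post, the check flags the stream as ending mid-string, while a stream holding the complete string passes.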

MatthiasValvekens commented 1 year ago

Yes, fair enough, that's a less extreme change, so probably a safer one to make.

One additional thought: maybe we also need to explicitly say that incomplete inline image data in a stream is also not allowed.

pesco commented 1 year ago

The offset for object 8 was off by two bytes

Thanks Matthias. Happens every time. 😠 I've adjusted the top post with your file.

petervwyatt commented 1 year ago

PDF TWG agree in principle - will wordsmith for future PDF TWG meeting.

7.8.2 (para 3) already states "The operands needed by an operator shall precede it in the stream." which implies splitting can only occur between operators. Need to check other wording around the content stream array...

petervwyatt commented 1 year ago

7.1 also states "Content streams. A PDF content stream contains a sequence of instructions describing the appearance of a page or other graphical entity. These instructions, while also represented as objects, are conceptually distinct from the objects that represent the document structure and are described separately. ..."

petervwyatt commented 1 year ago

AFAICT the only wording for how/where content stream arrays can be split is located in Table 31 in the description of the Contents entry: "The value shall be either a single stream or an array of streams. If the value is an array, the effect shall be as if all of the streams in the array were concatenated with at least one white-space character added between the streams’ data, in order, to form a single stream. PDF writers can create image objects and other resources as they occur, even though they interrupt the content stream. The division between streams may occur only at the boundaries between lexical tokens (see 7.2, "Lexical conventions") but shall be unrelated to the page’s logical content or organisation. ... "

As noted above, 7.8.2 3rd para also states "The operands needed by an operator shall precede it in the stream." which implies that splitting is intended to occur between operators (so that all operands are in the same stream containing the operator).

This also has no impact on other discussions related to resource inheritance and wording around "content stream" or "array of streams".

The proposed solution is then to simply point Table 31 at 7.8.2 (rather than 7.2 Lexical conventions) as follows - since this makes it a file format requirement and not a processor requirement:

The division between streams may occur only at the boundaries between operators (see 7.8.2, "Content streams") but ...

The fact that many implementations support splits at other locations (e.g. between operands) is then just permissive behaviour. And splitting mid-object is just plain bad as described above (another example I can think of is a comment that is not terminated by an EOL in the first stream - does it then span into the next stream until the next EOL??).
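The comment case can be sketched as follows. The two stream bodies are hypothetical, and `strip_comments` is a made-up helper that assumes no '%' appears inside a string. The first stream ends in a comment with no EOL, so the choice of separator decides how much of the second stream the comment swallows:

```python
import re

def strip_comments(content: bytes) -> bytes:
    """Remove each '%' comment up to (not including) the next CR or LF.
    Assumes no '%' occurs inside a string, which holds for this example."""
    return re.sub(rb"%[^\r\n]*", b"", content)

# Hypothetical stream bodies; the comment in FIRST has no terminating EOL.
FIRST = b"q % save state"
SECOND = b"1 0 0 RG\n0 0 10 10 re S Q"

joined_with_space = strip_comments(FIRST + b" " + SECOND)
joined_with_newline = strip_comments(FIRST + b"\n" + SECOND)

print(joined_with_space.split())
print(joined_with_newline.split())
```

With a SPACE separator the RG operator and its operands disappear into the comment; with a LINE FEED they survive.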

pesco commented 1 year ago

As noted above, 7.8.2 3rd para also states "The operands needed by an operator shall precede it in the stream." which implies that splitting is intended to occur between operators (so that all operands are in the same stream containing the operator).

It seems to me that this reading might clash with the resolution of #9. My understanding of that erratum is that there is a conceptual difference between the content stream and the individual stream objects that are combined to form it. I.e. the content stream is the result of the combination. The full text of the paragraph you quoted from is:

A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical conventions". It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See Example 4 in 7.4, "Filters" for an example of a content stream.

Emphasis mine. It seems clear that "the stream" refers to the "content stream", and I would (after #9) understand that as the combined whole, not an individual stream object.

(another example I can think of is a comment that is not terminated by an EOL in the first stream - does it then span into the next stream until the next EOL??).

A very good question. 7.2.4 "Comments" states that a comment "consists of all characters ... up to ... the end-of-line marker". It even mentions that comments can occur in content streams but makes no reference to the possibility of "stream breaks" as we might call them. So I'd say the comment must span across the break into the next stream object, unless that paragraph is amended.

Circling back, though, I think the essence of this issue is that the text of Table 31 makes the combining operation sound like it happens at the lexical or even byte level, but leaves its description too unspecific to produce an unambiguous interpretation at the syntactic, and by extension, the semantic level.

The way I see it, there are two options: Leave the combining operation at the lexical level but unambiguously define it, or lift it to the syntactic level and make it clear which syntactic objects must be "atomic" within a stream object.

pesco commented 1 year ago

A nit-pick:

The division between streams may occur only at the boundaries between operators (see 7.8.2, "Content streams") but ...

This sounds like "operators" occur adjacent to each other ("boundaries between"), which is not the case. The adjacent structures you mean, I assume, are the combination of an operator and its operands. I don't know if that has a name within the PDF spec, but one might call it an operator expression.

lrosenthol commented 1 year ago

The fact that many implementations support splits at other locations (e.g. between operands) is then just permissive behaviour.

Or more likely, simply a result of implementation. For example, if an implementation merges all streams in an array together into a single stream before consuming it - then none of this stuff factors in. Only if the implementation processes each stream individually before moving on to the next one would a problem occur. But as we always say - implementation independent.
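The two strategies can be contrasted with a small sketch using the stream data from the top post. The helper `complete_strings` is made up and only recognises complete, non-nested literal strings; the single SPACE used for merging is an arbitrary choice.

```python
import re

# Stream data copied from objects 5 and 8 in the top post.
STREAM_5 = b"BT /F1 24 Tf 100 100 Td (Hello"
STREAM_8 = b"World) Tj ET"

def complete_strings(data: bytes) -> list[bytes]:
    """Return the contents of every complete, non-nested literal string."""
    return re.findall(rb"\(([^()]*)\)", data)

# Strategy A: merge first (here with one SPACE, an arbitrary choice), then scan.
merged_first = complete_strings(STREAM_5 + b" " + STREAM_8)

# Strategy B: scan each stream object on its own and collect the results.
per_stream = complete_strings(STREAM_5) + complete_strings(STREAM_8)

print("merged first:", merged_first)
print("per stream:  ", per_stream)
```

The merging strategy recovers a string operand, while the per-stream strategy never sees a complete one.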

lrosenthol commented 1 year ago

another example I can think of is a comment that is not terminated by an EOL in the first stream - does it then span into the next stream until the next EOL??

Depends on implementation as I note in the comment above...

lrosenthol commented 1 year ago

Circling back, though, I think the essence of this issue is that the text of Table 31 makes the combining operation sound like it happens at the lexical or even byte level

I will note that the key word in that paragraph is "effect", as in "the effect shall be as if". Not a great standards word, I agree, but (IMO) it is intended to imply processor behavior and not syntax or lexical definition.

pesco commented 1 year ago

Or more likely, simply a result of implementation. For example, if an implementation merges all streams in an array together into a single stream before consuming it - then none of this stuff factors in. Only if the implementation processes each stream individually before moving on to the next one would a problem occur. But as we always say - implementation independent.

I'm confused, are you saying that the rendering of a PDF file may differ by implementation in the way that is being discussed here?

lrosenthol commented 1 year ago

I'm confused, are you saying that the rendering of a PDF file may differ by implementation in the way that is being discussed here?

Absolutely!

Many aspects of the PDF syntax do not have matching processor rules - or at least not strict ones. This is a good example. There is nothing in the standard that tells a processor HOW to find the content stream(s) of a page and then once located the way in which to consume them. It simply tells you that they can be found in a specific place in the file structure and when you process them, it "shall have the effect as if". That's it!

pesco commented 1 year ago

I'm confused, are you saying that the rendering of a PDF file may differ by implementation in the way that is being discussed here?

Absolutely!

I think Matthias Valvekens provided all the necessary evidence to the contrary above.

lrosenthol commented 1 year ago

I think Matthias Valvekens provided all the necessary evidence to the contrary above.

Really?

I think he actually backed me up with his statement:

So, if this file were to contain a content stream split across a string (fortunately it doesn't), the resulting rendering would be totally undefined.

lrosenthol commented 1 year ago

@pesco I would note that some of these conversations would be more productive if you would consider joining the PDF Association and participating in the meetings, where we discuss and debate the finer points of all of these issues. Only being able to communicate asynchronously with you reduces your ability to bring your position to the group.

MatthiasValvekens commented 1 year ago

@petervwyatt I think you may have meant to post that analysis on #102 :)

petervwyatt commented 1 year ago

Oops!! Thanks @MatthiasValvekens. I have now removed the comment from here and added it to the correct issue so as not to ruin/confuse this issue.

petervwyatt commented 1 year ago

This and other related/interlinked issues (#194 #199 #201 #209) are being discussed in the "Securing PDF" DG of ISO TC 171 SC 2 WG 8 and will be labelled as "Parked" here in GitHub until such time as a set of solutions can be proposed.