ssimms / pdfapi2

Create, modify, and examine PDF files in Perl

Incorrect format for PDF Dates #62

Closed sciurius closed 1 year ago

sciurius commented 1 year ago

According to PDF 1.7 section 7.9.4, the full format is YYYYMMDDHHmmSSOHH'mm'. Both the offset HH and the offset mm, if present, must be followed by an apostrophe character. Setting a valid date such as 20230313194003+01'00' results in the error Invalid date string: D:20230313194003+01'00' at ....

The check on the date format was introduced in 2.042. The regexp in sub _is_date (PDF/API2, around line 555) should be:

    return unless $value =~ /^D:([0-9]{4})        # D:YYYY (required)
                             (?:([01][0-9])       # Month (01-12)
                             (?:([0123][0-9])     # Day (01-31)
                             (?:([012][0-9])      # Hour (00-23)
                             (?:([012345][0-9])   # Minute (00-59)
                             (?:([012345][0-9])   # Second (00-59)
                             (?:([Z+-])           # UT Offset Direction
                             (?:([012][0-9]\')    # UT Offset Hours plus apostrophe
                             (?:([012345][0-9]\') # UT Offset Minutes plus apostrophe
                             )?)?)?)?)?)?)?)?$/x;
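
To try the proposed pattern outside of PDF::API2, a minimal sketch along these lines may help. The wrapper name is_strict_pdf_date is hypothetical and not part of the module; the real check lives in _is_date and may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical standalone wrapper around the regex proposed above.
# It is an illustration only, not the code shipped in PDF::API2.
sub is_strict_pdf_date {
    my ($value) = @_;
    return $value =~ /^D:([0-9]{4})        # D:YYYY (required)
                      (?:([01][0-9])       # Month (01-12)
                      (?:([0123][0-9])     # Day (01-31)
                      (?:([012][0-9])      # Hour (00-23)
                      (?:([012345][0-9])   # Minute (00-59)
                      (?:([012345][0-9])   # Second (00-59)
                      (?:([Z+-])           # UT Offset Direction
                      (?:([012][0-9]\')    # UT Offset Hours plus apostrophe
                      (?:([012345][0-9]\') # UT Offset Minutes plus apostrophe
                      )?)?)?)?)?)?)?)?$/x ? 1 : 0;
}

# Try a few date strings, from year-only up to a full offset.
for my $date ("D:2023", "D:20230313194003Z",
              "D:20230313194003+01'00'", "D:20230313194003+01'00") {
    printf "%-26s %s\n", $date,
        is_strict_pdf_date($date) ? 'accepted' : 'rejected';
}
```

With this pattern the first three strings are accepted and the last one, which lacks the apostrophe after the offset minutes, is rejected.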
sciurius commented 1 year ago

Some additional information:

PDF Reference specifications 1.4 through 1.7 explicitly state that the apostrophe after offset hours and minutes is part of the syntax, and hence must be there.

The ISO-approved version of PDF Reference 1.7 specifies no apostrophe after the minutes part.

When PDF::API2 generates a PDF document, the file starts with "%PDF-1.4", so I'd say the Adobe PDF 1.4 reference is the authoritative one here. It says that the trailing apostrophe is mandatory for both the timezone hours and the timezone minutes.

For practical purposes, I would go for the final apostrophe being optional.

                       (?:([012345][0-9]\'?) # UT Offset Minutes plus optional apostrophe
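
The effect of making only the final apostrophe optional can be seen in isolation with a sketch that checks just the offset part. The names $lenient_offset and offset_ok are illustrations, not code from PDF::API2.

```perl
use strict;
use warnings;

# Illustrative offset-only pattern (hypothetical name): the apostrophe
# after the hours stays mandatory, the one after the minutes is optional.
my $lenient_offset = qr/^[Z+-](?:[012][0-9]'(?:[012345][0-9]'?)?)?$/;

sub offset_ok {
    my ($offset) = @_;
    return $offset =~ $lenient_offset ? 1 : 0;
}

for my $offset ("Z", "+01'", "+01'00'", "+01'00", "+0100") {
    printf "%-8s %s\n", $offset, offset_ok($offset) ? 'accepted' : 'rejected';
}
```

Under this variant both +01'00' and +01'00 pass, while +0100 is still rejected because the apostrophe after the hours remains required.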
ssimms commented 1 year ago

I hadn't noticed (or, if I did, had since forgotten) that the Adobe and ISO versions of PDF 1.7 aren't the same. That's unfortunate. In any case, I've been working from the ISO version when adding/updating code. Thanks for making me aware of the differences between the specifications.

My current reading of the ISO versions says that the apostrophe after the offset hour is optional if there isn't an offset minute:

The APOSTROPHE following the hour offset field (HH) shall only be present if the HH field is present. The minute offset field (mm) shall only be present if the APOSTROPHE following the hour offset field (HH) is present.

On that basis, I've made both apostrophes optional, unless both the offset hour and minute are present, in which case there must be an apostrophe between them. I've also added a bunch of tests for the various valid formats and some invalid ones (the code isn't trying to catch all invalid dates, just egregious format errors), just to confirm that all the other variations are working.
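
One way to read the rule described above is sketched below for the offset part only. This is an assumption about the behavior, not the code that went into PDF::API2; the names $offset_re and offset_is_valid are hypothetical.

```perl
use strict;
use warnings;

# Sketch of the rule as described: both apostrophes are optional, except
# that an apostrophe must separate the offset hours from the offset minutes.
my $offset_re = qr/^[Z+-](?:[012][0-9](?:'(?:[012345][0-9]'?)?)?)?$/;

sub offset_is_valid {
    my ($offset) = @_;
    return $offset =~ $offset_re ? 1 : 0;
}

# Each entry pairs an offset string with the result this sketch should give.
my @cases = (
    ["Z",       1],
    ["+01",     1],
    ["+01'",    1],
    ["+01'00",  1],
    ["+01'00'", 1],
    ["+0100",   0],   # hours and minutes with no apostrophe between them
);

for my $case (@cases) {
    my ($offset, $expected) = @$case;
    my $got = offset_is_valid($offset);
    printf "%-8s %-8s %s\n", $offset, $got ? 'valid' : 'invalid',
        $got == $expected ? 'as expected' : 'UNEXPECTED';
}
```

Like the real check, the sketch only looks at the overall shape of the string and does not attempt to reject every out-of-range value.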

So, specifically, these two cases now work but didn't previously: