wader / fq

jq for binary formats - tool, language and decoders for working with binary and text formats

zip: Fix incorrect time/date, add extended timestamp and refactor #793

Closed wader closed 8 months ago

wader commented 9 months ago

The MS-DOS time/date was read in the wrong order, and the decoder also did not take into account that the bit ranges in the shorts are little-endian.

Remodel modification_time/date to be one struct with fat_time and fat_date LE shorts, then synthetic values for day, hour, minute, etc., and also a unix field with the timestamp as Unix time.
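As an illustration of the bit layout described above, here is a minimal Python sketch (not fq's actual Go decoder) that reads the two little-endian shorts and then slices the bit ranges out of the resulting integers. The function name `decode_msdos_datetime` and the dictionary layout are assumptions made for this example. It applies the usual MS-DOS conventions: seconds are stored in 2-second units and the year is stored as an offset from 1980.

```python
import struct


def decode_msdos_datetime(raw: bytes) -> dict:
    """Decode 4 bytes: little-endian fat_time followed by little-endian fat_date."""
    # Read both shorts as little-endian first; the bit ranges only make
    # sense on the assembled 16-bit integers, not on the raw byte order.
    fat_time, fat_date = struct.unpack("<HH", raw)
    return {
        "fat_time": fat_time,
        "second": (fat_time & 0x1F) * 2,   # bits 0-4, stored in 2-second units
        "minute": (fat_time >> 5) & 0x3F,  # bits 5-10
        "hour": fat_time >> 11,            # bits 11-15
        "fat_date": fat_date,
        "day": fat_date & 0x1F,            # bits 0-4
        "month": (fat_date >> 5) & 0x0F,   # bits 5-8
        "year": (fat_date >> 9) + 1980,    # bits 9-15, years since 1980
    }


# Bytes "0d b0" and "55 57" as they appear in a ZIP local file header.
fields = decode_msdos_datetime(bytes.fromhex("0db05557"))
print(fields)
```

The sketch reports the scaled values (second 26, year 2023) rather than the raw stored values (13 and 43).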

Also refactor and clean up extra fields/extended code a bit.

Fixes #792

TomiBelan commented 9 months ago

day, month, year, hour, minute looks correct! Thanks! seconds is divided by 2, as expected. But, in unix: 1697925626 (2023-10-21T22:00:26Z), should it really be Z (Zulu time a.k.a. UTC)?

Example: (notice 20:00 vs 22:00)

[machine ~]$ touch foo
[machine ~]$ stat foo
  File: foo
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 8,5     Inode: 30503682    Links: 1
Access: (0644/-rw-r--r--)  Uid: (50000/    user)   Gid: (50000/    user)
Access: 2023-10-21 22:00:25.451721086 +0200
Modify: 2023-10-21 22:00:25.451721086 +0200
Change: 2023-10-21 22:00:25.451721086 +0200
 Birth: 2023-10-21 22:00:25.451721086 +0200
[machine ~]$ zip file.zip foo
  adding: foo (stored 0%)
[machine ~]$ go run github.com/wader/fq@zip-correct-date-time-fields d file.zip
    |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15|0123456789abcdef012345|.{}: file.zip (zip)
    |                                                                 |                      |  local_files[0:1]:
    |                                                                 |                      |    [0]{}: local_file
0x00|50 4b 03 04                                                      |PK..                  |      signature: raw bits (valid)
0x00|            0a 00                                                |    ..                |      version_needed: 10
    |                                                                 |                      |      flags{}:
0x00|                  00                                             |      .               |        unused0: 0
0x00|                  00                                             |      .               |        strong_encryption: false
0x00|                  00                                             |      .               |        compressed_patched_data: false
0x00|                  00                                             |      .               |        enhanced_deflation: false
0x00|                  00                                             |      .               |        data_descriptor: false
0x00|                  00                                             |      .               |        compression0: false
0x00|                  00                                             |      .               |        compression1: false
0x00|                  00                                             |      .               |        encrypted: false
0x00|                     00                                          |       .              |        reserved0: 0
0x00|                     00                                          |       .              |        mask_header_values: false
0x00|                     00                                          |       .              |        reserved1: false
0x00|                     00                                          |       .              |        language_encoding: false
0x00|                     00                                          |       .              |        unused1: 0
0x00|                        00 00                                    |        ..            |      compression_method: "none" (0)
    |                                                                 |                      |      last_modification{}:
0x00|                              0d b0                              |          ..          |        fat_time: 45069
    |                                                                 |                      |        second: 13
    |                                                                 |                      |        minute: 0
    |                                                                 |                      |        hour: 22
0x00|                                    55 57                        |            UW        |        fat_date: 22357
    |                                                                 |                      |        day: 21
    |                                                                 |                      |        month: 10
    |                                                                 |                      |        year: 43
    |                                                                 |                      |        unix: 1697925626 (2023-10-21T22:00:26Z)
0x00|                                          00 00 00 00            |              ....    |      crc32_uncompressed: 0x0
0x00|                                                      00 00 00 00|                  ....|      compressed_size: 0
0x16|00 00 00 00                                                      |....                  |      uncompressed_size: 0
0x16|            03 00                                                |    ..                |      file_name_length: 3
0x16|                  1c 00                                          |      ..              |      extra_field_length: 28
0x16|                        66 6f 6f                                 |        foo           |      file_name: "foo"
    |                                                                 |                      |      extra_fields[0:2]:
    |                                                                 |                      |        [0]{}: extra_field
0x16|                                 55 54                           |           UT         |          tag: 0x5455 (extended timestamp)
0x16|                                       09 00                     |             ..       |          size: 9
    |                                                                 |                      |          flags{}:
0x16|                                             03                  |               .      |            unused: 0
0x16|                                             03                  |               .      |            creation_time_present: false
0x16|                                             03                  |               .      |            access_time_present: true
0x16|                                             03                  |               .      |            modification_time_present: true
0x16|                                                d9 2d 34 65      |                .-4e  |          modification_time: 1697918425 (2023-10-21T20:00:25Z)
0x16|                                                            d9 2d|                    .-|          access_time: 1697918425 (2023-10-21T20:00:25Z)
0x2c|34 65                                                            |4e                    |
    |                                                                 |                      |        [1]{}: extra_field
0x2c|      75 78                                                      |  ux                  |          tag: 0x7875 (UNIX UID/GID)
0x2c|            0b 00                                                |    ..                |          size: 11
0x2c|                  01 04 50 c3 00 00 04 50 c3 00 00               |      ..P....P...     |          data: raw bits
    |                                                                 |                      |      uncompressed: raw bits
    |                                                                 |                      |  central_directories[0:1]:
    |                                                                 |                      |    [0]{}: central_directory
0x2c|                                                   50 4b 01 02   |                 PK.. |      signature: raw bits (valid)
0x2c|                                                               1e|                     .|      version_made_by: 798
0x42|03                                                               |.                     |
0x42|   0a 00                                                         | ..                   |      version_needed: 10
    |                                                                 |                      |      flags{}:
0x42|         00                                                      |   .                  |        unused0: 0
0x42|         00                                                      |   .                  |        strong_encryption: false
0x42|         00                                                      |   .                  |        compressed_patched_data: false
0x42|         00                                                      |   .                  |        enhanced_deflation: false
0x42|         00                                                      |   .                  |        data_descriptor: false
0x42|         00                                                      |   .                  |        compression0: false
0x42|         00                                                      |   .                  |        compression1: false
0x42|         00                                                      |   .                  |        encrypted: false
0x42|            00                                                   |    .                 |        reserved0: 0
0x42|            00                                                   |    .                 |        mask_header_values: false
0x42|            00                                                   |    .                 |        reserved1: false
0x42|            00                                                   |    .                 |        language_encoding: false
0x42|            00                                                   |    .                 |        unused1: 0
0x42|               00 00                                             |     ..               |      compression_method: "none" (0)
    |                                                                 |                      |      last_modification{}:
0x42|                     0d b0                                       |       ..             |        fat_time: 45069
    |                                                                 |                      |        second: 13
    |                                                                 |                      |        minute: 0
    |                                                                 |                      |        hour: 22
0x42|                           55 57                                 |         UW           |        fat_date: 22357
    |                                                                 |                      |        day: 21
    |                                                                 |                      |        month: 10
    |                                                                 |                      |        year: 43
    |                                                                 |                      |        unix: 1697925626 (2023-10-21T22:00:26Z)
0x42|                                 00 00 00 00                     |           ....       |      crc32_uncompressed: 0x0
0x42|                                             00 00 00 00         |               ....   |      compressed_size: 0
0x42|                                                         00 00 00|                   ...|      uncompressed_size: 0
0x58|00                                                               |.                     |
0x58|   03 00                                                         | ..                   |      file_name_length: 3
0x58|         18 00                                                   |   ..                 |      extra_field_length: 24
0x58|               00 00                                             |     ..               |      file_comment_length: 0
0x58|                     00 00                                       |       ..             |      disk_number_where_file_starts: 0
0x58|                           00 00                                 |         ..           |      internal_file_attributes: 0
0x58|                                 00 00 a4 81                     |           ....       |      external_file_attributes: 2175008768
0x58|                                             00 00 00 00         |               ....   |      relative_offset_of_local_file_header: 0
0x58|                                                         66 6f 6f|                   foo|      file_name: "foo"
    |                                                                 |                      |      extra_fields[0:2]:
    |                                                                 |                      |        [0]{}: extra_field
0x6e|55 54                                                            |UT                    |          tag: 0x5455 (extended timestamp)
0x6e|      05 00                                                      |  ..                  |          size: 5
    |                                                                 |                      |          flags{}:
0x6e|            03                                                   |    .                 |            unused: 0
0x6e|            03                                                   |    .                 |            creation_time_present: false
0x6e|            03                                                   |    .                 |            access_time_present: true
0x6e|            03                                                   |    .                 |            modification_time_present: true
0x6e|               d9 2d 34 65                                       |     .-4e             |          modification_time: 1697918425 (2023-10-21T20:00:25Z)
    |                                                                 |                      |        [1]{}: extra_field
0x6e|                           75 78                                 |         ux           |          tag: 0x7875 (UNIX UID/GID)
0x6e|                                 0b 00                           |           ..         |          size: 11
0x6e|                                       01 04 50 c3 00 00 04 50 c3|             ..P....P.|          data: raw bits
0x84|00 00                                                            |..                    |
    |                                                                 |                      |      file_comment: ""
    |                                                                 |                      |  end_of_central_directory_record{}:
0x84|      50 4b 05 06                                                |  PK..                |    signature: raw bits (valid)
0x84|                  00 00                                          |      ..              |    disk_nr: 0
0x84|                        00 00                                    |        ..            |    central_directory_start_disk_nr: 0
0x84|                              01 00                              |          ..          |    nr_of_central_directory_records_on_disk: 1
0x84|                                    01 00                        |            ..        |    nr_of_central_directory_records: 1
0x84|                                          49 00 00 00            |              I...    |    size_of_central_directory: 73
0x84|                                                      3d 00 00 00|                  =...|    offset_of_start_of_central_directory: 61
0x9a|00 00|                                                           |..|                   |    comment_length: 0
    |                                                                 |                      |    comment: ""
[machine ~]$ unzip -lv file.zip
Archive:  file.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Stored        0   0% 2023-10-21 22:00 00000000  foo
--------          -------  ---                            -------
       0                0   0%                            1 file
[machine ~]$ bsdtar tvvf file.zip
-rw-r--r--  0 50000  50000       0 Oct 21 22:00 foo
Archive Format: ZIP 1.0 (uncompressed),  Compression: none
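For reference, the extended timestamp extra field (tag 0x5455) in the dump above can be parsed with a few lines of Python. This is a sketch, not fq's implementation: the function name `parse_extended_timestamp` is made up for the example, and reading the times as unsigned 32-bit values is an assumption. The flag bits follow the order shown in the dump (bit 0 modification, bit 1 access, bit 2 creation). In the central directory copy the flags still say access time is present, but only the modification time is stored, so the sketch stops when the data runs out.

```python
import struct
from datetime import datetime, timezone


def parse_extended_timestamp(field: bytes) -> dict:
    """Parse one 'UT' extra field: tag, size, flags byte, then 4-byte LE times."""
    tag, size = struct.unpack_from("<HH", field, 0)
    if tag != 0x5455:
        raise ValueError("not an extended timestamp field")
    data = field[4:4 + size]
    flags = data[0]
    out = {"flags": flags}
    offset = 1
    for bit, name in (
        (0x01, "modification_time"),
        (0x02, "access_time"),
        (0x04, "creation_time"),
    ):
        # Only read a time if its flag is set AND the bytes are actually there.
        if flags & bit and offset + 4 <= len(data):
            out[name] = struct.unpack_from("<I", data, offset)[0]
            offset += 4
    return out


# Local file header copy (size 9) and central directory copy (size 5).
local = parse_extended_timestamp(bytes.fromhex("5554090003d92d3465d92d3465"))
central = parse_extended_timestamp(bytes.fromhex("5554050003d92d3465"))

# These values are real Unix timestamps, so rendering them as UTC is well defined.
mtime_utc = datetime.fromtimestamp(local["modification_time"], tz=timezone.utc)
print(local, central, mtime_utc.isoformat())
```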

wader commented 9 months ago

Good catch. As I understand it, the last modification field that uses the MS-DOS format (not the extended timestamp) has no timezone, i.e. it is local time? So I changed it to remove the "Z".

I also changed it so that the second field is now shown multiplied by 2. Is that less confusing?

TomiBelan commented 9 months ago

Removing Z sounds good to me. This page also says MSDOS times don't have a timezone: http://fileformats.archiveteam.org/wiki/MS-DOS_date/time

About raw seconds or *2, what does fq do in other such cases? For what it's worth, year is also year+1980. IMHO showing raw_second: 13, raw_year: 43, unix: ... (2023-10-21T22:00:26) would be a good solution.

Currently the output is:

    |                                               |                |      last_modification{}:
0x40|                           0d b0               |         ..     |        fat_time: 45069
    |                                               |                |        second: 26
    |                                               |                |        minute: 0
    |                                               |                |        hour: 22
0x40|                                 55 57         |           UW   |        fat_date: 22357
    |                                               |                |        day: 21
    |                                               |                |        month: 10
    |                                               |                |        year: 43
    |                                               |                |        unix: 1697925626 (2023-10-21T22:00:26)
...
    |                                               |                |      extra_fields[0:2]:
...
0x70|         d9 2d 34 65                           |   .-4e         |          modification_time: 1697918425 (2023-10-21T20:00:25Z)

I feel I'm nitpicking really minor details at this point, but 1697925626 is suboptimal. It should probably be unix: 1697918426 (2023-10-21T22:00:26). Here 2023-10-21T22:00:26 is literally from the zip and 1697918426 is a best guess assuming local timezone. Or not show the unix timestamp guesstimate at all, only the (...). But this is just a nitpick, the current version is also good enough for me.
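The two candidate numbers can be reproduced with a short Python sketch. It takes the wall-clock time stored in the zip and converts it to a Unix timestamp twice: once pretending the time is UTC, and once assuming the +02:00 offset that `stat` reported on the machine that created the archive. The fixed offset is an assumption for this example only; the zip itself does not record it.

```python
import calendar
from datetime import datetime, timedelta, timezone

# Wall-clock time as stored in the MS-DOS fields; it carries no timezone.
naive = datetime(2023, 10, 21, 22, 0, 26)

# Interpretation 1: treat the wall-clock time as if it were UTC.
as_utc = calendar.timegm(naive.timetuple())

# Interpretation 2: assume the creator's local offset was +02:00.
plus_two = timezone(timedelta(hours=2))
as_local = int(naive.replace(tzinfo=plus_two).timestamp())

print(as_utc)             # 1697925626
print(as_local)           # 1697918426
print(as_utc - as_local)  # 7200, i.e. the two-hour offset
```

The second value is one second after the extended timestamp 1697918425, which is consistent with the MS-DOS format storing seconds in 2-second units.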

wader commented 9 months ago

Removing Z sounds good to me. This page also says MSDOS times don't have a timezone: http://fileformats.archiveteam.org/wiki/MS-DOS_date/time

About raw seconds or *2, what does fq do in other such cases? For what it's worth, year is also year+1980. IMHO showing raw_second: 13, raw_year: 43, unix: ... (2023-10-21T22:00:26) would be a good solution.

It's mostly up to a format decoder how to "model" things. Each field (the thing in the tree that has a name) is tied to a "decode value" that consists of an optional backing bit buffer and range (without one, the value is "synthetic"), an actual value, an optional symbolic value and a description. The symbolic value is the default value if set, otherwise actual is used in expressions etc. One can use toactual to access the actual value (there is also tosym, but it is less useful).

So in this case i can see some alternatives: (using year as example)

year: synthetic with actual value 2000
raw_year: synthetic with actual value 20

Or

year: synthetic with actual value 20 and symbolic value 2000

If the MS-DOS timestamp were not little-endian with non-byte-aligned bit ranges (month and minute are not contiguous bit ranges, I think?), I would probably have modeled year as actual 20 and symbolic 2000. Maybe I should do that even when it is synthetic, then?
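To make the two alternatives concrete, here is a toy Python model of the "decode value" idea described above. It is only a sketch of the concept and not fq's actual Go types: the class `DecodeValue`, its field names, and the `display` method are all hypothetical. The `<sym> (<actual>)` rendering mirrors lines such as `compression_method: "none" (0)` in the dump earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DecodeValue:
    """Hypothetical stand-in for a decode value; not fq's real API."""
    actual: int
    sym: Optional[int] = None                    # optional symbolic value
    description: Optional[str] = None
    bit_range: Optional[Tuple[int, int]] = None  # None means "synthetic"

    @property
    def value(self) -> int:
        # The symbolic value is the default if set, otherwise actual is used.
        return self.sym if self.sym is not None else self.actual

    def to_actual(self) -> int:
        # Loosely analogous to the toactual function mentioned above.
        return self.actual

    def display(self, name: str) -> str:
        if self.sym is not None:
            return f"{name}: {self.sym} ({self.actual})"
        return f"{name}: {self.actual}"


# Alternative 1: two synthetic fields, each with only an actual value.
year_a = DecodeValue(actual=2000)
raw_year_a = DecodeValue(actual=20)

# Alternative 2: one synthetic field with actual 20 and symbolic 2000.
year_b = DecodeValue(actual=20, sym=2000)

print(year_a.display("year"), "|", raw_year_a.display("raw_year"))
print(year_b.display("year"))
```

With the second alternative a single field answers both questions: expressions see 2000 by default, and the stored 20 stays reachable through the actual value.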

Currently the output is:

    |                                               |                |      last_modification{}:
0x40|                           0d b0               |         ..     |        fat_time: 45069
    |                                               |                |        second: 26
    |                                               |                |        minute: 0
    |                                               |                |        hour: 22
0x40|                                 55 57         |           UW   |        fat_date: 22357
    |                                               |                |        day: 21
    |                                               |                |        month: 10
    |                                               |                |        year: 43
    |                                               |                |        unix: 1697925626 (2023-10-21T22:00:26)
...
    |                                               |                |      extra_fields[0:2]:
...
0x70|         d9 2d 34 65                           |   .-4e         |          modification_time: 1697918425 (2023-10-21T20:00:25Z)

I feel I'm nitpicking really minor details at this point, but 1697925626 is suboptimal. It should probably be unix: 1697918426 (2023-10-21T22:00:26). Here 2023-10-21T22:00:26 is literally from the zip and 1697918426 is a best guess assuming local timezone. Or not show the unix timestamp guesstimate at all, only the (...). But this is just a nitpick, the current version is also good enough for me.

No worries! Glad there is someone else to discuss this with; quite a lot of the time spent writing decoders is actually debating with oneself about such things :)

"Guess" here means using the locally configured timezone where fq is running? Yeah, that feels a bit shaky; I have a feeling it's good to keep the output from depending on such things. So the options are to leave it as is, or to clearly indicate the assumed timezone somehow (UTC?)? Rename it to unix_utc or unix_guess etc.?

What did you mean by "only the (...)" btw?

TomiBelan commented 9 months ago

Hmm, does that mean you can show "year: 23 (2023)"? But having both year and raw_year sounds good too.

Oh sorry, by "only the (...)" I meant remove the timestamp number and only keep the part that is currently shown between ( and ). I.e. (2023-10-21T22:00:26)

wader commented 9 months ago

Hmm, does that mean you can show "year: 23 (2023)"? But having both year and raw_year sounds good too.

Yep, but it would probably be the other way around: year: 2003 (23). If symbolic is set, it's displayed as <sym> (<actual>). Let's try that.

Oh sorry, by "only the (...)" I meant remove the timestamp number and only keep the part that is currently shown between ( and ). I.e. (2023-10-21T22:00:26)

Aha, I see. The problem is that (2023-10-21T22:00:26) is the description, so as it works now there has to be some kind of field to tie it to.

wader commented 9 months ago

Added a description to unix_guess and added some docs about time zones. Looks ok?

TomiBelan commented 8 months ago

Looks ok.

My main disagreement is that in the grand question of "use UTC (thus, behave differently than what unzip programs would do) or use local time (thus, fq output depends on user's timezone)?" I'm still not so sure I'd pick UTC... but I see the arguments in favor, and the current output is a good realization of that choice.

wader commented 8 months ago

Thanks for reviewing. I'm trying to convince myself whether it's a good or bad idea for the output to depend on local settings. It feels somehow safer and less surprising if it does not?

What would you pick instead of UTC?

TomiBelan commented 8 months ago

Either UTC, or local, or nothing. Disadvantage of UTC: doesn't match what you'd get if you unzip the zip and stat the extracted file. Disadvantage of local: fq output depends on user's local timezone, this might be unexpected and needs to be handled in tests. Disadvantage of nothing: "the (2023-10-21T22:00:26) is the description so as it works now there has to be some kind of field to tie it to."

UTC is fine. Ship it.

wader commented 8 months ago

Agree! Thanks for taking your time.