libarchive / libarchive

Multi-format archive and compression library
http://www.libarchive.org

Doesn't understand CSRG ISOs? #2232

Closed: nabijaczleweli closed this issue 3 months ago

nabijaczleweli commented 3 months ago

bsdtar -tf on https://archive.org/details/The_CSRG_Archives_CD-ROM_1_August_1998_Marshall_Kirk_McKusick (and 2, 3, and 4) simply returns empty, with a strace of

access("/dev/st0", F_OK)                = -1 ENOENT (No such file or directory)
brk(0x55d43023b000)                     = 0x55d43023b000
openat(AT_FDCWD, "/home/nabijaczleweli/store/BSD/The_CSRG_Archives_CD-ROM_1_August_1998_Marshall_Kirk_McKusick/The CSRG Archives CD-ROM 1 (August 1998) (Marshall Kirk McKusick).ISO", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=668307456, ...}, AT_EMPTY_PATH) = 0
brk(0x55d43025f000)                     = 0x55d43025f000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
lseek(3, 0, SEEK_END)                   = 668307456
lseek(3, 668307456, SEEK_SET)           = 668307456
lseek(3, 0, SEEK_END)                   = 668307456
lseek(3, 668291072, SEEK_SET)           = 668291072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 16384
lseek(3, 0, SEEK_END)                   = 668307456
lseek(3, 0, SEEK_SET)                   = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

This happens on 3.6.2-1, 3.7.2-2.1, and trunk (6ee1eebefdf41f36ef1a548c9a7000d132c453f3).

Other ISOs do work. This is distinct from #954 (i.e. I applied that diff and it still happens).

nabijaczleweli commented 3 months ago

behold, my amazing debugging methodology:

diff --git a/libarchive/archive_read_support_format_iso9660.c b/libarchive/archive_read_support_format_iso9660.c
index 25ab11b..93f0ad6 100644
--- a/libarchive/archive_read_support_format_iso9660.c
+++ b/libarchive/archive_read_support_format_iso9660.c
@@ -527,10 +527,10 @@ archive_read_format_iso9660_bid(struct archive_read *a, int best_bid)
            bytes_read -= LOGICAL_BLOCK_SIZE, p += LOGICAL_BLOCK_SIZE) {
                /* Do not handle undefined Volume Descriptor Type. */
                if (p[0] >= 4 && p[0] <= 254)
-                       return (0);
+                       {fprintf(stderr, "return 0: 1\n"); return (0);}
                /* Standard Identifier must be "CD001" */
                if (memcmp(p + 1, "CD001", 5) != 0)
-                       return (0);
+                       {fprintf(stderr, "return 0: 2\n"); return (0);}
                if (isPVD(iso9660, p))
                        continue;
                if (!iso9660->joliet.location) {
@@ -549,7 +549,7 @@ archive_read_format_iso9660_bid(struct archive_read *a, int best_bid)
                        seenTerminator = 1;
                        break;
                }
-               return (0);
+               {fprintf(stderr, "return 0: 3\n"); return (0);}
        }
        /*
         * ISO 9660 format must have Primary Volume Descriptor and

and further


static int
isVDSetTerminator(struct iso9660 *iso9660, const unsigned char *h)
{
    (void)iso9660; /* UNUSED */

    /* Type of the Volume Descriptor Set Terminator must be 255. */
    if (h[0] != 255)
        {fprintf(stderr, "isVDSetTerminator 1: %o\n", h[0]); return (0);}

    /* Volume Descriptor Version must be 1. */
    if (h[6] != 1)
        {fprintf(stderr, "isVDSetTerminator 2\n"); return (0);}

    return (1);
}

the CSRG ISO hits return 0: 3 due to isVDSetTerminator 1: 1, so this is a PVD but doesn't get caught by isPVD.

This appears to be because isPVD: (p[DR_length_offset] != 34) = 1, and indeed this is 68, so the root directory is twice as large as you're allowing it to be?

nabijaczleweli commented 3 months ago

Changing that to if (p[DR_length_offset] != 34 && p[DR_length_offset] != 68) fixes it :)

nabijaczleweli commented 3 months ago

Okay, this blames to 2af7b5f12003509f6c0ca93066b9209ebcb67883 (Improve mixed Joliet and Rock Ridge extentions. [...] Fix reading the root directory; it did not read Rock Ridge extentions of the one.) which does

-       /* Store the root directory in the pending list. */
-       file = parse_file_info(iso9660, NULL, h + PVD_root_directory_record_offset);
-       add_entry(iso9660, file);
+       /* Read Root Directory Record in Volume Descriptor. */
+       p = h + PVD_root_directory_record_offset;
+       if (p[DR_length_offset] != 34)
+               return (0);
+       iso9660->primary.sector_number = archive_le32dec(p + DR_extent_offset);
+       iso9660->primary.block_size = archive_le32dec(p + DR_size_offset);

it's unclear to me where the 34 comes from.

kientzle commented 3 months ago

Hmmm.... ECMA 119 (which is the same as ISO 9660) clearly specifies that the Directory Record for the Root Directory is a 34-byte field at byte position 157 (counting from 1, i.e. offset 156 from the start of the descriptor), immediately followed by the Volume Set Identifier. These fields have the same size and offset in both the Primary Volume Descriptor and the Supplementary Volume Descriptor.

ECMA 119 is freely available here: https://ecma-international.org/publications-and-standards/standards/ecma-119/ You can look at Table 4 on page 24 to see the layout of the PVD.

Question: Is there a comment field in the first one or two kilobytes of the CD-ROM that indicates what software was used to create it? That might help.

kientzle commented 3 months ago

Another question: Can you print out the bytes of the Root Directory Record? (All 68 of them.)

kientzle commented 3 months ago

ECMA 119 Section 9.1 specifies the layout of the Directory Record for the Root Directory:

Offset  Length  Description
0       1       Length of directory record
1       1       Extended Attribute Record Length
2       8       Location of Extent
10      8       Data Length
18      7       Date and time
25      1       File flags
26      1       File Unit Size
27      1       Interleave gap size
28      4       Volume Sequence Number
32      1       Length of file identifier (LEN_FI)
33      LEN_FI  File Identifier

For a Root Directory, the File Identifier is a single 00 byte, so the total record should be exactly 34 bytes. ECMA 119 does allow for a "System Use" area after that, but "Its content is not specified by this standard."

Clarified in section C.3.6:

The Directory Identifier of a Directory Record describing the Root Directory shall consist of a single (00) byte.
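The 34 can be reproduced from the table above: 33 fixed bytes, plus LEN_FI identifier bytes, padded to an even total. Below is a minimal C sketch of decoding those fields. The names `struct dir_record`, `parse_dir_record` and `expected_record_length` are made up for illustration and are not libarchive's own. The sketch reads only the little-endian half of each both-endian field and ignores any System Use area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration; not a libarchive type. */
struct dir_record {
    uint8_t  length;          /* offset 0: length of directory record */
    uint8_t  ext_attr_length; /* offset 1 */
    uint32_t extent;          /* offset 2: little-endian copy */
    uint32_t data_length;     /* offset 10: little-endian copy */
    uint8_t  flags;           /* offset 25 */
    uint16_t volume_seq;      /* offset 28: little-endian copy */
    uint8_t  len_fi;          /* offset 32 */
};

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the fixed part of a directory record.
 * Returns 0 on success, -1 if fewer than 34 bytes are available. */
static int parse_dir_record(const unsigned char *p, size_t avail,
    struct dir_record *out)
{
    assert(p != NULL && out != NULL);
    if (avail < 34)
        return -1;
    out->length = p[0];
    out->ext_attr_length = p[1];
    out->extent = le32(p + 2);
    out->data_length = le32(p + 10);
    out->flags = p[25];
    out->volume_seq = (uint16_t)(p[28] | (p[29] << 8));
    out->len_fi = p[32];
    return 0;
}

/* 33 fixed bytes + identifier, padded to an even length
 * (System Use area not counted). */
static size_t expected_record_length(uint8_t len_fi)
{
    size_t n = 33u + len_fi;
    return n + (n & 1u);
}
```

For a root directory, LEN_FI is 1, so `expected_record_length(1)` gives 34, which is the value the existing check compares against.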

Hmmm.... I just realized that 68 is the ASCII D character. I bet that block isn't in fact a PVD. This disk is organized differently than libarchive expects and the PVD is somewhere else, perhaps? A hex/ascii dump of that block should be quite illuminating.

nabijaczleweli commented 3 months ago

Here's the whole thing until it looks like data starts.

000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
*
008000 01 43 44 30 30 31 01 00 4c 49 4e 55 58 00 20 20  >.CD001..LINUX.  <
008010 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
008020 20 20 20 20 20 20 20 20 43 44 52 4f 4d 20 20 20  >        CDROM   <
008030 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
008040 20 20 20 20 20 20 20 20 00 00 00 00 00 00 00 00  >        ........<
008050 1c fa 04 00 00 04 fa 1c 00 00 00 00 00 00 00 00  >................<
008060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
008070 00 00 00 00 00 00 00 00 01 00 00 01 01 00 00 01  >................<
008080 00 08 08 00 76 07 01 00 00 01 07 76 14 00 00 00  >....v......v....<
008090 36 00 00 00 00 00 00 58 00 00 00 7a 44 00 9c 00  >6......X...zD...<
0080a0 00 00 00 00 00 9c 00 10 00 00 00 00 10 00 5f 08  >.............._.<
0080b0 0e 0d 10 21 1c 02 00 00 01 00 00 01 01 00 20 20  >...!..........  <
0080c0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
*
008320 20 20 20 20 20 20 20 20 20 20 20 20 20 31 39 39  >             199<
008330 35 30 38 31 34 31 33 31 36 33 33 30 30 20 31 39  >5081413163300 19<
008340 39 35 30 38 31 34 31 33 31 36 33 33 30 30 20 30  >95081413163300 0<
008350 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 20  >000000000000000 <
008360 31 39 39 35 30 38 31 34 31 33 31 36 33 33 30 30  >1995081413163300<
008370 20 01 00 20 20 20 20 20 20 20 20 20 20 20 20 20  > ..             <
008380 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
*
008570 20 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00  >   .............<
008580 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
*
008800 01 43 44 30 30 31 01 00 4c 49 4e 55 58 00 20 20  >.CD001..LINUX.  <
008810 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
008820 20 20 20 20 20 20 20 20 43 44 52 4f 4d 20 20 20  >        CDROM   <
008830 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
008840 20 20 20 20 20 20 20 20 00 00 00 00 00 00 00 00  >        ........<
008850 1c fa 04 00 00 04 fa 1c 00 00 00 00 00 00 00 00  >................<
008860 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
008870 00 00 00 00 00 00 00 00 01 00 00 01 01 00 00 01  >................<
008880 00 08 08 00 76 07 01 00 00 01 07 76 14 00 00 00  >....v......v....<
008890 36 00 00 00 00 00 00 58 00 00 00 7a 44 00 9c 00  >6......X...zD...<
0088a0 00 00 00 00 00 9c 00 10 00 00 00 00 10 00 5f 08  >.............._.<
0088b0 0e 0d 10 21 1c 02 00 00 01 00 00 01 01 00 20 20  >...!..........  <
0088c0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
*
008b20 20 20 20 20 20 20 20 20 20 20 20 20 20 31 39 39  >             199<
008b30 35 30 38 31 34 31 33 31 36 33 33 30 30 20 31 39  >5081413163300 19<
008b40 39 35 30 38 31 34 31 33 31 36 33 33 30 30 20 30  >95081413163300 0<
008b50 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 20  >000000000000000 <
008b60 31 39 39 35 30 38 31 34 31 33 31 36 33 33 30 30  >1995081413163300<
008b70 20 01 00 20 20 20 20 20 20 20 20 20 20 20 20 20  > ..             <
008b80 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
*
008d70 20 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00  >   .............<
008d80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
*
009000 ff 43 44 30 30 31 01 00 00 00 00 00 00 00 00 00  >.CD001..........<
009010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
*
009800 ff 43 44 30 30 31 01 00 00 00 00 00 00 00 00 00  >.CD001..........<
009810 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
*
00a000 01 00 9c 00 00 00 01 00 00 00 04 00 a4 01 00 00  >................<
00a010 01 00 31 42 53 44 04 00 93 06 00 00 01 00 32 2e  >..1BSD........2.<
00a020 31 30 04 00 9e 15 00 00 01 00 32 2e 37 39 03 00  >10........2.79..<
00a030 68 16 00 00 01 00 32 2e 38 00 03 00 bc 03 00 00  >h.....2.8.......<
00a040 01 00 32 2e 39 00 05 00 88 17 00 00 01 00 32 2e  >..2.9.........2.<
00a050 39 50 55 00 04 00 c8 01 00 00 01 00 32 42 53 44  >9PU.........2BSD<
00a060 04 00 28 02 00 00 01 00 33 42 53 44 03 00 35 0a  >..(.....3BSD..5.<
00a070 00 00 01 00 34 2e 30 00 05 00 9a 24 00 00 01 00  >....4.0....$....<
00a080 34 2e 30 30 30 00 05 00 f2 0e 00 00 01 00 34 2e  >4.000.........4.<
00a090 30 30 31 00 03 00 81 0c 00 00 01 00 34 2e 31 00  >001.........4.1.<
00a0a0 04 00 90 11 00 00 01 00 34 2e 31 41 04 00 92 12  >........4.1A....<
00a0b0 00 00 01 00 34 2e 31 43 03 00 7f 19 00 00 01 00  >....4.1C........<
00a0c0 34 2e 32 00 05 00 3b 1d 00 00 01 00 34 2e 32 42  >4.2...;.....4.2B<
00a0d0 55 00 03 00 70 1d 00 00 01 00 34 2e 33 00 0a 00  >U...p.....4.3...<
00a0e0

Doesn't look like any branding.

for(int x = 0; x < 68; ++x) fprintf(stderr, "%02hhx", ((const unsigned char *)p)[x]); yields

000000 44 00 9c 00 00 00 00 00 00 9c 00 10 00 00 00 00  >D...............<
000010 10 00 5f 08 0e 0d 10 21 1c 02 00 00 01 00 00 01  >.._....!........<
000020 01 00 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >..              <
000030 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
000040 20 20 20 20                                      >    <
000044

This matches blocks starting on lines 008090 and 008890 in the full dump (D...<).

If you neuter these checks (and thus let it read it as-if it were a PVD?) libarchive reads this correctly. idk

kientzle commented 3 months ago

Thank you! I think that gives us enough information to come up with a reasonable way to resolve this...

I don't see anything that suggests what software wrote this, unfortunately. We can guess it was written on a Linux system in August 1995, but that's not a lot to go on. If we could find the software that wrote it and it happened to be open-source, maybe we could look at the code that wrote this PVD. Maybe there are comments to explain why they did it the way they did? For example, the current ECMA 119 clearly requires the root directory descriptor to be 34 bytes, but maybe there was an early pre-standard draft that had a slightly different layout? Maybe some later extension allowed for a different root directory size? Wikipedia points out that the original High Sierra specification "was adopted in December 1986 (with changes) as ... ECMA-119." I wonder if there's a copy of the original High Sierra spec around somewhere? It would be interesting to know what those "changes" were. Maybe High Sierra had a 68-byte root directory descriptor; that would suggest your idea of accepting either 34 or 68 is a good approach.

It's been a long time since I studied this part of the spec; I'd have to go over it again to be sure of the following, but here's what I think is happening here: it looks like there are two volume descriptors (VDs), one starting at 8000, the other at 8800. They look identical to me on a quick scan, so I presume these are duplicate Primary VDs. That means the 68 bytes you dumped really are intended to be a root directory descriptor.
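To make the descriptor layout in the dump easier to follow, here is a rough C sketch that walks the volume descriptor set of an in-memory image: one 2048-byte descriptor per sector starting at sector 16 (0x8000), each carrying a type byte followed by "CD001". The function name `list_volume_descriptors` and both macros are hypothetical. This is not the logic of `archive_read_format_iso9660_bid`, only a simplified illustration of the same walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 2048u          /* logical sector size assumed here */
#define SYSTEM_AREA_SECTORS 16u    /* descriptors start at 16 * 2048 = 0x8000 */

/* Record the type byte of each volume descriptor, stopping after the first
 * set terminator (type 255) or at the first sector without "CD001".
 * Returns the number of descriptors recorded. */
static size_t list_volume_descriptors(const unsigned char *img, size_t len,
    unsigned char *types, size_t max_types)
{
    size_t off = SYSTEM_AREA_SECTORS * SECTOR_SIZE;
    size_t n = 0;

    assert(img != NULL && types != NULL);
    while (n < max_types && len >= SECTOR_SIZE && off <= len - SECTOR_SIZE) {
        const unsigned char *vd = img + off;

        if (memcmp(vd + 1, "CD001", 5) != 0)
            break;              /* not a volume descriptor */
        types[n++] = vd[0];
        if (vd[0] == 255)
            break;              /* volume descriptor set terminator */
        off += SECTOR_SIZE;
    }
    return n;
}
```

On an image laid out like the dump (type 1 at 0x8000 and 0x8800, type 255 at 0x9000 and 0x9800), this walk would report types 1, 1, 255 and stop at the first terminator.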

To be clear: I don't want to remove those checks; I want to improve them. If we're going to make the length check weaker, I'd like to add other checks so that we can be confident we're not accepting something that really is not an ISO9660 image. Looking at these 68 bytes and the spec I quoted above, what about it makes us confident that it really is a root directory descriptor? I see a few obvious things:

Is there anything else here we could verify to be sure this really is what we think it is? File flags? Date/time? We might need to dig through the spec a little more, but I think we can improve this quite a bit.
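As one possible answer, the fields in the 68-byte dump can be cross-checked against each other. The C sketch below is only an illustration of that idea and is not necessarily what the eventual fix does: `root_record_looks_plausible` is a hypothetical name, and the choice of checks (both-endian copies agree, directory flag set, one-byte 00 identifier, sane date fields) is an assumption. The byte array is copied from the dump earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 68 bytes dumped from the CSRG image's PVD root directory record. */
static const unsigned char csrg_root_record[68] = {
    0x44, 0x00, 0x9c, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x9c, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00,
    0x10, 0x00, 0x5f, 0x08, 0x0e, 0x0d, 0x10, 0x21,
    0x1c, 0x02, 0x00, 0x00, 0x01, 0x00, 0x00, 0x01,
    0x01, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20
};

static uint32_t rd_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t rd_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
        ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 if the record passes every check, 0 otherwise. */
static int root_record_looks_plausible(const unsigned char *p, size_t avail)
{
    assert(p != NULL);
    if (avail < 34 || p[0] < 34)
        return 0;                       /* too short to be a root record */
    if (rd_le32(p + 2) != rd_be32(p + 6))
        return 0;                       /* extent: LE and BE copies differ */
    if (rd_le32(p + 10) != rd_be32(p + 14))
        return 0;                       /* data length: copies differ */
    if (p[28] != p[31] || p[29] != p[30])
        return 0;                       /* volume sequence: copies differ */
    if ((p[25] & 0x02) == 0)
        return 0;                       /* directory flag not set */
    if (p[32] != 1 || p[33] != 0)
        return 0;                       /* root identifier must be one 00 byte */
    if (p[19] < 1 || p[19] > 12 || p[20] < 1 || p[20] > 31)
        return 0;                       /* month / day out of range */
    if (p[21] > 23 || p[22] > 59 || p[23] > 59)
        return 0;                       /* hour / minute / second out of range */
    return 1;
}
```

The dumped record passes all of these: extent 156 and data length 4096 in both byte orders, flags 02, LEN_FI 1 with a 00 identifier, and a date of 1995-08-14 13:16:33. Only the length byte (68) disagrees with the standard, and the 34 bytes after the identifier are all spaces.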

kientzle commented 3 months ago

So much for that theory ... Harumph.

I found an old article detailing the ISO9660 and High Sierra volume descriptor layout. According to this, the High Sierra Format also had a 34-byte directory entry, but located at an entirely different offset within the volume descriptor. (Which suggests that it could be an interesting exercise to add High Sierra support to libarchive someday. :grin:)

nabijaczleweli commented 3 months ago

This is CD 1 of https://www.mckusick.com/csrg/, and the disc this is an image of is printed with "August 1998" (McKusick also cites 1998 as first release), and ripped (https://archive.org/details/The_CSRG_Archives_CD-ROM_1_August_1998_Marshall_Kirk_McKusick) probably around 2017-07-08 (IA addeddate), by Jason Scott of textfiles fame.

The top-level directories are, indeed, dated 1992-1995, so this is probably the date when collection or mastering started. .MAP, the latest file, is dated 1995-08-14, which I'm assuming is the date of the final master? (This seems to match "1995081413163300" in the dump, so probably.)

McKusick is AFAIK still alive. Scott is very much still alive. It doesn't look like either of them wrote what they used but they probably respond to e-mail. (That said, I personally don't really know who's to blame for the fucked-up bits. If an ISO is a binary-equivalent image of the disc (is it? does it depend? I don't know) then it's gonna be McKusick, and sending him an e-mail of "hey OG, what did you master the CSRG ISOs in/with?" will probably get me or you a response.) Given that McKusick is The BSD Guy, I'd be surprised if he mastered this on Linux. But maybe!

I agree with your assessment of what the data seems to be. I'd say that in this case what makes us sure that this is the root directory is that it has the contents of the root directory, but idk how to prove this analytically.

OTOH, xorriso seems to accept this format. This is either because it doesn't care or because it does a similar analysis.

kientzle commented 3 months ago

Take a look at #2238. I've verified that this lets us read the CSRG ISO image from archive.org. I've added a test case to make sure this works forever, and I've also inserted additional checks to help make sure the ISO reader doesn't try to read something that's not really an ISO image.

M-a-r-k commented 2 months ago

> I found an old article detailing the ISO9660 and High Sierra volume descriptor layout. According to this, the High Sierra Format also had a 34-byte directory entry, but located at an entirely different offset within the volume descriptor. (Which suggests that it could be an interesting exercise to add High Sierra support to libarchive someday. 😁)

It looks like the High Sierra specification is available via https://www.os2museum.com/wp/looking-for-high-sierra/

kientzle commented 2 months ago

Thank you for tracking that down. Anyone want to try implementing it? Looks like it's a matter of looking for "CDROM" or "CD001" as a signature and then just using different offsets for everything from there down.

I presume there's still software around that knows how to construct High Sierra images -- that will be essential for building appropriate test cases. Just to be clear, I do not think we should try adding write support for High Sierra format to libarchive. And if there's no software that writes it anymore, that might be an argument for not bothering even with read support.
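A rough C sketch of the signature test described above. It assumes the commonly documented positions: "CD001" at offset 1 of an ISO 9660 volume descriptor, and "CDROM" at offset 9 of a High Sierra one (after an 8-byte block number and the type byte). The High Sierra offset is an assumption here and should be checked against the specification before relying on it. `enum vd_flavor` and `classify_volume_descriptor` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum vd_flavor { VD_UNKNOWN = 0, VD_ISO9660, VD_HIGH_SIERRA };

/* Classify a volume descriptor by where its standard identifier sits.
 * Needs at least 15 bytes so both candidate signatures can be compared. */
static enum vd_flavor classify_volume_descriptor(const unsigned char *vd,
    size_t avail)
{
    assert(vd != NULL);
    if (avail < 15)
        return VD_UNKNOWN;
    if (memcmp(vd + 1, "CD001", 5) == 0)
        return VD_ISO9660;          /* type byte, then "CD001" */
    if (memcmp(vd + 9, "CDROM", 5) == 0)
        return VD_HIGH_SIERRA;      /* assumed High Sierra layout */
    return VD_UNKNOWN;
}
```

The CSRG descriptor from the dump also contains the string "CDROM", but at offset 40, where it is the volume identifier; with "CD001" at offset 1 it is classified as ISO 9660 by this sketch.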