w3c / png

Maintenance of the PNG specification
https://w3c.github.io/png/

Multithreading PNG decoding, "parallel" PNGs #54

Open richgel999 opened 2 years ago

richgel999 commented 2 years ago

It looks possible to write completely standard PNG files that can be optionally decompressed across multiple threads: https://twitter.com/richgel999/status/1470846966276018187

To do this, you would use multiple IDATs, force a flush to a byte boundary at the end of each IDAT using a zero- or one-byte uncompressed Deflate block (when necessary; about 1/8th of the time the stream will already be byte-aligned), and make sure the LZ matches never cross IDAT block boundaries. A special standard ancillary extension chunk (which would be ignored by existing readers) would hint to new decoders that the file can be decoded in parallel.
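As a rough sketch of the flush trick in Python (assuming the scanline data is pre-split into pieces; `compress_pieces` is a hypothetical helper, not part of any proposal):

```python
import zlib

def compress_pieces(pieces):
    """Compress pieces into one zlib stream whose boundaries are
    byte-aligned and independently decompressible; returns one
    element per IDAT chunk."""
    comp = zlib.compressobj(9)
    out = []
    for piece in pieces[:-1]:
        data = comp.compress(piece)
        # Z_FULL_FLUSH pads to a byte boundary with an empty stored
        # block and resets the LZ77 window, so no match crosses the
        # boundary.
        data += comp.flush(zlib.Z_FULL_FLUSH)
        out.append(data)
    # The last piece terminates the stream (final block + Adler-32).
    out.append(comp.compress(pieces[-1]) + comp.flush())
    return out
```

A worker can then start decompressing at any boundary with `zlib.decompressobj(wbits=-15)` (raw deflate), skipping the 2-byte zlib header on the first piece.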

The filtering can also be done in parallel with decompression, using a simple pipeline.

This would significantly reduce the wall-clock time involved in reading large (8k-32k) PNG files on modern machines.

DavidBuchanan314 commented 2 years ago

I've been thinking about this also: https://github.com/brion/mtpng/issues/20

My original idea was to have some metadata that points to offsets within the file, for each zlib sub-block. However, your idea of just using IDAT block boundaries could make more sense, and only requires one bit of additional metadata to signal.

An additional metadata flag could be used to signal that the first row of each block has filter type none (or other filter type that does not reference the previous row) - if this is known by the decoder, then it can perform defiltering in parallel too.

randy408 commented 2 years ago

I've had this idea for a while and have made my png-restart-marker repo public; it's kind of like JPEG restart markers for PNG. I will push what I have so far tomorrow.

> An additional metadata flag could be used to signal that the first row of each block has filter type none (or other filter type that does not reference the previous row) - if this is known by the decoder, then it can perform defiltering in parallel too.

If each segment starts on a scanline, IDAT, and DEFLATE boundary, the filter byte will be the first thing in the segment. I would just exclude filter types 2, 3, 4 at the start of each segment; then each worker is truly independent.
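That check is cheap; a minimal sketch (filter codes per the PNG spec: 0 = None, 1 = Sub, 2 = Up, 3 = Average, 4 = Paeth):

```python
def check_segment_start(decompressed: bytes) -> None:
    """The filter byte of a segment's first scanline must not
    reference the previous row, or the segment cannot be
    defiltered independently."""
    first_filter = decompressed[0]
    if first_filter not in (0, 1):  # None and Sub use only this row
        raise ValueError("filter %d references the previous row; "
                         "fall back to linear decoding" % first_filter)
```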

richgel999 commented 2 years ago

> I've been thinking about this also: brion/mtpng#20
>
> My original idea was to have some metadata that points to offsets within the file, for each zlib sub-block. However, your idea of just using IDAT block boundaries could make more sense, and only requires one bit of additional metadata to signal.
>
> An additional metadata flag could be used to signal that the first row of each block has filter type none (or other filter type that does not reference the previous row) - if this is known by the decoder, then it can perform defiltering in parallel too.

Yes. The only spec change would be the definition of a new standard ancillary chunk type (which would be ignored by existing readers) that indicates which levels of parallelization the file supports: parallel IDAT decompression only, or parallel IDAT decompression plus parallel de-filtering. It could also indicate the number of scanlines present in each IDAT chunk.
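Writing such a chunk is straightforward; a sketch with a made-up chunk name `pHNT` and payload (illustrative only, not any actual proposal):

```python
import struct
import zlib

def write_chunk(out, name: bytes, data: bytes):
    """Emit one PNG chunk: 4-byte length, type, payload, CRC-32."""
    out.write(struct.pack(">I", len(data)) + name + data)
    out.write(struct.pack(">I", zlib.crc32(name + data)))

def write_parallel_hint(out, level: int, rows_per_piece: int):
    # Hypothetical payload: level 1 = parallel IDAT decompression
    # only, level 2 = decompression + defiltering; plus the number
    # of scanlines stored in each IDAT chunk.
    write_chunk(out, b"pHNT", struct.pack(">BI", level, rows_per_piece))
```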

This is so useful that my plan is to put together a quick prototype for 24bpp and 32bpp, specifically in order to benchmark it.

richgel999 commented 2 years ago

I think if we could agree on a standard chunk type that contains this metadata, multiple independent implementations could then be created with faster writing/reading. This would be extremely valuable (large PNGs, like 8k-16k, are now becoming common).

DavidBuchanan314 commented 2 years ago

One question: should the safe-to-copy bit be set in the chunk name? I suppose it shouldn't.

An unaware PNG processing tool may do things like combine or re-split IDAT chunks, which would break parallel decoders if they still tried to process the chunks in parallel.
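For reference, these properties live in bit 5 of each byte of the chunk name (lowercase = bit set, per the PNG spec), so the check is a one-liner:

```python
def safe_to_copy(chunk_name: bytes) -> bool:
    """Bit 5 of the 4th byte is the safe-to-copy flag."""
    return bool(chunk_name[3] & 0x20)

assert not safe_to_copy(b"IDAT")  # uppercase 'T': not safe to copy
assert safe_to_copy(b"tEXt")      # lowercase 't': safe to copy
```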

DavidBuchanan314 commented 2 years ago

I have created a proof-of-concept implementation of parallel encoding and decoding here: https://github.com/DavidBuchanan314/parallel-png-proposal , along with a proposal for how the metadata could be encoded.

richgel999 commented 2 years ago

Wow that was fast! Thank you. My implementation can use this chunk data.

> One question: should the safe-to-copy bit be set in the chunk name? I suppose it shouldn't. An unaware PNG processing tool may do things like combine or re-split IDAT chunks, which would break parallel decoders if they still tried to process the chunks in parallel.

Yeah, I'm thinking it shouldn't be set. According to the PNG spec, the IDAT boundaries have "no semantic significance", so a processing tool may recompress the data. It's the safe thing to do because we can't be sure what the processing tool is going to do with the IDAT data and/or filters.

richgel999 commented 2 years ago

I'm wondering if the chunk should have an array with offsets/lengths that point to each IDAT chunk? (Someone suggested this, not sure who/where.) It's not necessary because a PNG reader can just find them by scanning the file, but perhaps there is some value in this on very large images. Having to find all the chunks and dispatch the decode jobs is serial processing.

On the flip side, the IDAT offsets could be brittle in some way I can't think of, aren't strictly necessary, and introduce a concept into PNG that seems to go against the flow.
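For scale, the serial scan only touches chunk headers, so it is cheap even on large files; a sketch:

```python
import struct

def find_idat_chunks(png: bytes):
    """Return (offset, length) of every IDAT payload via one serial
    pass; reads 8-byte chunk headers and skips the payloads."""
    pos, found = 8, []  # skip the 8-byte PNG signature
    while pos + 8 <= len(png):
        length, ctype = struct.unpack_from(">I4s", png, pos)
        if ctype == b"IDAT":
            found.append((pos + 8, length))
        if ctype == b"IEND":
            break
        pos += 12 + length  # header (8) + payload + CRC (4)
    return found
```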

richgel999 commented 2 years ago

> A decoder should be able to determine the number of such pieces by calculating floor(image_height / piece_height). Each piece is stored in its own IDAT chunk.

In your proposal: shouldn't this be ceil(image_height / piece_height)? If the image height is 17 and the piece_height is 8, I would expect this to be 3 (not 2). (I filed a bug.)

DavidBuchanan314 commented 2 years ago

Hah, I spotted the bug around the same time as you did, I pushed a fix already.

> On the flip side, the IDAT offsets could be brittle in some way I can't think of, aren't strictly necessary, and introduce a concept into PNG that seems to go against the flow.

Yes, I've been having similar thoughts. In practice, I don't think it'd be any faster to have an offset table, although it would be marginally faster if you wanted to implement random-access (but I'm not sure anyone would really want that?)

DavidBuchanan314 commented 2 years ago

One argument for using an offset table is that you can have your IDAT chunks be arbitrary sizes - I know some encoders like to use chunks of a certain size, although I've never understood why exactly.

randy408 commented 2 years ago

I cleaned up and published the specification I had lying around, and also added a method to encode without an offset table, for maximum bikeshedding potential: https://github.com/libspng/png-restart-marker

DavidBuchanan314 commented 2 years ago

So far we have three potential approaches: Apple's iDOT chunk, the pLLD chunk from my parallel-png-proposal, and the mARK chunk from png-restart-marker.

(Note: here I use "piece" to refer to the subsections of the zlib stream, and/or slices of the image.)

mARK has fields that encode the "segmentation method" and "segmentation type", which makes it more future-proof in terms of implementing new strategies to parallelise decoding in the future.

It is my opinion that recording the file offsets of pieces (as in iDOT and mARK Type 0) is a bad idea. Knowing the piece offsets gives you a slight advantage, in that you can immediately dispatch work to all your decoder threads. However, for a decoder to be 100% sure that an offset is valid, you need to perform a lot of checks, otherwise you risk a maliciously crafted file being ambiguous (such as I described here). For example, you could have data that looks like a valid IDAT chunk contained entirely within another IDAT chunk. The only way to detect this case is to parse all the IDAT chunks sequentially from the beginning. (Which somewhat defeats the purpose of knowing the offsets beforehand, although this could be done in parallel with the actual decompression.)
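In code, the required validation amounts to cross-checking claimed offsets against that sequential walk (a sketch, reusing the `find_idat_chunks` helper sketched earlier):

```python
def offsets_are_valid(png: bytes, claimed_offsets) -> bool:
    """A claimed piece offset is only trustworthy if the sequential
    chunk walk saw a real IDAT payload start at that position."""
    real = {off for off, _ in find_idat_chunks(png)}
    return all(off in real for off in claimed_offsets)
```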

pLLD and mARK type 1 are slightly less flexible, because each piece must occupy exactly one IDAT chunk. I believe there are sometimes reasons to keep IDAT chunks small (although I'm not entirely sure what these reasons are), so there may be a desire to have one piece occupy multiple IDAT chunks. If this is a desirable feature, then I propose an alternative approach:

Record a table of IDAT chunk indices (either in absolute or relative terms).

The pLLD and/or mARK specifications could easily be amended to support this. I believe this approach would be the best way to maximise flexibility, and minimise parsing edge-cases.
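A sketch of how such indices would resolve, again reusing `find_idat_chunks`; since the scan is sequential, the indices are unambiguous by construction:

```python
def indices_to_offsets(png: bytes, piece_indices):
    """Resolve per-piece IDAT chunk *indices* (not byte offsets)
    after one sequential scan."""
    chunks = find_idat_chunks(png)
    return [chunks[i][0] for i in piece_indices]
```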

I believe @brion is looking into implementing parallel decoding in mtpng, and I would be interested to see the performance numbers, especially in terms of quantifying how much time it costs to initially iterate over the IDAT chunks if their offsets are not recorded.

lrosenthol commented 2 years ago

As we are currently working on an update to PNG, the timing for this is quite good...

My personal preference would be to avoid anything that includes offset tables for (a) the reasons that @DavidBuchanan314 mentioned concerning attacks but also (b) because of scenarios where assets are modified w/o changing their content (e.g., updated metadata) that could well change the byte offsets.

So while I haven't looked at the proposals in depth, I would tend to favor pLLD.

randy408 commented 2 years ago

I believe error handling will be the hardest part of this. Note that mARK type 1 and pLLD still have issues with pieces possibly being too long or too short, for which the only conformant behavior is to fall back to linear decoding.

I went into the details of this and have made the argument that any error during parallel decoding must be recoverable and should be continued with linear decoding: https://github.com/w3c/PNG-spec/issues/60#issuecomment-996674848.

One issue I have with pLLD is the "parallel defilterable" flag; it shouldn't exist.

If defiltering requires data from the previous piece (flag = 0), then a simple parallel decoder would have to wait for the previous piece to finish, defeating the purpose of parallel decoding. Not only would decompression have to finish on the previous piece, but defiltering too, since defiltering is a linear process.

An optimal decoder uses width * 2 bytes for defiltering plus some fixed overhead; you would have to allocate more memory to buffer the uncompressed piece just to handle this corner case. That would mean additional complexity for most implementations, which is undesirable, and worker threads would still have to wait for defiltering to finish on the previous piece.

mARK only allows filter type 0 (None) or 1 (Sub) for the first scanline, neither of which references the previous piece; this makes decoding a lot simpler.

To implement a decoder for it you only have to create n-1 workers and join them at the end; memory usage scales linearly, and no complicated synchronization is required. It should be straightforward to adapt an existing implementation to run multiple instances of decoders on preassigned regions of the PNG.
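The decoder skeleton is correspondingly small (a sketch; `decode_segment` stands in for a hypothetical per-region decoder):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_decode(regions, decode_segment):
    """Run one decoder instance per preassigned region and join at
    the end; the workers need no synchronization between them."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        return list(pool.map(decode_segment, regions))
```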

If you encounter an unexpected filter value, you switch to error recovery, which would be mostly error-recovery code you have to add anyway.

The "IDAT inside another IDAT" scenario doesn't seem possible. If all workers have to consume all data in their regions (which is a reasonable requirement) then you are in fact implicitly guarding against that. I wouldn't drop offsets for this reason.

I have an entire page on the rationale behind the mARK chunk and the extension: https://github.com/libspng/png-restart-marker/blob/master/RATIONALE.md

davidje13 commented 1 year ago

I "finished" the reverse-engineering of Apple's iDOT chunk which hackerfactor did the bulk of the work on:

The iDOT chunk contains an arbitrary number of segments, and for each one includes the the row start, row count, and an offset to the first IDAT chunk:

```
uint32 segment_count
array (size = segment_count):
  uint32 first_row
  uint32 row_count
  uint32 byte_offset_of_IDAT_chunk
```

where the byte offsets are relative to the start of the iDOT chunk. This means that in observed files, the first element in the array is always (0, <segment height>, 40), since the IDAT appears immediately after the iDOT chunk, and there are always 2 segments, making 28 bytes of iDOT data + 12 chunk bytes = 40.
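A sketch of a parser for that reverse-engineered layout (field order as described above; not an official format):

```python
import struct

def parse_idot(data: bytes):
    """Parse an iDOT payload into (first_row, row_count,
    idat_offset) tuples; offsets are relative to the start of
    the iDOT chunk."""
    (count,) = struct.unpack_from(">I", data, 0)
    return [struct.unpack_from(">III", data, 4 + 12 * i)
            for i in range(count)]
```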

It seems quite flexible, but also subject to errors. The main drawback is that it is sensitive to any relative movement of the iDOT/IDAT chunks, since the stored byte offsets silently break if anything between those chunks changes.

I think it's also interesting to see the comment on that thread lending credibility to the idea that parallel reading of PNG files is useful in at least one situation:

> nolen on 2021-12-01 18:06
>
> I was involved in adding this to PNGs in ~2011 or so. I don't remember the details but you've got this pretty much right.
>
> The reason was indeed performance: on the first retina iPads, decoding PNGs was a huge portion of the total launch times for some apps, while one of the two cores sat completely idle. My guess is that it's still at least somewhat useful today, even with much much faster single-thread performance, because screen sizes have also grown quite a bit.


I don't have any particular interest in parallel processing of PNG files myself, but I thought having more details on iDOT's approach might be useful for this conversation. (Note this is based entirely on reverse engineering, and I have no idea if Apple's implementation copes with non-typical values, like adding more segments or using unmatched segment sizes.)

ProgramMax commented 1 year ago

Thank you for the work and informing us.

> sensitive to any relative movement of the iDOT/IDAT chunks

This one seems like a problem for existing tooling. I think an editor is allowed to move chunks around (within spec limits) or insert chunks. Making the offsets relative to the start of the first IDAT might have been safer.

nicolas17 commented 1 year ago

@davidje13 Your analysis seems correct. I realized that I never published my analysis (and now I feel bad about the redundant work >_<):

> I just found the fields in that iDOT blog post [hackerfactor's] are slightly wrong

the post says:

```
0: height divisor (number of bands, always 2)
1: unknown, always 0, might be flags
2: divided height
3: unknown, always 0x28
4: first half height
5: second half height
6: idat restart offset, counted from the start of the iDOT chunk
```

and I think it's actually:

```
0: number of bands
1: first row of first band (hence 0)
2: height of first band
3: IDAT offset of first band (from start of iDOT chunk, happens to be 0x28 if the first IDAT is right after iDOT)
4: first row of second band
5: height of second band
6: IDAT offset of second band (from start of iDOT chunk)
```

so fields 7, 8, 9 would continue for the third band.

Additionally, here's a 4-band image, https://data.nicolas17.xyz/idot4.png , which I created with an undocumented Apple API: `CGImageDestinationAddImage(dest, myImage, (CFDictionaryRef)@{@"kCGImagePropertyPNGBandCount": @4});`

fintelia commented 1 year ago

I agree it is very important to specify that a conformant parallel decoder gets exactly the same result as what a sequential decoder would. This means it MUST fall back to sequential decoding if it detects any inconsistency, for instance that the pieces overlap or have any space between them.

Further, the decoder probably SHOULD produce identical output to sequential decoding in less severe edge cases as well.

ProgramMax commented 1 year ago

I would revise the "or have any space between them" part. I could imagine the first thread decoding until it gets to its end, which is a reset marker (or even the code just before it). And I could imagine a gap filled with more reset markers until where thread 2 starts. The purpose of the gap might be to align the second thread's starting location.

For example, on a non-NUMA machine you would want to make sure the two threads don't share a cache line. Otherwise you might get false sharing when writing. On NUMA machines, to avoid false sharing you really need to align on full page sizes. I doubt anyone would want to pad to that. But hey, maybe.

I guess the point is the gap must be filled with reset markers in order to be valid for both sequential and parallel.

fintelia commented 1 year ago

The first thread must consume all the bytes until it gets to the exact location where the second thread started. That's the only way to avoid an "IDAT in IDAT" situation. It is entirely possible for there to be a region at the end of the first segment with no image data (either a sequence of empty IDATs or a bunch of empty deflate blocks), but the decoder has to verify that.
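One way a worker might enforce that (a Python sketch; assumes pieces sit on byte-aligned flush boundaries, and `expected_len` is the piece's hypothetical uncompressed scanline byte count):

```python
import zlib

def decode_piece(deflate_bytes: bytes, expected_len: int) -> bytes:
    """Decompress one raw-deflate piece and verify it matches its
    assigned range exactly."""
    d = zlib.decompressobj(wbits=-15)  # raw deflate, no zlib header
    out = d.decompress(deflate_bytes) + d.flush()
    # Leftover input (the stream ended early) or the wrong amount of
    # output means the segmentation metadata lied about this range.
    if d.unused_data or len(out) != expected_len:
        raise ValueError("piece/range mismatch; fall back to "
                         "sequential decoding")
    return out
```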