image-rs / image

Encoding and decoding images in Rust

PNG with zero sized IDAT is loading successfully as black image #1075

Closed: misos1 closed this issue 4 years ago

misos1 commented 5 years ago

Expected

The program should panic at image::load_from_memory(bad_png).unwrap(): loading an invalid image should fail. If there is no IDAT chunk at all, or the decompressed data is too small to fill the whole image, loading does fail, but not when the IDAT chunk has zero size.

Actual behaviour

The program runs successfully: image::load_from_memory(bad_png) returns Ok and the resulting image is black. The raw bytes are all zero, or possibly even undefined.

Reproduction steps

extern crate image;

fn main()
{
    // Minimal PNG: signature, IHDR (width 1024, height 1449, 8-bit RGB),
    // a zero-length IDAT chunk, and IEND.
    let bad_png = b"\
        \x89PNG\x0D\x0A\x1A\x0A\
        \x00\x00\x00\x0D\
            IHDR\
            \x00\x00\x04\x00\
            \x00\x00\x05\xA9\
            \x08\x02\x00\x00\x00\
            \x68\x1B\xF7\x46\
        \x00\x00\x00\x00\
            IDAT\
            \x35\xAF\x06\x1E\
        \x00\x00\x00\x00\
            IEND\
            \xAE\x42\x60\x82\
    ";
    let img = image::load_from_memory(bad_png).unwrap();
    let data = img.as_rgb8().unwrap().iter().copied();
    let zeros = std::iter::repeat(0u8).take(1024 * 1449 * 3);
    assert!(data.eq(zeros), "Are they zero or undefined?");
}
HeroicKatora commented 4 years ago

The bytes are definitely not undefined; they are zeroed, as the image buffer is created as a zeroed Vec of bytes that is later filled. The decoder should still not accept such an image; this is probably a bug in image-png where a check was forgotten.

misos1 commented 4 years ago

Would it not be better, then, to start with a vector of uninitialized values (once things like this are fixed)?

HeroicKatora commented 4 years ago

Not really: it is quite risky and the benefits are minimal. Allocating a large zeroed vector is marginally slower, and sometimes not slower at all, than allocating an uninitialized vector.

fintelia commented 4 years ago

To add some details: current computers can zero memory at something like ~50 GiB per second, so the total time spent zeroing a vector is not going to be very large. Plus, at the end of the process the contents of the vector will be in cache, so subsequent accesses while decoding will be faster, which further mitigates the overhead.
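
A rough way to sanity-check that figure is the sketch below (the 1 GiB buffer size is illustrative, and the result depends on the memory subsystem and on whether the buffer fits in cache):

use std::time::Instant;

fn main() {
    // 1 GiB working buffer, filled with non-zero data so every page is backed
    // before timing starts and caches do not dominate the result.
    const SIZE: usize = 1 << 30;
    let mut buf = vec![1u8; SIZE];

    let start = Instant::now();
    for byte in buf.iter_mut() {
        *byte = 0;
    }
    let secs = start.elapsed().as_secs_f64();

    // Read the buffer back so the zeroing loop cannot be optimized away.
    assert!(buf.iter().all(|&b| b == 0));
    println!("~{:.1} GiB/s", SIZE as f64 / secs / (1u64 << 30) as f64);
}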

misos1 commented 4 years ago

@HeroicKatora So let's see:

// cargo-deps: rand = "0.7.3"
extern crate rand;
use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};
use std::time::Instant;
use rand::Rng;

fn main()
{
    // Compare plain alloc against alloc_zeroed for random 1-10 MB allocations.
    for alloc_func in &[alloc, alloc_zeroed]
    {
        let mut rng = rand::thread_rng();
        let mut useit: usize = 0;
        let start = Instant::now();
        for _ in 0..1_000_000
        {
            let layout = Layout::from_size_align(rng.gen_range(1_000_000, 10_000_000), 1).unwrap();
            let ptr = unsafe {alloc_func(layout)};
            useit = useit.wrapping_add(ptr as usize);
            unsafe {dealloc(ptr, layout)};
        }
        println!("{} seconds   (sum {:X})", start.elapsed().as_secs_f32(), useit);
    }
}

Example output on my machine:

0.08206253 seconds   (sum 2079FCB4B2853320)
176.02649 seconds   (sum 20792E067CEC8C00)

Allocations themselves are not that slow, but allocating large zeroed memory is much slower than the allocation alone: in this test, a few thousand times slower. I used allocation sizes from 1 to 10 MB, which is roughly the range for typical color images with resolutions from about 640x480 to 1920x1080.

I understand that using extra unsafe blocks, or handing the user a vector of MaybeUninit, may not be the best approach. But this can be accomplished without unsafe blocks, for example by constructing the vector with with_capacity and pushing pixels into it (or maybe by using a size hint with collect), roughly as in the sketch below.
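
A minimal sketch of that idea, with a hypothetical decode_next_row closure standing in for the actual decoding step:

fn read_image_rows(
    width: usize,
    height: usize,
    channels: usize,
    mut decode_next_row: impl FnMut() -> Vec<u8>,
) -> Vec<u8> {
    // Reserve the whole image up front but never pre-fill it: bytes only
    // become visible once they are actually pushed.
    let mut buf = Vec::with_capacity(width * height * channels);
    for _ in 0..height {
        let row = decode_next_row();
        assert_eq!(row.len(), width * channels);
        buf.extend_from_slice(&row);
    }
    buf
}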

@fintelia Zeroing memory is nothing special; it is as fast as filling it with any other value. In the worst case, writing the decompressed values is about as fast as writing zeros, so zeroing first could mean up to a 50% performance penalty. If the buffer fits into L1 or L2 cache the loss may be smaller, but not because we first wrote zeros into the cache: writing just the decompressed data, without writing zeros first, would still be faster. And if the processor has 512 kB of L2 cache per core, it cannot even fit a 640x480 color image, while L3 cache is not much faster than memory. The best thing would be to run a benchmark and see how big this overhead actually is.

HeroicKatora commented 4 years ago

@misos1 Marginally slower in the total time of reading the image, of course. We are talking about trading some 0.1 ms on an image of, on average, 5 million bytes, which with some formats can itself take 20 ms or more to decode. And even that might be compensated by the zeroing having filled the TLB and other caches, reducing cache misses on the subsequent writes of the actual data. The trade-off of risking a repetition of this bug needs much larger benefits. If you have numbers for an end-to-end benchmark, then we might begin to discuss which changes to some decoders bring enough benefit, but we are definitely not going to start by making ImageDecoder basically an unsafe trait.

The much more effective change would be to build on the current system and avoid the allocation entirely, by allowing reads into a pre-allocated buffer.
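
For illustration, a sketch of that kind of buffer reuse against the 0.23-era API, assuming ImageDecoder::read_image(self, buf: &mut [u8]) and image::png::PngDecoder (module paths and trait bounds differ between image versions):

use image::ImageDecoder;
use image::png::PngDecoder; // image::codecs::png::PngDecoder in newer releases
use std::io::Read;

fn decode_all<R: Read>(inputs: Vec<R>) -> image::ImageResult<()> {
    // One buffer, reused (and only grown) across every decoded image.
    let mut buf: Vec<u8> = Vec::new();
    for input in inputs {
        let decoder = PngDecoder::new(input)?;
        let len = decoder.total_bytes() as usize;
        buf.resize(len, 0); // zero-filling happens at most once per growth
        decoder.read_image(&mut buf[..len])?;
        // ... hand &buf[..len] to whatever consumes the raw pixel data ...
    }
    Ok(())
}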

misos1 commented 4 years ago

@HeroicKatora Yes, this makes sense. But 50 GB/s is probably a multi-core figure; I would expect more like half of that for single-core performance on high-end systems. I also do not think the cost could be compensated by the zero-filling acting as a prefetch; there are dedicated prefetch instructions for that. But you are right that it is better to fill the buffer with zeros than to use uninitialized values. Still, why not use the Vec::with_capacity approach? That would pre-allocate the buffer and avoid subsequent allocations.

misos1 commented 4 years ago

@HeroicKatora It would be like safe "uninitialized" values. If it is currently not possible to push individual pixels into the vector, then its length can be raised row by row, with each row first filled with zeros and then with the decoded image data, roughly as sketched below. A single image row is much smaller and fits into L1 cache, so such zeroing would be practically negligible.
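
A minimal sketch of that row-by-row variant, again with a hypothetical decode_row_into closure standing in for the decoder:

fn read_image_row_by_row(
    width: usize,
    height: usize,
    channels: usize,
    mut decode_row_into: impl FnMut(&mut [u8]),
) -> Vec<u8> {
    let row_len = width * channels;
    let mut buf = Vec::with_capacity(row_len * height);
    for _ in 0..height {
        let start = buf.len();
        // Grow by exactly one row: resize zeroes only this small, cache-hot span...
        buf.resize(start + row_len, 0);
        // ...which the decoder immediately overwrites with the real samples.
        decode_row_into(&mut buf[start..]);
    }
    buf
}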

HeroicKatora commented 4 years ago

@misos1 Because it would commit to Vec<u8> as the backing buffer, which is both inflexible and incorrect for higher-aligned types. Maybe we'll have a specialized interface to avoid the filling at some point, but I doubt that with our limited time we can improve on what not even std has stabilized (see std::io::Read::initializer and related experiments). If you want to contribute to this effort, it is much more effective for everyone involved if we focus on eliminating (or reducing) the number of deallocation-allocation cycles involved.

misos1 commented 4 years ago

@HeroicKatora Did you mean Vec<T>? I do not see why it must be u8.

HeroicKatora commented 4 years ago

I'm interested in how you think ImageDecoder would work with a generic parameter T when the actual sample type involved is chosen dynamically by the decoder.

misos1 commented 4 years ago

@HeroicKatora It seems I just misunderstood something. You probably meant the container type, not the element type.

HeroicKatora commented 4 years ago

No, I did mean the element type. The decoder dynamically determines the channel format, which might turn out to be several channels of type u8, u16, f32, or even some heterogeneous ones (see tiff, ..). To be able to provide a &mut Vec<u16>, the image buffer would have to choose its type based on the dynamic ColorType information that was decoded. In all instantiations we could find, this would then require either

misos1 commented 4 years ago

Ok, I see now. Currently all images are loaded into the right Vec<T>, but through a typecast to a [u8] slice. So you would need to forward the Vec<T> to the decoders' read_image and do things based on the type of T. It will probably be better to do this once Rust specialization or something like const if is stabilized (maybe macros could be used for now).
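
As an illustration of that pattern (using the bytemuck crate for the safe cast, which is not something the image API does for you):

fn main() {
    // The caller keeps ownership of the correctly typed sample buffer...
    let mut samples: Vec<u16> = vec![0u16; 1024 * 3]; // one row of 16-bit RGB samples
    // ...and lends out a plain byte view of the same memory
    // (a safe, copy-free reinterpretation via bytemuck).
    let bytes: &mut [u8] = bytemuck::cast_slice_mut(&mut samples);

    // Stand-in for a decoder writing raw bytes into the slice it was given:
    bytes[0] = 0xAB;
    bytes[1] = 0xCD;

    // The write is visible through the typed view; the byte order within each
    // u16 is whatever the decoder wrote.
    println!("first sample: {:#06x}", samples[0]);
}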

fintelia commented 4 years ago

The benchmark posted above exhibits really weird effects because it doesn't actually touch the memory that was allocated. On my machine, for instance, increasing the allocation size 100x or 1000x actually makes the zeroed case several orders of magnitude faster, and the two functions end up performing the same. Not to mention that the allocation pattern is so simple that the allocator only ever returns a half dozen or so distinct pointers.

misos1 commented 4 years ago

Try changing the second parameter of Layout::from_size_align from 1 to 256 (or maybe to the page size). I am not sure what is happening now, but the output from my machine would also indicate about 31 GB/s of bandwidth, when it should be more like 16.

HeroicKatora commented 4 years ago

And I want to remark that using jemalloc instead of the system allocator (on a gnu-linux target with kernel 5.14) also brings both into the same range of performance, and makes alloc_zeroed perform much better than manual zeroing. I've also added a loop that touches each page of memory with a volatile write of a single 1u8. So your measurements might just be tainted by the default allocator not being very smart, in exchange for a small binary profile (if I remember correctly, that was the reason jemalloc was removed somewhere around Rust 1.32). Full code here
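
A sketch of what such a page-touching loop could look like (not the linked code; a 4 KiB page size is assumed), called on the pointer right after the alloc_func(layout) call in the benchmark above:

/// Touch one byte in every 4 KiB page with a volatile write, so the kernel has
/// to actually back each page instead of handing out lazily-mapped zero pages.
unsafe fn touch_pages(ptr: *mut u8, size: usize) {
    let mut offset = 0;
    while offset < size {
        // SAFETY: the caller guarantees `ptr` points to `size` writable bytes.
        unsafe { std::ptr::write_volatile(ptr.add(offset), 1u8) };
        offset += 4096;
    }
}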

misos1 commented 4 years ago

I am on kernel 4.15:

0.31884187 seconds   (sum C30DA766F1A26000)
0.63516575 seconds   (sum C30DA72CF996D800)
73.62745 seconds   (sum C30DA2A8E7616000)

So it has somehow prepared already zeroed pages?

HeroicKatora commented 4 years ago

So it has somehow prepared already zeroed pages?

Or it will lazily zero those pages. Maybe that means it is actually slower if one tries to fill them very rapidly. It also means that there is no single "better" option here, and profiling is much more useful than speculating on the exact effects of eliding some initialization.

HeroicKatora commented 4 years ago

This is the magic of MAP_ANONYMOUS and other mmap arguments; one can spend an incredible amount of time obsessing over micro-profiling here, I think. Hence why I said that time is better spent on other, easier and clearer optimizations first.

fintelia commented 4 years ago

Yeah, Linux does lazy allocation for calls to mmap. It just records that the allocation happened, but doesn't do any work or actually map the pages until the application tries to access them. When (and if) the application does access the memory, it takes a page fault, at which point the kernel zeroes a page of memory and maps it at the given address. (This is why the performance is always the same: Linux guarantees the page will be zeroed, so there is no extra work for alloc_zeroed versus plain alloc.)
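
A quick way to observe that lazy behaviour from Rust (the 1 GiB size and 4 KiB stride are illustrative):

use std::time::Instant;

fn main() {
    const SIZE: usize = 1 << 30; // 1 GiB

    // On Linux this is nearly instant: the kernel only records the mapping.
    let t = Instant::now();
    let mut buf = vec![0u8; SIZE];
    println!("allocate zeroed: {:?}", t.elapsed());

    // The real cost appears on first touch: one page fault per 4 KiB page,
    // each of which the kernel serves with a freshly zeroed page.
    let t = Instant::now();
    let mut sum = 0u64;
    for i in (0..SIZE).step_by(4096) {
        buf[i] = 1;
        sum += buf[i] as u64;
    }
    println!("first touch: {:?} (checksum {})", t.elapsed(), sum);
}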

misos1 commented 4 years ago

Ok, but now I have to add that vec![0; ...] is (probably) not using alloc_zeroed, but does that zeroing manually.

HeroicKatora commented 4 years ago

Specifically, vec![0; ...] does use alloc_zeroed as an optimization. (Follow the code flow; it might be a good exercise in reading Rust std code and getting familiar with its internals. It's much more readable than the C++ STL, for example.)
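
One way to verify this empirically, without reading std's internals, is a global allocator wrapper that counts which entry point gets called (the counts will also include a few allocations from println! itself):

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wraps the system allocator and counts which entry point is used.
struct Counting;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static ZEROED_ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        ZEROED_ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc_zeroed(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

fn main() {
    let v = vec![0u8; 10_000_000];
    drop(v);
    println!(
        "alloc calls: {}, alloc_zeroed calls: {}",
        ALLOCS.load(Ordering::Relaxed),
        ZEROED_ALLOCS.load(Ordering::Relaxed),
    );
}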

misos1 commented 4 years ago

Yes, I just found it now; I did not check before writing that comment. I was wondering why code with vec![0; ...] was using calloc and thought it was just some LLVM optimization. But it seems that to take advantage of these "fast" zeroed allocations one still has to use jemalloc, which is no longer the default.