Closed: misos1 closed this issue 4 years ago
The bytes are definitely not undefined and zeroed, as the image buffer is created as a zeroed Vec of bytes that are later filled. It should still not accept such an image; this is probably an error in image-png where a check was forgotten.
Would it not be better, then, to start with a vector of uninitialized values (after things like this are fixed)?
Not really, it's very risky and the benefits are minimal. Allocating a large zeroed vector is marginally slower, and sometimes not slower at all, than allocating an uninitialized vector.
To add details: current computers can zero memory at something like ~50 GiB per second, so the total time to zero a vector is not going to be very large. Plus, at the end of the process the contents of the vector will now be in cache, so subsequent accesses while decoding will be faster, which can further mitigate the overhead.
@HeroicKatora So let's see:
// cargo-deps: rand = "0.7.3"
extern crate rand;

use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};
use std::time::Instant;
use rand::Rng;

fn main()
{
    // Compare plain allocation against zeroed allocation.
    for alloc_func in &[alloc, alloc_zeroed]
    {
        let mut rng = rand::thread_rng();
        let mut useit: usize = 0; // keeps the allocations from being optimized away
        let start = Instant::now();
        for _ in 0..1_000_000
        {
            // Sizes from 1 to 10 MB, roughly typical color images from 640x480 to 1920x1080.
            let layout = Layout::from_size_align(rng.gen_range(1_000_000, 10_000_000), 1).unwrap();
            let ptr = unsafe { alloc_func(layout) };
            useit = useit.wrapping_add(ptr as usize);
            unsafe { dealloc(ptr, layout) };
        }
        println!("{} seconds (sum {:X})", start.elapsed().as_secs_f32(), useit);
    }
}
Example output on my machine:
0.08206253 seconds (sum 2079FCB4B2853320)
176.02649 seconds (sum 20792E067CEC8C00)
Allocations are not so slow. Allocating large zeroed memory is much slower than just the allocation itself; in this test a few thousand times slower. I used allocations from 1 to 10 MB, which is roughly the range for typical color images with resolutions from about 640x480 to 1920x1080.
I understand that using excess unsafe blocks or giving the user a vector of MaybeUninit may not be the best thing. But this can be accomplished without unsafe blocks: for example, construct the vector with with_capacity and push pixels into it (or maybe use a size hint with collect), as in the sketch below.
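A minimal sketch of that with_capacity idea, assuming a decoder that can yield bytes one at a time (collect_pixels and decoded_bytes are hypothetical stand-ins, not part of the image crate API): the buffer is never zeroed up front and no unsafe is needed.

fn collect_pixels(decoded_bytes: impl Iterator<Item = u8>, len: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(len); // one allocation, no zeroing pass
    for b in decoded_bytes.take(len) {
        buf.push(b); // never reads uninitialized memory
    }
    buf
}

fn main() {
    // A simulated "decoded" stream standing in for real pixel data.
    let pixels = collect_pixels((0..).map(|i| (i % 256) as u8), 640 * 480 * 3);
    assert_eq!(pixels.len(), 640 * 480 * 3);
}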
@fintelia Zeroing memory is nothing special; it is as fast as filling it with any other value. In the worst case, writing decompressed values can be as fast as writing zeros, so in the worst case this could mean a 50% performance penalty. If it fits into the L1 or L2 cache then this loss may be smaller, but not because we first wrote zeros into the cache; writing just the decompressed data without first writing zeros will still be faster. But if a processor has 512 kB of L2 cache per core, then it cannot fit even a 640x480 color image. And L3 cache is not much faster than memory. The best would be to do some benchmark to see how big this overhead actually is.
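For illustration, a rough sketch of that worst case (not a rigorous benchmark; slice::fill merely stands in for both the zeroing pass and the decode pass, and results depend heavily on cache sizes):

use std::time::Instant;

fn main() {
    // A 1920x1080 RGB image, about 6 MB: too big for typical L1/L2 caches.
    let n = 1920 * 1080 * 3;
    let mut buf = vec![0u8; n];

    let start = Instant::now();
    buf.fill(0); // the pre-zeroing pass
    buf.fill(0xAB); // stands in for writing decoded data
    println!("two passes: {:?}", start.elapsed());

    let start = Instant::now();
    buf.fill(0xCD); // writing decoded data directly
    println!("one pass:   {:?}", start.elapsed());
}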
@misos1 Marginally slower in the total of reading the image, of course. We are talking about trading some 0.1 ms on an image of, on average, 5 million bytes, which with some formats can itself take 20 ms or more. And even that might be compensated by the zeroing having filled the TLB and other caches, reducing cache misses on subsequent writes of the actual data. The tradeoff of risking a repetition of this needs much bigger benefits. If you have numbers for the end-to-end benchmark then we might begin to discuss which change to some decoders brings enough benefit, but definitely not start at making ImageDecoder basically an unsafe trait.
The much more effective change would be to build on the current system to avoid the allocation by allowing reading into a pre-allocated buffer.
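For illustration, the shape of that change might look something like the following (FakeDecoder and read_image_into are hypothetical stand-ins, not the actual ImageDecoder API):

struct FakeDecoder {
    width: usize,
    height: usize,
}

impl FakeDecoder {
    fn total_bytes(&self) -> usize {
        self.width * self.height * 3
    }
    fn read_image_into(&self, buf: &mut [u8]) {
        buf.fill(0x7F); // stands in for real decoding
    }
}

fn main() {
    // One buffer, reused across many decodes: no dealloc/alloc cycle per image.
    let mut buf = Vec::new();
    for d in [FakeDecoder { width: 640, height: 480 }, FakeDecoder { width: 1920, height: 1080 }] {
        buf.resize(d.total_bytes(), 0); // grows only when needed
        d.read_image_into(&mut buf);
    }
    println!("last buffer size: {}", buf.len());
}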
@HeroicKatora Yes, this makes sense. But 50 GB/s is probably more like multi-core performance; I would expect more like half of that for single-core performance on high-end systems. I do not think this could be compensated by the zero-filling acting as prefetching, for which there are prefetch instructions. But you are right that it is better to fill it with zeros than to use uninitialized values. But why not use the Vec::with_capacity approach? This would pre-allocate the buffer and avoid subsequent allocations.
@HeroicKatora It would be like safe "uninitialized" values. If it is currently not possible to push individual pixels into the vector, then its length can be raised row by row, with each row filled first with zeros and then with decoded image data. A row of an image is much smaller and can fit into the L1 cache, so such zeroing would be really negligible.
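A sketch of that row-by-row idea (decode_row is a hypothetical stand-in for the real decoder): grow the vector one row at a time, zero only that row, then immediately overwrite it with decoded data.

fn decode_row(row: &mut [u8], y: usize) {
    for (x, b) in row.iter_mut().enumerate() {
        *b = ((x + y) % 256) as u8; // stands in for real decoding
    }
}

fn main() {
    let (width, height, channels) = (640usize, 480, 3);
    let row_len = width * channels;
    let mut buf: Vec<u8> = Vec::with_capacity(row_len * height);
    for y in 0..height {
        let start = buf.len();
        buf.resize(start + row_len, 0); // zero only one row; it fits in L1 cache
        decode_row(&mut buf[start..], y); // immediately overwrite with decoded data
    }
    assert_eq!(buf.len(), row_len * height);
}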
@misos1 Because it would commit to Vec<u8> as the backing buffer, which is both inflexible and incorrect for higher-aligned types. Maybe we'll have a specialized interface to avoid the filling at some point, but I doubt that we can improve with our limited time what not even std has stabilized (see std::io::Read::initializer and related experiments). If you want to contribute to this effort, it is much more effective for everyone involved if we focus on eliminating (or reducing) the number of deallocation-allocation cycles involved.
@HeroicKatora Did you mean Vec<T>? I do not see why it must be u8.
I'm interested in how you think ImageDecoder works with a generic parameter T when the actual type of samples involved is dynamically chosen by the decoder?
@HeroicKatora Seems I just misunderstood something. You probably meant the container type, not the element type.
No, I did mean the element type. The decoder will dynamically find the channel format, which might turn out to be several channels of type u8, u16, f32, or even some heterogeneous ones (see tiff, ..). To be able to provide a &mut Vec<u16> requires that the image buffer chooses a type based on the dynamic ColorType information that was decoded. In all instantiations we could find, this would then require either something like constexpr if with std::is_same (which Rust does not have), or an enum with a Vec<u8> and a Vec<u16>, which was far too inflexible.

Ok, I see now. Currently all images are loaded into the right Vec<T>, but through a typecast to a [u8] slice. So you would need to forward Vec<T> to the read_image of the decoders and do things based on the type of T. So it will probably be better to do this after Rust specialization is stabilized, or const if (maybe macros could be used currently).
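For reference, a sketch of the enum alternative mentioned above (SampleBuffer is a hypothetical name, not an actual type in the image crate), showing why it is inflexible: every new sample type forces a new variant and another arm in every match.

enum SampleBuffer {
    U8(Vec<u8>),
    U16(Vec<u16>),
    F32(Vec<f32>),
    // Every new channel type needs a new variant here...
}

impl SampleBuffer {
    fn byte_len(&self) -> usize {
        // ...and another arm in every match over the buffer.
        match self {
            SampleBuffer::U8(v) => v.len(),
            SampleBuffer::U16(v) => v.len() * 2,
            SampleBuffer::F32(v) => v.len() * 4,
        }
    }
}

fn main() {
    let buf = SampleBuffer::U16(vec![0u16; 4]);
    println!("{} bytes", buf.byte_len());
}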
The benchmark posted above exhibits really weird effects because it doesn't actually touch the memory that was allocated. On my machine, for instance, increasing the allocation size 100x or 1000x actually makes the zeroed case several orders of magnitude faster, and the two functions end up performing the same. Not to mention that the allocation pattern is so simple that the allocator only ever returns a half dozen or so distinct pointers.
Try changing the second parameter of Layout::from_size_align from 1 to 256 (or maybe the page size). I am not sure now what is happening, but the output from my machine would also indicate about 31 GB/s of bandwidth, while it should be more like 16.
And I want to remark that using jemalloc instead of the system allocator (on a gnu-linux target with kernel 5.14) also brings both into the same range of performance, and makes alloc_zeroed perform much better than manual zeroing. I've added a loop that touches a page of memory with a volatile write of a single 1u8 as well. So your measurements might just be tainted by the default allocator not being very smart but having a small binary profile instead (if I remember correctly, that was the reason to remove jemalloc somewhere around 1.32). Full code here
I am on kernel 4.15:
0.31884187 seconds (sum C30DA766F1A26000)
0.63516575 seconds (sum C30DA72CF996D800)
73.62745 seconds (sum C30DA2A8E7616000)
So it has somehow prepared already zeroed pages?
> So it has somehow prepared already zeroed pages?

Or it will lazily zero those pages. Maybe that means it is actually slower if one tries to fill them very rapidly. It also means that there is no concrete "better", and profiling is much more useful than speculating on the exact effects of eliding some initialization.
This is the magic of MAP_ANONYMOUS and other mmap arguments; one can spend an incredible amount of time obsessing over microprofiling here, I think. And hence why I said that time is better spent on other, easier and clearer optimizations first.
Yeah, Linux does lazy allocation for calls to mmap. It just records that the allocation happened, but doesn't do any work or actually map the page until the application tries to access it. When/if the application does access the memory, it will get a page fault, at which point the kernel will zero a page of memory and map it at the given address. (This is why performance is always the same: Linux requires that the page be zero, so there is no extra work for alloc_zeroed vs. plain alloc.)
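A small sketch of this effect (assuming 4096-byte pages and an allocation large enough that the allocator services it via mmap): allocating zeroed memory is nearly free, while the first touch of each page is where the kernel actually does the zeroing work.

use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::time::Instant;

fn main() {
    // 256 MB, page-aligned.
    let layout = Layout::from_size_align(256 * 1024 * 1024, 4096).unwrap();
    let start = Instant::now();
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null());
    println!("alloc_zeroed: {:?}", start.elapsed());

    let start = Instant::now();
    for offset in (0..layout.size()).step_by(4096) {
        // One volatile write per page faults it in and forces the actual zeroing.
        unsafe { ptr.add(offset).write_volatile(1u8) };
    }
    println!("first touch:  {:?}", start.elapsed());
    unsafe { dealloc(ptr, layout) };
}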
Ok, but now I have to add that vec![0; ...] is (probably) not using alloc_zeroed but does that manual zeroing.
Specifically, vec![0; ...] does use alloc_zeroed as an optimization. (Follow the code flow. It might be a good exercise in reading Rust std code and getting familiar with its internals; it's much more readable than the C++ STL, for example.)
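One way to observe the difference from the outside (a sketch; the exact numbers depend on the allocator and kernel, as discussed above): vec![0; n] is eligible for the alloc_zeroed/calloc fast path, while an iterator-based fill has to write every byte.

use std::time::Instant;

fn main() {
    let n = 64 * 1024 * 1024;

    let start = Instant::now();
    let a = vec![0u8; n]; // eligible for the alloc_zeroed / calloc fast path
    println!("vec![0; n]: {:?} (len {})", start.elapsed(), a.len());

    let start = Instant::now();
    let b: Vec<u8> = std::iter::repeat(0u8).take(n).collect(); // plain alloc, manual writes
    println!("collect:    {:?} (len {})", start.elapsed(), b.len());
}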
Yes, I just found it now; I did not check before I wrote that comment. I was wondering why code with vec![0; ...] was using calloc and thought it was just some LLVM optimization. But it seems that to take advantage of these "fast" zeroed allocations one still has to use jemalloc, which is not the default now.
Expected

Program should panic at image::load_from_memory(bad_png).unwrap(). Loading an invalid image should fail. If there is no IDAT, or the size of the decompressed data is too small to fill the whole image, then loading fails; but not with a zero-sized IDAT.

Actual behaviour

Program runs successfully. image::load_from_memory(bad_png) returns Ok, and the resulting image is black. The raw bytes are all zero and maybe actually undefined.

Reproduction steps