Get performance within at least 2x of std::vec - Githubissues

rust-cv / header-vec

Allows one to store a header struct and a vector all inline in the same memory on the heap and share weak versions for minimizing random lookups in data structures

MIT License

5 stars 2 forks source link

Get performance within at least 2x of std::vec #4

Closed vadixidav closed 3 years ago

vadixidav commented 3 years ago

In #2, it was identified that HeaderVec performs several times slower than Vec. While this isn't too unusual, it would be great to go faster. Now that we have benchmarks (see #3), we can use these to push the performance further. Getting within at least 2x the performance of Vec should be reasonable, especially considering that HeaderVec is mostly concerned with reducing random accesses in graphs, so memory latency/random lookup is typically a greater concern. However, there is no reason that the implementation can't go faster than it is today, so this should be done.

This issue is to track performance increases until we get to within a reasonable margin (2x) of Vec. This target can be moved if necessary.

makoConstruct commented 3 years ago

I've got a bit of a speedup by moving the reallocate branch out to its own function and marking it as cold, as RawVec does. I'm not sure what to do next.

(I tried using old_len == old_cap instead of new_len > old_cap, but this seemed to do nothing)

The only other difference I can see in RawVec is that they use alloc::Allocator (the default global, presumably) ::grow() instead of alloc::realloc(). Also something where they very deliberately state assumptions about the layouts. realloc is to be deprecated, so it's possible it's not being maintained very much?

vadixidav commented 3 years ago

Interesting. I will review your PR and see if we can get that merged. We should take a look at grow() as well. If you want to do that, go for it. Right now I am somewhat preoccupied with cv-sfm and ennona, but I can look at this eventually when I have time.

makoConstruct commented 3 years ago

Well, I tried using alloc::alloc::Global.grow() instead of realloc. No noticeable improvements https://github.com/makoConstruct/header-vec/blob/testGA/src/lib.rs#L174

We could try the layout stuff maybe -_- I think we should probably go over and ask for advice from whoever maintains RawVec, at this point.

vadixidav commented 3 years ago

Go for it. For me the largest improvement would actually be if I could store multiple types in separate buffers in the allocation rather than as tuples, but I am busy working on other things at a higher level in Rust CV currently which have a bigger performance impact. Don't feel that you absolutely need to tackle this one problem unless benchmarks are showing that it is important to your downstream performance. Regardless, if you keep working on it, I certainly appreciate it.

makoConstruct commented 3 years ago

I will reach out. It probably isn't the biggest performance concern for my application, but it bugs me.

vadixidav commented 3 years ago

Hah, I know the feeling. Appreciate it.

makoConstruct commented 3 years ago

deallocating after each bench iteration helps a lot: #6

We have reached 2x performance, but the remaining distance from 1x still bugs me. Something specific to TestA as a header type is causing this.

makoConstruct commented 3 years ago

Okay, resolved. Parity reached. The rest of the slowdown was caused by #[repr(align(128))] on TestA. (Out of curiosity, why were you aligning that?)

vadixidav commented 3 years ago

Okay, resolved. Parity reached. The rest of the slowdown was caused by #[repr(align(128))] on TestA. (Out of curiosity, why were you aligning that?)

In bitarray the alignment is set very high (not that high) because it uses an array that must be aligned for SIMD, but I couldn't find a good way to set the alignment properly since it uses const generics (which means the alignment would be based on the const parameter). I just wanted to make sure that alignments higher than the actual size wouldn't break the code.

It seems it can cause performance issues though, which is understandable. Thanks for hunting this stuff down.