s-yata / marisa-trie

MARISA: Matching Algorithm with Recursively Implemented StorAge
Other
511 stars 89 forks source link

Copy more efficiently in Vector::realloc. #26

Open gsivek opened 4 years ago

gsivek commented 4 years ago

The only two Vectors in grimoire::trie::State have char and History as their value types, which are both trivially copyable. Copying via memcpy is significantly faster than iteratively copying individual elements.

s-yata commented 4 years ago

I'm not sure about "Copying via memcpy is significantly faster than iteratively copying individual elements" in actual cases. Do you have any data about this?

State's Vector<T> members (key_buf_ and history_) are not assumed to be long. In addition, realloc is not called so many times if Agent is reused adequately.

jmr commented 4 years ago

Do you have any data about this?

We did get about a 0.5-1% speedup on a large benchmark that uses marisa-trie as a small part. Microbenchmarking wasn't done or was inadequately documented, but could be done.

In addition, realloc is not called so many times if Agent is reused adequately.

Lack of reuse is probably part of the problem, but harder to solve because higher level APIs wrapping marisa-trie might not operate on the Agent paradigm.

Initially, I thought it would be overkill, but now I'm thinking a nicer solution would be to rewrite this using std::uninitialized_move_n (assuming that's as fast as memcpy) and providing a substitute for pre-C++17.

Similarly, destroy_n etc could be used for the other places.

s-yata commented 4 years ago

Vector<T> is used with the following types:

The above non-trivial types have user-defined copy constructors and assignment operators, but they only copy member variables. There is actually no reason to define them. Maybe we can remove them (use default ones) and optimize Vector<T> for trivially copyable types.

s-yata commented 4 years ago

Hmm, I could not confirm significant difference among for-loop, std::memcpy, and std::copy, even if agent reuse is disabled...

jmr commented 4 years ago

Hmm, I could not confirm significant difference among for-loop, std::memcpy, and std::copy, even if agent reuse is disabled...

I saw marisa-benchmark results ranging from 0.5% slower on prefix search to 3.7% faster on predict search, but unfortunately without any -O options. When I tried to enable -O3, the differences went away.

We are seeing a few percent faster other benchmarks, though. Maybe the optimizer is doing different things between the different benchmarks and it's hard to reproduce. Would you accept that there's some situation where memcpy is faster? Would you merge a version that used something like the realloc_move function that @glebm suggested?