animetosho opened this issue 4 weeks ago
Thanks @animetosho. We'll discuss your findings.
What server box do you have?
What CPU and IO utilization do you see?
What kind of queries do you use?
Appreciate the response!
If it wasn't clear, I'm not reporting an issue here, thus my server/CPU/IO/queries aren't really relevant.
I'm pointing out possible changes to the code which could reduce CPU usage.
I asked about your complete case so that we could reproduce and investigate it locally.
I don't think we should optimize low-level code for a 1-3% speed boost while high-level changes, such as secondary indexes instead of a full scan or WAND instead of plain matching, give us order-of-magnitude speedups.
I see. My setup is (I think) a fairly straightforward plain + delta index. A simplified version of the config looks like:
source src_main {
    sql_field_string = name
    sql_attr_timestamp = dateline
}

source src_main_delta : src_main {}

index idx_main {
    source = src_main
    dict = keywords
    min_infix_len = 2
    expand_keywords = 1
    ngram_len = 1
}

index idx_main_delta : idx_main {
    source = src_main_delta
    killlist_target = idx_main
}

index idx_main_combined {
    type = distributed
    local = idx_main
    local = idx_main_delta
}

searchd {
    collation_server = utf8_general_ci
    collation_libc_locale = en_US.UTF-8
}
Most queries are in the form:
SELECT id
FROM idx_main_combined
WHERE MATCH('"keyword1" "keyword2" "keyword3"')
ORDER BY dateline DESC
LIMIT 75 OPTION max_matches=7500;
Note that I wasn't really trying to optimise for my specific use case here - just providing a suggestion from observations I've made.
I don't think we should optimize low-level code for a 1-3% speed boost
I generally find performance to be a bit of a "death by 1000 cuts" situation, so putting together various micro-optimisations can help. Sure, it won't deliver the same benefit as a proper high-level setup, but I find such a setup often isn't possible.
I can understand the aversion to micro-optimisations, but would you be interested in PRs that may address them regardless?
I did a search for secondary indexes but couldn't find much info - what I could find seemed to suggest that they cannot be used with a full-text index - am I correct?
Thanks again for the response!
Here's an example solution for the first point in the initial post: https://godbolt.org/z/n9fKbj9fn
I haven't done any testing, but the compiled output looks much better, and since it's a core part of sorting, I can imagine it having some impact.
I don't think the additional code needed is particularly significant, and you may also not need to be as defensive as I'm being.
Yes, you could post a PR with your code changes and reference this ticket there.
Please also post numbers showing what speedup you get with your patch compared to the base code.
Understood.
I'm not sure if I can get much in terms of numbers, but I'll see what I can do.
I've been looking more into the third point in the first post and was wondering whether I got this right:
It looks like every instance of sphIsUTF8 is paired with a call to sphUTF8ToWideChar (and vice versa). It seems like these are only used during wildcard matching, thus the results of these two functions always end up going to sphWildcardMatch.
From what I can tell, the idea is to decode the UTF-8 text to codepoints, which can be skipped if all the text is ASCII. This allows comparison to be performed on codepoints, but since they can be either 8-bit or 32-bit (due to possible UTF-8 decoding), there are 4 versions of the matching function to cater to all possibilities.
If we assume the input UTF-8 is always valid, codepoint comparison only seems to matter when matching against % or ?, as they have the notion of a single codepoint. Thus if the pattern contains neither wildcard, this UTF-8 detection + decode is completely unnecessary. To actually handle % and ? though, perhaps the wildcard matching function could just be made UTF-8 aware for those two tokens, which eliminates the need for upfront conversion.
(though this might not be possible with the dynamic programming variant of the function due to how it works)
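For illustration, a ? could consume a whole codepoint directly on the UTF-8 bytes with a small helper along these lines (the name is made up, and it assumes the input is valid UTF-8):

#include <cstdint>

// Advance past one UTF-8 codepoint without decoding it, so that '?'
// can match a whole codepoint rather than a single byte.
static inline const uint8_t * SkipUtf8Codepoint ( const uint8_t * p )
{
    p++;                            // consume the lead (or plain ASCII) byte
    while ( ( *p & 0xC0 )==0x80 )   // skip continuation bytes (10xxxxxx)
        p++;
    return p;
}

// e.g. when the pattern byte is '?', the matcher would do
//     sString = SkipUtf8Codepoint ( sString );
// instead of advancing by one element of a pre-decoded codepoint array.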
But if the input contains invalid UTF-8, the behaviour can differ. I'm not sure if this is possible or if the behaviour is defined here, but I note that sphUTF8ToWideChar will write -1 when invalid UTF-8 is detected. However, since sphUTF8ToWideChar doesn't perform full validation, it leads me to think that behaviour isn't strictly defined when invalid sequences are present.
I also note that sphWildcardMatchSpec always checks the pattern to determine the matching approach it should take; however, since the pattern is often re-used, it might make sense to check it once ahead of time, instead of doing it on every sphWildcardMatch invocation.
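Roughly what I have in mind, as a sketch (the enum values and the heuristic are placeholders, not the actual logic in sphWildcardMatchSpec):

#include <string>

// Classify the pattern once, up front, instead of re-scanning it on
// every match call against a re-used pattern.
enum class WildcardStrategy_e { EXACT, SIMPLE, DP };

struct PreparedWildcard_t
{
    std::string        m_sPattern;
    WildcardStrategy_e m_eStrategy;

    explicit PreparedWildcard_t ( const char * szPattern )
        : m_sPattern ( szPattern )
    {
        int iStars = 0;
        bool bAnyWildcard = false;
        for ( char c : m_sPattern )
        {
            if ( c=='*' )
                iStars++;
            if ( c=='*' || c=='?' || c=='%' )
                bAnyWildcard = true;
        }

        if ( !bAnyWildcard )
            m_eStrategy = WildcardStrategy_e::EXACT;
        else if ( iStars<=2 )   // arbitrary threshold, just for the sketch
            m_eStrategy = WildcardStrategy_e::SIMPLE;
        else
            m_eStrategy = WildcardStrategy_e::DP;
    }
};

// A match call could then switch on m_eStrategy directly rather than
// inspecting the pattern again.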
Is my understanding roughly accurate, and would these be sensible changes?
(it's also unfortunate that null terminated strings are used everywhere in Sphinxsearch, which can restrict opportunities)
Is my understanding roughly accurate, and would these be sensible changes?
@glookka please advise
Yes, your understanding is mostly accurate. You'll need to create and run benchmarks to determine if implementing the changes makes sense.
Thanks for confirming @glookka !
Proposal:
searchd uses a fair amount of CPU on my server, so I decided to attach perf to see what took most of the time.
The following are mostly observations I've made without having investigated them much, so apologies for the lack of detail. Despite the difficulty with some of these, I mostly list them as potential future opportunities for optimisation, should there ever be any interest.
For background, I'm using plain indexes, and items that topped the profile include:
sphstd::Sort<ExtPayloadEntry_t, SphLess_T<ExtPayloadEntry_t>, SphAccessor_T<ExtPayloadEntry_t> >
It seems like this struct contains two DWORD integers, and the comparator checks them individually. On a 64-bit platform (most platforms today), it'd be quicker to perform a single 64-bit compare. Perhaps the members could be wrapped in a union, and if the sizes match, the comparator could just do a single 64-bit compare instead of two 32-bit ones.
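Something along these lines, as a rough sketch (the member names are illustrative, not the real ExtPayloadEntry_t fields):

#include <cstdint>

struct Entry_t
{
    uint32_t m_uPrimary;     // compared first
    uint32_t m_uSecondary;   // tie-breaker
};

// Pack (primary, secondary) into one value so a single 64-bit compare
// preserves the two-field ordering.
static inline uint64_t PackKey ( const Entry_t & e )
{
    return ( uint64_t ( e.m_uPrimary )<<32 ) | e.m_uSecondary;
}

static inline bool operator< ( const Entry_t & a, const Entry_t & b )
{
    return PackKey ( a ) < PackKey ( b );
}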
ThinMMapReader_c::UnzipInt
I noticed that a bunch of the Unzip* functions grab data one byte at a time. If the length of the buffer is known ahead of time, it might be faster to process multiple bytes at a time.
Also, if the processor has a fast BMI2 implementation, instructions like PEXT could greatly speed up unzipping.
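As a very rough illustration of the PEXT idea (the function name is made up, and it assumes a little-endian load, an LSB-first 7-bit group layout, and at least 8 readable bytes past the cursor; the real Unzip* encoding may order the groups differently, which would need the result reassembled accordingly):

#include <cstdint>
#include <cstring>
#include <immintrin.h>   // _pext_u64 (BMI2), _tzcnt_u64 (BMI1)

// Decode one varint of up to 8 encoded bytes: bit 7 of each byte is the
// continuation flag, the low 7 bits are payload.
static inline uint64_t UnzipIntPext ( const uint8_t * & pCur )
{
    uint64_t uWord;
    memcpy ( &uWord, pCur, 8 );   // assumes 8 readable bytes (e.g. padded buffer)

    // the terminating byte is the first one with bit 7 clear
    uint64_t uTerm = ~uWord & 0x8080808080808080ULL;
    unsigned iLen = unsigned ( _tzcnt_u64 ( uTerm ) >> 3 ) + 1;

    // drop the bytes we didn't consume, then gather the 7-bit groups
    uint64_t uMask = ( iLen==8 ) ? ~0ULL : ( ( 1ULL << ( iLen*8 ) ) - 1 );
    uint64_t uValue = _pext_u64 ( uWord & uMask, 0x7F7F7F7F7F7F7F7FULL );

    pCur += iLen;
    return uValue;
}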
sphIsUTF8
This function processes a string one byte at a time, which doesn't scream fast. SIMD could greatly accelerate this (as well as the various other UTF-8 functions). Libraries like simdutf might be worth investigating.
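As a rough sketch of what that could look like (the function name and the explicit length parameter are my own assumptions; it just checks whether any byte has the high bit set):

#include <cstddef>
#include <emmintrin.h>   // SSE2

// Check 16 bytes per iteration for any byte with the high bit set,
// i.e. anything that is not plain ASCII.
static bool HasNonAscii ( const char * sStr, size_t iLen )
{
    size_t i = 0;
    for ( ; i+16<=iLen; i+=16 )
    {
        __m128i dChunk = _mm_loadu_si128 ( (const __m128i *)( sStr+i ) );
        if ( _mm_movemask_epi8 ( dChunk ) )   // sign bit set in any byte?
            return true;
    }
    for ( ; i<iLen; i++ )                     // scalar tail
        if ( (unsigned char)sStr[i]>=0x80 )
            return true;
    return false;
}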
It seems like most of this originated from sphinxsearch code, so the above is likely applicable there too.
Hope this post provided some value, however small.