Question around performance

sergioloom commented 1 year ago

I'm considering using the library to process a large amount of text, so I'm wondering whether performance has been considered in the library's code and testing. It would be useful to add some information about performance to the README.

jo3-l commented 1 year ago

Sorry, I don't have any concrete benchmarks to share. However, given how the library works, I would expect performance to be neither spectacular nor horrible: at its core, it is essentially a wrapper around sets of regular expressions. (When I last worked on this in 2021, I surveyed a number of similar libraries and found that most take a similar approach, so I find it unlikely that Obscenity would be substantially slower than its competitors.) Again, apologies that I don't have any hard numbers to show; it's been a while since I worked on this.

Some unsolicited advice: depending on the specifics of your use case, faster string-matching routines that avoid the overhead of regular expressions may be applicable. If you have no need for wildcards, for example, you could use an Aho-Corasick automaton, which matches multiple strings in a single pass over the text. If your patterns are short and drawn from a small alphabet, you could consider the bitap algorithm, which uses bitwise operations to achieve a nice constant-factor speedup.
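
To make the Aho-Corasick suggestion a bit more concrete, here is a minimal sketch of the idea. It is not code from Obscenity and says nothing about how the library is implemented. It is written in plain Python purely for illustration, and the names `build_automaton` and `find_all` are hypothetical. It assumes non-empty, literal patterns (no wildcards) and exact, case-sensitive matching.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton as three parallel lists indexed by state."""
    goto = [{}]    # goto[state][char] -> next state (the trie edges)
    fail = [0]     # fail[state] -> state for the longest proper suffix that is a trie prefix
    output = [[]]  # output[state] -> patterns that end at this state

    # Phase 1: insert every pattern into the trie (state 0 is the root).
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states always fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (shorter suffix matches).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until some state has an edge for this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The automaton is built once and can then be reused for every input string. The scan visits each character of the text once, plus some work per reported match, regardless of how many patterns are in the set. Whether that actually beats a regex-based approach for your workload is something you would need to measure.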

jo3-l commented 1 year ago

I'm going to close this for now, but feel free to reopen it if you have further questions.

jo3-l / obscenity

Question around performance #22