Closed kiriakosv closed 2 years ago
This refactoring could facilitate adding more characters to the allowed set. @flooose what do you think?
I seems like it would be fine, but I'd be interested what happens in terms of performance when we start adding more characters to allowed_chars.txt. My understanding of the xls file was that all of the characters marked in green should be allowed. That would be a large regex :)
Any idea of how to test this? Have you tried adding just the Greek alphabet to see if there are any performance hits?
I ran the following code:
require 'benchmark'
STRING_LENGTH = 100_000
regex_filter = /[^a-zA-Z0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"
regex_filter_with_greek = /[^a-zA-ZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter_with_greek = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"
charset = Array('A'..'Z') + Array('a'..'z') + Array(0..9) + %w[ÄÖÜäöüß&*$%] + %w[!^@~\\] + %w[ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ]
my_string = Array.new(STRING_LENGTH) { charset.sample }.join
results = Benchmark.bm do |x|
x.report('gsub with greek:') { my_string.gsub(regex_filter_with_greek, '') }
x.report('gsub no greek:') { my_string.gsub(regex_filter, '') }
x.report('tr with greek:') { my_string.tr(string_filter_with_greek, '') }
x.report('tr no greek:') { my_string.tr(string_filter, '') }
end
For a STRING_LENGTH
of 100
I get:
implementation | user | system | total | real |
---|---|---|---|---|
gsub with greek: | 0.000020 | 0.000002 | 0.000022 | ( 0.000018) |
gsub no greek: | 0.000043 | 0.000001 | 0.000044 | ( 0.000044) |
tr with greek: | 0.000020 | 0.000000 | 0.000020 | ( 0.000020) |
tr no greek: | 0.000007 | 0.000001 | 0.000008 | ( 0.000008) |
For a STRING_LENGTH
of 100.000
I get:
implementation | user | system | total | real |
---|---|---|---|---|
gsub with greek: | 0.018775 | 0.000671 | 0.019446 | ( 0.023128) |
gsub no greek: | 0.040723 | 0.000317 | 0.041040 | ( 0.053153) |
tr with greek: | 0.008596 | 0.000172 | 0.008768 | ( 0.011342) |
tr no greek: | 0.001963 | 0.000133 | 0.002096 | ( 0.002097) |
What I find somewhat surprising, is the fact that gsub
actually improves relatively when the greek characters are added. I ran these multiple times and the improvement was consistent. Having said that, I'm not familiar with the implementation of gsub
.
Interesting. I guess, let's see what the maintainers say.
Ping. Maybe no one is responding because the pull request is still a "draft"?
Closed in favour of #105.
Resolves #92.