Extend allowed characters' set

kiriakosv commented 3 years ago

Resolves #92.

kiriakosv commented 3 years ago

This refactoring could facilitate adding more characters to the allowed set. @flooose what do you think?

coveralls commented 3 years ago

Coverage remained the same at 100.0% when pulling bdd9ee6a0b2943d035f710b12751715df9ca3cc4 on kiriakosv:extend-allowed-characters into 38f0552a4e553df603ff4bd01b00bf1fb319eb94 on salesking:master.

coveralls commented 3 years ago

Coverage remained the same at 100.0% when pulling bdd9ee6a0b2943d035f710b12751715df9ca3cc4 on kiriakosv:extend-allowed-characters into 38f0552a4e553df603ff4bd01b00bf1fb319eb94 on salesking:master.

flooose commented 3 years ago

I seems like it would be fine, but I'd be interested what happens in terms of performance when we start adding more characters to allowed_chars.txt. My understanding of the xls file was that all of the characters marked in green should be allowed. That would be a large regex :)

Any idea of how to test this? Have you tried adding just the Greek alphabet to see if there are any performance hits?

kiriakosv commented 3 years ago

I ran the following code:

require 'benchmark'

STRING_LENGTH = 100_000

regex_filter = /[^a-zA-Z0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"

regex_filter_with_greek = /[^a-zA-ZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter_with_greek = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"

charset = Array('A'..'Z') + Array('a'..'z') + Array(0..9) + %w[ÄÖÜäöüß&*$%] + %w[!^@~\\] + %w[ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ]
my_string = Array.new(STRING_LENGTH) { charset.sample }.join

results = Benchmark.bm do |x|
  x.report('gsub with greek:') { my_string.gsub(regex_filter_with_greek, '') }
  x.report('gsub no greek:') { my_string.gsub(regex_filter, '') }
  x.report('tr with greek:') { my_string.tr(string_filter_with_greek, '') }
  x.report('tr no greek:') { my_string.tr(string_filter, '') }
end

For a STRING_LENGTH of 100 I get:

implementation	user	system	total	real
gsub with greek:	0.000020	0.000002	0.000022	( 0.000018)
gsub no greek:	0.000043	0.000001	0.000044	( 0.000044)
tr with greek:	0.000020	0.000000	0.000020	( 0.000020)
tr no greek:	0.000007	0.000001	0.000008	( 0.000008)

For a STRING_LENGTH of 100.000 I get:

implementation	user	system	total	real
gsub with greek:	0.018775	0.000671	0.019446	( 0.023128)
gsub no greek:	0.040723	0.000317	0.041040	( 0.053153)
tr with greek:	0.008596	0.000172	0.008768	( 0.011342)
tr no greek:	0.001963	0.000133	0.002096	( 0.002097)

What I find somewhat surprising, is the fact that gsub actually improves relatively when the greek characters are added. I ran these multiple times and the improvement was consistent. Having said that, I'm not familiar with the implementation of gsub.

flooose commented 3 years ago

Interesting. I guess, let's see what the maintainers say.

flooose commented 3 years ago

Ping. Maybe no one is responding because the pull request is still a "draft"?

kiriakosv commented 2 years ago

Closed in favour of #105.

salesking / sepa_king

Extend allowed characters' set #93