salesking / sepa_king

Ruby gem for creating SEPA XML files
MIT License
149 stars 118 forks source link

Extend allowed characters' set #93

Closed kiriakosv closed 2 years ago

kiriakosv commented 3 years ago

Resolves #92.

kiriakosv commented 3 years ago

This refactoring could facilitate adding more characters to the allowed set. @flooose what do you think?

coveralls commented 3 years ago

Coverage Status

Coverage remained the same at 100.0% when pulling bdd9ee6a0b2943d035f710b12751715df9ca3cc4 on kiriakosv:extend-allowed-characters into 38f0552a4e553df603ff4bd01b00bf1fb319eb94 on salesking:master.

coveralls commented 3 years ago

Coverage Status

Coverage remained the same at 100.0% when pulling bdd9ee6a0b2943d035f710b12751715df9ca3cc4 on kiriakosv:extend-allowed-characters into 38f0552a4e553df603ff4bd01b00bf1fb319eb94 on salesking:master.

flooose commented 3 years ago

I seems like it would be fine, but I'd be interested what happens in terms of performance when we start adding more characters to allowed_chars.txt. My understanding of the xls file was that all of the characters marked in green should be allowed. That would be a large regex :)

Any idea of how to test this? Have you tried adding just the Greek alphabet to see if there are any performance hits?

kiriakosv commented 3 years ago

I ran the following code:

require 'benchmark'

STRING_LENGTH = 100_000

regex_filter = /[^a-zA-Z0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"

regex_filter_with_greek = /[^a-zA-ZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter_with_greek = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"

charset = Array('A'..'Z') + Array('a'..'z') + Array(0..9) + %w[ÄÖÜäöüß&*$%] + %w[!^@~\\] + %w[ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ]
my_string = Array.new(STRING_LENGTH) { charset.sample }.join

results = Benchmark.bm do |x|
  x.report('gsub with greek:') { my_string.gsub(regex_filter_with_greek, '') }
  x.report('gsub no greek:') { my_string.gsub(regex_filter, '') }
  x.report('tr with greek:') { my_string.tr(string_filter_with_greek, '') }
  x.report('tr no greek:') { my_string.tr(string_filter, '') }
end

For a STRING_LENGTH of 100 I get:

implementation user system total real
gsub with greek: 0.000020 0.000002 0.000022 ( 0.000018)
gsub no greek: 0.000043 0.000001 0.000044 ( 0.000044)
tr with greek: 0.000020 0.000000 0.000020 ( 0.000020)
tr no greek: 0.000007 0.000001 0.000008 ( 0.000008)

For a STRING_LENGTH of 100.000 I get:

implementation user system total real
gsub with greek: 0.018775 0.000671 0.019446 ( 0.023128)
gsub no greek: 0.040723 0.000317 0.041040 ( 0.053153)
tr with greek: 0.008596 0.000172 0.008768 ( 0.011342)
tr no greek: 0.001963 0.000133 0.002096 ( 0.002097)

What I find somewhat surprising, is the fact that gsub actually improves relatively when the greek characters are added. I ran these multiple times and the improvement was consistent. Having said that, I'm not familiar with the implementation of gsub.

flooose commented 3 years ago

Interesting. I guess, let's see what the maintainers say.

flooose commented 3 years ago

Ping. Maybe no one is responding because the pull request is still a "draft"?

kiriakosv commented 2 years ago

Closed in favour of #105.