icyleaf / markd

Yet another markdown parser, Compliant to CommonMark specification, written in Crystal.
MIT License
109 stars 30 forks source link

Refactor entities encoder #5

Closed straight-shoota closed 7 years ago

straight-shoota commented 7 years ago

This PR renames the custom HTML.escape and HTML.unescape methods because their purpose is to encode and decode HTML entities, not escaping HTML special characters.

I also refactored and simplified some related code and removed unused constants in Rule.

When crystal-lang/crystal#4555 get's merged, the custom implementation of Renderer#escape should be replaced with the one from stdlib.

straight-shoota commented 7 years ago

About the string interpolation: I don't think your benchmark is accurate because it uses a constant and LLVM is probably applying some performance tweaks. When using a dynamic value, I measured interpolation beeing between 1.06 and 1.3 times slower which actually just means a few nanoseconds. I think that's negligible in this context. And concatenation is certainly not faster for multiple values. Only concatenating 2 or 3 (short) strings is seemingly faster than the overhead of initializing a StringBuilder for interpolation.

I don't think it is worth sacrificing good coding style for this little improvement. It might even be possible to improve this performance for single-instance interpolations in the compiler. I think I'll open an issue about that.

If you don't mind, I'd rather use interpolation, but I'm happy to change it if you'd prefer it that way.

Benchmark example:

def foo
  rand.to_s
end

Benchmark.ips do |bm|
  bm.report "+ single" { 10.times { "a" + foo }}
  bm.report "# single" { 10.times { "a#{foo}" }}
end
asterite commented 6 years ago

This PR made everything 100 times slower (I thought it was #7 but it's this PR)

asterite commented 6 years ago

I would strongly suggest to have a benchmark somewhere and, before merging a PR, check if the performance gets better or worse.

asterite commented 6 years ago

It seems Decoder.regex is creating a Regex everytime. If we cache it (for example in a constant) the performance problem goes away.

asterite commented 6 years ago

Also this:

      return @@regex if @@regex.source != "^"

      @@regex = Regex.union(HTMLEntities::ENTITIES_MAPPINGS.values)

can be easily replaced with a constant...

straight-shoota commented 6 years ago

@asterite Thanks for the investigation. I was pretty sure I tested performance impact before pushing, but I'm not certain that I did it on this PR... 🤔

Decoder.regex seems like an unlikely candidate because I can't see there were any significant changes to this method or where it is called... Or am I missing something?

asterite commented 6 years ago

I profiled it with XCode's instruments and it pointed right to that method. I didn't investigate it much more, though.

asterite commented 6 years ago

I'm using Crystal's README.md to benchmark this.

Before this PR there was just a decode method, and it was never called. Now decode_entities is called a lot of times. I won't spend more time investigating it, but it seems the logic was changed. In any case, caching the regex solves the problem, though it would be nice to know why decode is now called when previously it wasn't needed and all specs passed.

straight-shoota commented 6 years ago

Ah yes, decode_entities_string now calls HTML.decode_entities every time directly while before it was hidden behind two regexes. I think this is the right thing to to because it doesn't make sense to check with a regex if there is an backslash or ampersand and then run another regex to replace these. It's just that the regex for decode needs to be cached.