scinfu / SwiftSoup

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)
https://scinfu.github.io/SwiftSoup/
MIT License
4.53k stars 345 forks source link

Performance tweaks #110

Closed GarthSnyder closed 5 years ago

GarthSnyder commented 5 years ago

Thanks for SwiftSoup!

I've been using it to parse HTML fragments from podcast feeds and reduce them to a defined set of tags. Then I render the trees back out according to my own style sheet. SwiftSoup has been working like a champ - it's super reliable and the API fits well with the things I'm using it for.

My only potential reservation is that parsing can be a bit sluggish. Although, I'd also note that I'm particularly sensitive to performance since part of my goal is to parse and render fragments on the fly while they're being scrolled around on screen.

I spent some time looking at SwiftSoup traces to see if there was any low-hanging fruit that might allow for faster parsing. I didn't find anything really glaring (and I did see the many signs of someone having gone down this road before :-), but I did get some solid improvement through a combination of small changes. Overall, I got the benchmark time down from 6.345 seconds to 1.335, not quite 5X faster.

The main issue seems to be that Swift is really slow to to convert non-strings (e.g., arrays of characters) to strings. Evidently, it does all the Unicode decoding and indexing up front. It's about 10X faster to promote a substring into a string versus extracting the codepoints and reassembling them into a new string.

The big place this shows up is in CharacterReader, which currently uses a [UnicodeScalar] input buffer indexed by Ints. Just converting that to a String.UnicodeScalarView and using string-native indexing speeds up parsing by about 2X. I'm usually wary of String's clunky indexing system, but it's actually not bad at all for this application. Pretty much everything CharacterReader was doing fits right in without too much of a mismatch.

The other place where this makes a big difference is in StringBuilder, which is currently backed by a Character array. Just switching this to strings makes a big difference. Actually, there seems to be a particular optimization for [String].reduce("", +), so I made the backing a string array. Nothing actually gets assembled until the composite string is extracted. That speeded up parsing by about 25%.

Oddly enough, the other large improvement was from avoiding calls to String.trimmingCharactersIn. I'm not sure if this is because of bridging (I think this is actually an NSString method) or because CharacterSets in general are slow. I just added some code to peek at the first and last characters and make sure there was some reason to do a full trim.

Other things:

I also did a few experiments on consumeToAny, which is a major workhorse: Sets of characters, giant switch statements, making sure everything was constant-ified (as I did for tags). None of that panned out.

Anyway, you're welcome to take whatever you like from this. If there's a part you'd like segmented out, just let me know and I can reshuffle the commits.

I should caution that it wasn't entirely clear to me how serious I should be about preserving internal APIs. Most everything seems to be marked open or public, even things that seem very plumbing-ish, e.g., StringBuilder and CharacterReader. The only API that has changed in a potentially problematic way is CharacterReader - with all the indexes changing, I didn't try too hard to maintain exact compatibility. So if that's a goal, this code might need some further adjustment.

I didn't update the example project or versioning. I added a PerformanceTest scheme that runs the benchmark test (six pages from Amazon, Wirecutter, Google, Reuters, Wikipedia, and GitHub) on macOS. But I always find the target/scheme/configuration/project system a bit confusing; I hope I didn't screw anything else up there. The benchmark files are static so that performance numbers can be comparable over a longer period.

scinfu commented 5 years ago

Nice work @GarthSnyder . Thank you. for your support. Sorry for the delay but I have had little time lately.

I hope to see more of your pull requests

GarthSnyder commented 5 years ago

No worries. Thanks again for SwiftSoup! Very useful.

chickdan commented 4 years ago

Hi @scinfu I'm curious as to why this PR was closed instead of merged? I had been using these changes for a while and it made some pretty noticeable improvements to performance.

scinfu commented 4 years ago

Hi @chickdan, This pr was merged, you do not see changes in SwiftSoup code?

Scinfu

chickdan commented 4 years ago

No I am not seeing the changes in the SwiftSoup code. It appears this PR was accidentally closed instead of being merged.

scinfu commented 4 years ago

I remember the PR cant be automerged and i merged manually. Let me do a check and eventually merge it again.

GarthSnyder commented 4 years ago

Thanks for catching this, @chickdan. I vaguely remember having received some kind of notification about a merge, so I thought this was in, but I don't see the actual code in master either.

If you look at the current history, the comment for 3c640c3 is "Merge branch 'master' into pr/110", and that commit quickly gets merged back to the current master trunk. However, git diff --name-only 109d717..3c640c3 to see the modified files turns up nothing. So I'm thinking there's likely something hinky with that step.

I can re-port onto the current master to simplify the integration if you like.

scinfu commented 4 years ago

@GarthSnyder ok, Thank you 🙏

scinfu commented 4 years ago

Thank you , merged and released in version 2.3.0.