Closed GarthSnyder closed 5 years ago
Nice work @GarthSnyder . Thank you. for your support. Sorry for the delay but I have had little time lately.
I hope to see more of your pull requests
No worries. Thanks again for SwiftSoup! Very useful.
Hi @scinfu I'm curious as to why this PR was closed instead of merged? I had been using these changes for a while and it made some pretty noticeable improvements to performance.
Hi @chickdan, This pr was merged, you do not see changes in SwiftSoup code?
Scinfu
No I am not seeing the changes in the SwiftSoup code. It appears this PR was accidentally closed instead of being merged.
I remember the PR cant be automerged and i merged manually. Let me do a check and eventually merge it again.
Thanks for catching this, @chickdan. I vaguely remember having received some kind of notification about a merge, so I thought this was in, but I don't see the actual code in master either.
If you look at the current history, the comment for 3c640c3 is "Merge branch 'master' into pr/110", and that commit quickly gets merged back to the current master trunk. However, git diff --name-only 109d717..3c640c3 to see the modified files turns up nothing. So I'm thinking there's likely something hinky with that step.
I can re-port onto the current master to simplify the integration if you like.
@GarthSnyder ok, Thank you 🙏
Thank you , merged and released in version 2.3.0.
Thanks for SwiftSoup!
I've been using it to parse HTML fragments from podcast feeds and reduce them to a defined set of tags. Then I render the trees back out according to my own style sheet. SwiftSoup has been working like a champ - it's super reliable and the API fits well with the things I'm using it for.
My only potential reservation is that parsing can be a bit sluggish. Although, I'd also note that I'm particularly sensitive to performance since part of my goal is to parse and render fragments on the fly while they're being scrolled around on screen.
I spent some time looking at SwiftSoup traces to see if there was any low-hanging fruit that might allow for faster parsing. I didn't find anything really glaring (and I did see the many signs of someone having gone down this road before :-), but I did get some solid improvement through a combination of small changes. Overall, I got the benchmark time down from 6.345 seconds to 1.335, not quite 5X faster.
The main issue seems to be that Swift is really slow to to convert non-strings (e.g., arrays of characters) to strings. Evidently, it does all the Unicode decoding and indexing up front. It's about 10X faster to promote a substring into a string versus extracting the codepoints and reassembling them into a new string.
The big place this shows up is in
CharacterReader
, which currently uses a[UnicodeScalar]
input buffer indexed by Ints. Just converting that to aString.UnicodeScalarView
and using string-native indexing speeds up parsing by about 2X. I'm usually wary ofString
's clunky indexing system, but it's actually not bad at all for this application. Pretty much everythingCharacterReader
was doing fits right in without too much of a mismatch.The other place where this makes a big difference is in
StringBuilder
, which is currently backed by aCharacter
array. Just switching this to strings makes a big difference. Actually, there seems to be a particular optimization for[String].reduce("", +)
, so I made the backing a string array. Nothing actually gets assembled until the composite string is extracted. That speeded up parsing by about 25%.Oddly enough, the other large improvement was from avoiding calls to
String.trimmingCharactersIn
. I'm not sure if this is because of bridging (I think this is actually an NSString method) or becauseCharacterSet
s in general are slow. I just added some code to peek at the first and last characters and make sure there was some reason to do a full trim.Other things:
Attributes
was based on anOrderedDictionary
that wasn't used anywhere else. But the average number of attributes on a tag is so low that it doesn't even seem to be worthwhile to put the attributes in a dictionary. Just removingOrderedDictionary
and putting the attributes in an array that's searched linearly seems to give a significant performance gain. (Although of course, the worst case gets significantly worse.)It looks like a binary search was contemplated for looking up HTML entities, but never actually made it into the code. I added it. It helps a bit, although it's not a huge difference.
Finally, it seems to help if all tag sets are spelled out as static arrays and searched with the native Swift
contains()
method. I'm actually really surprised by this and can't really explain why the difference is so significant.I also did a few experiments on
consumeToAny
, which is a major workhorse:Set
s of characters, giantswitch
statements, making sure everything was constant-ified (as I did for tags). None of that panned out.Anyway, you're welcome to take whatever you like from this. If there's a part you'd like segmented out, just let me know and I can reshuffle the commits.
I should caution that it wasn't entirely clear to me how serious I should be about preserving internal APIs. Most everything seems to be marked
open
orpublic
, even things that seem very plumbing-ish, e.g.,StringBuilder
andCharacterReader
. The only API that has changed in a potentially problematic way isCharacterReader
- with all the indexes changing, I didn't try too hard to maintain exact compatibility. So if that's a goal, this code might need some further adjustment.I didn't update the example project or versioning. I added a PerformanceTest scheme that runs the benchmark test (six pages from Amazon, Wirecutter, Google, Reuters, Wikipedia, and GitHub) on macOS. But I always find the target/scheme/configuration/project system a bit confusing; I hope I didn't screw anything else up there. The benchmark files are static so that performance numbers can be comparable over a longer period.