iabudiab / HTMLKit

An Objective-C framework for your everyday HTML needs.
MIT License
239 stars 27 forks source link

Large memory consumption (spikes) by HTMLParser #10

Closed dsanghan closed 7 years ago

dsanghan commented 7 years ago

If you load a lot of html into HTMLParser, it uses upwards of 500-600MB for parsing at peak. More autorelease pools need to be added. I found a couple of places that are the worst offenders:

- (void)HTMLInsertionModeInBody:(HTMLToken *)token

- (NSString *)innerHTML

HTMLTokenizer nextObject

Also, for a high number of nodes, always creating NSMutableOrderedSet is not ideal - that uses about 40-50MB.

iabudiab commented 7 years ago

@dsanghan Hey there, I'll take a look at the whole memory consumption situation this weekend. Thanks for reporting 👍

iabudiab commented 7 years ago

@dsanghan Hey there, I have just pushed some fixes on the develop branch regarding memory consumption. I would appreciate it if you could test it again and report back if it has improved or not.

I have included only the most obvious fixes and I am working on refactoring the logic of the parser a bit, in order to improve both performance and memory consumption.

Having said that, I could not reproduce the spikes you described. I could monitor a steady increase in memory allocation but no spikes. Could you please describe your setup including the resources you've used, so that I could reproduce the behaviour more accurately?

dsanghan commented 7 years ago

@iabudiab Here's a test html that used to cause pretty large spikes - hopefully this helps.

test.txt

iabudiab commented 7 years ago

@dsanghan Thanks for the test file. I'll make some tests with it and report back, I'll try to free up some time for it ;)

iabudiab commented 7 years ago

@dsanghan Here is the breakdown of the memory consumption based on the file you've provided. I'll be focusing on the main 4 areas that cause the most consumption and I'll break them into 4 separate comments for clarity.

iabudiab commented 7 years ago

HTMLDocument adpotNode:

adpot node

Poblem Adopting a node casuses the underlying childNodes collection to be allocated even if the node has children, which defies the purpose of the lazy allocation introduced in commit: [b693a60358dc352ad23ca9d1999b093c2c215ad4]

Solution Only allocate childNodes collection when inserting child nodes.

iabudiab commented 7 years ago

Node Pre-Insertion validation

preinsertion validation

Problem I must admit that NSStringFromSelector(_cmd) is much more expensive than I've realized.

Solution The memory used here is only an issue when the HTMLKit validations are activated, which is the default setting. Compiling with the HTMLKIT_NO_DOM_CHECKS flag could help, if you really do not want the validations.

Otherwise we could either cache the selector or just use NSString constants. I think I'll go with constants here.

This also affects all other inline validation methods.

iabudiab commented 7 years ago

End Tag parsing in Body

end tag in body

Problems

iabudiab commented 7 years ago

Start Tag parsing in Body

start tag in body

Problems

iabudiab commented 7 years ago

The first 3 issues should be fixed in 8379cee, 1cd1a91 and 47ec086 which also reduced the total allocation size from 75mb to 55mb

The StackOfOpenElements is a bit trickier. I'll post here again once I've optimized it.

iabudiab commented 7 years ago

@dsanghan Hey again, I've just pushed the optimized code for the StackOfOpenElements

The total memory usage is down to 40mb instead of 75mb 🎉 As a side effect the parsing performance increased by about 30% 😁

I'll be releasing a new version soon. For now the improved version is here: 644c180

Thanks again for bringing this issue to my attention. I would appreciate it if you could give it a test and give me feedback 😉

iabudiab commented 7 years ago

Just released version 2.0.6 with the previous fixes 🎉

Should any other memory related issues arise, please feel free to reopen or create a new issue!