jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

CsQuery.HtmlParser.HtmlData.TokenIDs is getting too big! #189

Open ra00l opened 9 years ago

ra00l commented 9 years ago

Hey! First of all, GREAT library. From what I noticed so far, it's pretty fast.

I discovered a problem while using CsQuery on multiple HTML files from a WARC archive containing 4 GB of data. The property CsQuery.HtmlParser.HtmlData.TokenIDs is getting too big for its own good. In my case, after a few minutes, it reaches over 10 MB in size, and because of it, parsing gets slower and slower.

Since you don't know when to clear it, I would suggest just adding a cleanup method, like:

        public static void ClearCachedTokens()
        {
            TokenIDs = new Dictionary<string, ushort>();
        }

jamietre commented 9 years ago

There's actually an open issue related to this (more specifically, the dictionary overflowing). Several solutions have been discussed, such as making the token cache per-instance and using two dictionaries, since the dictionary is initialized with "hardcoded" tokens and it would be expensive to repopulate it each time. Alternatively, it could use a smart cache that evicts tokens that are infrequently accessed, or something along those lines. I just haven't had any time to do it because I don't use CsQuery during my day job any more.

In the meantime your idea is actually not bad, it would at least make the bug manageable by end users. It's not quite that trivial since it would have to be reinitialized with the static data, but still pretty simple compared to the permanent fix.
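
For anyone weighing the "two dictionaries" idea mentioned above, here is a minimal sketch of how it could look. It is not CsQuery's code: the `TokenCache` class, the `GetOrAdd` method, and the sample built-in tokens are all made up for illustration, and the starting ID of 2 is simply borrowed from the `nextID = 2` line later in this thread. The built-in tokens live in one shared static dictionary that is populated once, while each instance keeps its own dictionary for tokens discovered during parsing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a per-instance token cache; not CsQuery's implementation.
public sealed class TokenCache
{
    // Shared by all instances and only written during static initialization.
    private static readonly Dictionary<string, ushort> BuiltIn =
        new Dictionary<string, ushort>();

    // First ID available for tokens that are not built in.
    private static readonly ushort FirstDynamicId;

    static TokenCache()
    {
        ushort id = 2; // assumption: mirrors nextID = 2 from the thread
        foreach (var name in new[] { "div", "span", "class", "id" }) // sample tokens only
        {
            BuiltIn[name] = id++;
        }
        FirstDynamicId = id;
    }

    // Tokens seen by this instance only; garbage collected along with the instance.
    private readonly Dictionary<string, ushort> dynamicTokens =
        new Dictionary<string, ushort>();

    private ushort nextId = FirstDynamicId;

    public int DynamicCount
    {
        get { return dynamicTokens.Count; }
    }

    public ushort GetOrAdd(string token)
    {
        ushort id;
        if (BuiltIn.TryGetValue(token, out id)) return id;
        if (dynamicTokens.TryGetValue(token, out id)) return id;

        // Fail loudly instead of letting the ushort counter wrap around.
        if (nextId == ushort.MaxValue)
        {
            throw new InvalidOperationException("Token ID space exhausted.");
        }

        id = nextId++;
        dynamicTokens.Add(token, id);
        return id;
    }
}
```

With something like this, dropping a `TokenCache` instance releases its dynamic tokens without touching the built-in table, so nothing has to be repopulated. The trade-off is that IDs for dynamic tokens are only meaningful within the instance that issued them.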

ra00l commented 9 years ago

Jamie, thanks for your quick answer. This morning I realised the code I had wasn't working :) So after a bit of trial and error, this seems to work:

        public static void ClearCachedTokens()
        {
            nextID = 2;
            TokenIDs = new Dictionary<string, ushort>(); // not necessary actually, it's in the constructor
            Tokens = new List<string>();
            InitInternal();
        }

InitInternal is actually the contents of the static constructor method, moved outside to be reused.

Maybe this helps anyone interested.
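
To make the refactoring described above concrete, here is a self-contained sketch of the pattern: the static constructor delegates to an `InitInternal` method so that a reset can reuse the same initialization. The `TokenTable` class, its `Tokenize` and `Count` members, the sample tokens, and the lock are assumptions added for illustration; only `ClearCachedTokens`, `InitInternal`, `nextID`, `TokenIDs`, and `Tokens` come from the thread, and the real `HtmlData` class will differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the static token table; not CsQuery's source.
public static class TokenTable
{
    private static readonly object Sync = new object();
    private static Dictionary<string, ushort> tokenIDs;
    private static List<string> tokens;
    private static ushort nextID;

    // The static constructor only delegates, so the same code can run again on reset.
    static TokenTable()
    {
        InitInternal();
    }

    private static void InitInternal()
    {
        nextID = 2; // value taken from the thread; the reason for starting at 2 is not stated there
        tokenIDs = new Dictionary<string, ushort>();
        tokens = new List<string>();
        foreach (var name in new[] { "div", "span", "class" }) // sample built-in tokens
        {
            Add(name);
        }
    }

    private static ushort Add(string token)
    {
        ushort id = nextID++;
        tokenIDs.Add(token, id);
        tokens.Add(token);
        return id;
    }

    public static int Count
    {
        get { lock (Sync) { return tokenIDs.Count; } }
    }

    // Returns the existing ID for a token, or assigns the next free one.
    public static ushort Tokenize(string token)
    {
        lock (Sync)
        {
            ushort id;
            return tokenIDs.TryGetValue(token, out id) ? id : Add(token);
        }
    }

    // Drops every token added at runtime and restores the built-in set.
    public static void ClearCachedTokens()
    {
        lock (Sync) { InitInternal(); }
    }
}
```

One caveat with any reset of this kind: IDs handed out before the reset can be reassigned to different strings afterwards, so it is presumably only safe to call between batches, once documents parsed earlier are no longer in use.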