jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

Async / await: strange things happen #170

Open migajek opened 9 years ago

migajek commented 9 years ago

I'm using CQ with async/await (i.e. the async procedure awaits a string being downloaded, then processes it with CQ), with many such calls running simultaneously, like this (pseudocode):

var tasks = urls.Select(ProcessUrl);
Task.WhenAll(tasks).Wait();
...

async Task<Detail> ProcessUrl(string url)
{
 var s = await DownloadToString(url);
 CQ dom = s;
 var link = dom["a.nextPrevLink"]; 
 // some DOM processing here, resulting in 
 return new Detail { ... };
}

It works fine ... in theory.

Sometimes strange bugs happen. While debugging I found that the CsQuery selectors randomly return malformed elements (e.g. an element with a random "class" attribute which CERTAINLY is not in the document). To make sure, I inspected the HTML content and confirmed that CsQuery is wrong. An example: the "link" element (see the pseudocode above) comes back with a class="ad_1byf" attribute, yet when inspecting the s variable there is NO occurrence of "ad_1byf" AT ALL.

In the Immediate Window of Visual Studio I tried to instantiate a new CQ object, but running the same selector returned the same malformed result.

Now I suppose it might be related to reliance on ThreadStatic or something similar, but I'm no expert, nor do I know anything about CsQuery internals...

Any suggestions on how to fix that?

jamietre commented 9 years ago

Are you saying that only when parsing inside the construct like above does this happen? Or if you just try passing the same HTML into CsQuery does it also produce the same incorrect results?

migajek commented 9 years ago

Passing the same HTML to a CQ instance in a separate program instance produces correct results; none of them are malformed.

The problem happens randomly when feeding CQ a string downloaded via async/await, as in the scenario above. When the debugger stopped on my "trap" (which is triggered only when a supposedly malformed element is found), I created a new instance of CQ in the Immediate Window and fed it the same (correct) HTML, and it produced the malformed result. Yet the same (correct) HTML gives correct results in a separate program, and also on a second run of the original program. That's why I call it a random bug, and suspect some internal cache that is not async/await-safe.

I'm trying to create a simple demo reproducing the problem, but no success yet.

jamietre commented 9 years ago

Just to be sure it's not something already fixed, please update to the pre-release package: https://www.nuget.org/packages/CsQuery/1.3.5-beta5

If that doesn't change things see if you can create a repro and I will take a look.

migajek commented 9 years ago

Unfortunately, updating to the prerelease didn't fix the problem.

I was able to reproduce the problem with a much simplified link scraper. The code below is the result of extracting the key async/CQ-related functionality from my project. Please excuse the mess...

First of all, run the collector (option 1) and let it run for a while. Once it starts dumping broken results (on my machine this happens after about "{page:35}" is reached in every "thread"), you can terminate it with Ctrl+C.

Now start the app again and examine any of the dumped results. You'll see the problem.

Here's an example log of the collecting process (screenshot attached).

migajek commented 9 years ago

And the source, of course...

// copyright Michal Gajek, any usage except for CsQuery bug fixing is prohibited
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using CsQuery;

namespace ConsoleApplication1
{
    class Collector
    {
        private static readonly IEnumerable<string> _voivodeships = new[]
        {
            "dolnoslaskie",
            "kujawsko-pomorskie",
            "lubelskie",
            "lubuskie",
            "lodzkie",
            "malopolskie",
            "mazowieckie",
            "opolskie",
            "podkarpackie",
            "podlaskie",
            "pomorskie",
            "slaskie",
            "swietokrzyskie",
            "warminsko-mazurskie",
            "wielkopolskie",
            "zachodniopomorskie"
        };
        private string _baseUri;
        private CancellationToken cancellationToken;
        private Random rnd = new Random();

        public Collector(string baseUri, CancellationToken cancellationToken)
        {
            _baseUri = baseUri;
            this.cancellationToken = cancellationToken;
        }

        public void Work()
        {
            var rnd = new Random();
            while (!cancellationToken.IsCancellationRequested)
            {                
                IEnumerable<Task> tasks = _voivodeships.Select(vname =>
                {                    
                    Console.WriteLine("Starting {0}, back to {1}", vname, DateTime.Now.AddDays(-14));                    
                    return Task.Run(async () => await ProcessClassifieldsListAsync(_baseUri + "/" + vname))
                            .ContinueWith(task => Console.WriteLine("Done {0}", vname));
                });
                Task.WhenAll(tasks).Wait(cancellationToken);

                Console.WriteLine("Done all voivodeships");
                var wait = rnd.Next(1, 20);
                Console.WriteLine("Waiting {0} minutes", wait);
                // wait, but wake up early if cancellation is requested
                cancellationToken.WaitHandle.WaitOne(TimeSpan.FromMinutes(wait));
            }
        }

        public async Task ProcessClassifieldsListAsync(string uri)
        {                        
            DateTime? newestSeenInSession = null;
            while (uri != null)
            {
                string currentHtml = null;
                while (currentHtml == null)
                {
                    if (cancellationToken.IsCancellationRequested)
                        break;
                    currentHtml = await DownloadToStringAsync(uri); //DownloadHelper.DownloadToStringAsync(uri, ProxyMgr);
                }
                if (cancellationToken.IsCancellationRequested)
                    break;
                CQ dom = currentHtml;

                var nextPrevLink = dom["a.pageNextPrev"];
                uri = null;
                if (nextPrevLink.Any())
                {
                    var nextLink = nextPrevLink.Last();
                    if (nextLink.Text().Contains("następna"))
                    {
                        string cls = nextLink.Attr("class");
                        uri = nextLink.Attr("href");
                        // the TRAP!
                        if (!Regex.IsMatch(cls, @"\{page:\d+\}"))
                        {
                            Console.WriteLine("PROBLEM FOUND, dumping!");
                            Console.WriteLine("Links count: {0}", nextPrevLink.Length);
                            ProblemFound(currentHtml, uri,cls);
                        }
                        Console.WriteLine(cls);
                    }
                }
            }            
        }

        private string GetDir()
        {
            return Path.GetDirectoryName(GetType().Assembly.Location);
        }
        private void ProblemFound(string currentHtml, string uri, string cls)
        {
            var fn = Path.Combine(GetDir(), String.Format("fail_{0}.dump", RandomString(8)));
            File.WriteAllText(fn, cls + Environment.NewLine + currentHtml);
        }

        private string RandomString(int size)
        {
            StringBuilder builder = new StringBuilder();
            char ch;
            for (int i = 0; i < size; i++)
            {
                ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * rnd.NextDouble() + 65)));
                builder.Append(ch);
            }

            return builder.ToString();
        }

        private async Task<string> DownloadToStringAsync(string uri)
        {
            // await inside the using block so the WebClient is disposed
            // only after the download has completed
            using (var client = new WebClient())
            {
                client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36");
                client.Encoding = Encoding.UTF8;
                return await client.DownloadStringTaskAsync(new Uri(uri));
            }
        }

        public IEnumerable<string> ListFailDumps()
        {
            return Directory.EnumerateFiles(GetDir()).Where(fn => Regex.IsMatch(fn, @"fail_.{8}\.dump"));
        }

        public void ShowFailDump(int i)
        {
            var fn = ListFailDumps().Skip(i).First();
            Console.WriteLine(File.ReadAllText(fn));
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            var tk = new CancellationTokenSource();
            var c = new Collector("http://olx.pl/nieruchomosci/dzialki/sprzedaz/", tk.Token);

            var cmd = "";
            while (cmd != "q")
            {
                Console.WriteLine("Please select: \n[1] run collector\n[2] examine fail dump\n[q] quit");
                cmd = Console.ReadLine();
                // 1. start collector
                if (cmd.StartsWith("1"))
                    c.Work();
                // 2. list dumps
                else if (cmd.StartsWith("2"))
                {
                    var dumps = c.ListFailDumps();
                    if (!dumps.Any())
                        Console.WriteLine("No dumps to examine");
                    else
                    {
                        foreach (var dump in dumps.Select((x,i) => new {i, Name = x}))
                        {
                            Console.WriteLine("[{0}] {1}", dump.i, Path.GetFileName(dump.Name));
                        }
                        var num = Console.ReadLine();
                        int idx = -1;
                        if (int.TryParse(num, out idx))
                        {
                            var dumpLines = File.ReadAllLines(dumps.Skip(idx).First());
                            var classFound = dumpLines.First();
                            CQ dom = String.Join(Environment.NewLine, dumpLines.Skip(1));
                            var classExpected = dom["a.pageNextPrev"].Last().Attr("class");
                            Console.WriteLine("Found when collecting: {0}", classFound);
                            Console.WriteLine("Expected (found now): {0}", classExpected);
                            if (classExpected != classFound)
                                Console.WriteLine("They're DIFFERENT!");
                            Console.WriteLine("");
                        }
                    }
                }
            }
        }

    }
}
jamietre commented 9 years ago

Great, I will check it out.

migajek commented 9 years ago

@jamietre I believe I've found the problem. In fact, I guess it's not about async/await at all; it's just the heavy use in my scenario.

All of this happens because of the static HTML token dictionary. In the method HtmlData.TokenizeImpl there's a comment which says "if for some reason we go over 65,535, will overflow and crash. no need to check". That assumption seems to be bad, as the overflow check is disabled by default. Anyway, my reproduction code (above) just happens to reach the limit of 65,535 tokens around page 36 of the results.

![image](https://cloud.githubusercontent.com/assets/581492/5255371/a35246c6-79bb-11e4-9068-8ba6bec20763.png)

is there any easy way to get rid of static HtmlData tokens in favor of per-instance scenario?
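For anyone following along: CsQuery's actual internals differ, but the failure mode described above can be sketched with a toy token table (all names below are hypothetical, not CsQuery's real code). In unchecked code a `ushort` counter silently wraps past 65,535, so a newly parsed token reuses an old id and overwrites the old string, which is exactly how a selector can return a class name that never appeared in the document:

```csharp
using System;
using System.Collections.Generic;

// Toy model (hypothetical, not CsQuery's real code): a process-wide
// token table that interns strings and hands out ushort ids.
static class TokenTable
{
    static readonly Dictionary<string, ushort> Ids = new Dictionary<string, ushort>();
    static readonly List<string> Strings = new List<string>();
    static ushort _next; // in unchecked code, 65535 + 1 wraps to 0

    public static ushort Tokenize(string s)
    {
        ushort id;
        if (!Ids.TryGetValue(s, out id))
        {
            id = _next++;
            Ids[s] = id;
            if (id < Strings.Count)
                Strings[id] = s;   // after the wrap: silently overwrites an old token
            else
                Strings.Add(s);
        }
        return id;
    }

    public static string Lookup(ushort id) { return Strings[id]; }
}

class Demo
{
    static void Main()
    {
        // Fill the table with 65,536 distinct "class names": ids 0..65535.
        for (int i = 0; i < 65536; i++)
            TokenTable.Tokenize("cls_" + i);

        ushort first = TokenTable.Tokenize("cls_0"); // cached, id 0
        TokenTable.Tokenize("ad_1byf");              // counter wrapped: also gets id 0

        // The old token now resolves to a string it was never given.
        Console.WriteLine(TokenTable.Lookup(first)); // "ad_1byf", not "cls_0"
    }
}
```

This also explains the randomness: the corruption only begins once the process-wide table crosses the 65,535th distinct token, which depends on how many pages have been parsed, not on any particular document.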

jamietre commented 9 years ago

Well, holy crap. I guess I was as wrong as Bill Gates. I need to look at the source, but I am kind of shocked this would ever exceed 65k, because I thought this was only tag names. In any event it would probably make more sense to compile with int rather than short tokens. I can't believe I put that in unchecked code; I must have been testing performance optimizations long ago and forgot about it.

Anyway, good catch. I should be able to push an update tomorrow.

migajek commented 9 years ago

It also stores CSS class names (that's how I found this token dictionary; I was just following the call stack). Replacing short with int is not really a solution for me. My use case is supposed to run continuously for weeks: it reaches the limit of 2^16-1 in under 2 minutes of work, which means it's possible to reach 2^32-1 in about 70 days, so that limit is reachable in real-world usage too. Of course you could use long instead, which would stretch it to thousands of years in my scenario, but don't forget that my scenario's speed is limited by download speeds; some other use cases might be much, much faster.

Also, once parsed, the tokens are kept forever, which is not really memory-friendly. Is there any justification for keeping it that way (besides the amount of work needed to make it per-instance)?

jamietre commented 9 years ago

I see your point. The rationale was performance; I wasn't really thinking about long-term persistence involving millions of pages. Creating the table on a per-instance basis is certainly a straightforward solution. The only caveat is that a number of the tokens are hardcoded. This means that either a single reference table would have to be populated with all the hardcoded tokens each time the entity was created, or I would need to use the existing static dictionary for the predefined tokens and another one for each instance, meaning two lookups would potentially be required for tokenization.

I don't have a good sense of how much this would impact performance. Even though you have to worry about download speeds, I assume you are using lots of concurrency and not just waiting around, so you would still be affected by the actual efficiency of the processing.

In any event this is kind of academic, it's broken and that matters a lot more than micro optimizations. I can fix this and worry about optimization strategies later.

I think the most performant basic approach will be a central cache for the predefined tokens and a local cache for everything else. If there were a way to define initial data in a dictionary at compile time, it would be a no-brainer to just put it all in each instance, but I don't know if that's possible; this indicates that collection initializers are still executed at runtime:

http://stackoverflow.com/questions/26925831/will-compiler-optimize-collections-initialization
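A minimal sketch of the two-level scheme described above (names are illustrative, not CsQuery's actual API): predefined tokens live in a single shared read-only table built once per process, while everything parsed at runtime goes into a per-document table, at the cost of the extra lookup mentioned:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the two-level token cache discussed above
// (illustrative names, not CsQuery's real API).
class TokenCache
{
    // Level 1: shared, read-only, populated once per process with the
    // hardcoded tokens (tag and attribute names, etc.).
    static readonly Dictionary<string, int> Predefined = BuildPredefined();

    // Level 2: per-instance; grows only for the lifetime of one document
    // and is garbage-collected together with it.
    readonly Dictionary<string, int> _local = new Dictionary<string, int>();
    readonly List<string> _localStrings = new List<string>();
    const int LocalBase = 1 << 16; // local ids start above the predefined range

    static Dictionary<string, int> BuildPredefined()
    {
        var d = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        int id = 0;
        foreach (var t in new[] { "a", "div", "span", "input", "class", "href" })
            d[t] = id++;
        return d;
    }

    public int Tokenize(string s)
    {
        int id;
        if (Predefined.TryGetValue(s, out id))
            return id;                      // lookup 1: shared table
        if (!_local.TryGetValue(s, out id)) // lookup 2: per-instance table
        {
            id = LocalBase + _localStrings.Count;
            _local[s] = id;
            _localStrings.Add(s);
        }
        return id;
    }
}

class Demo
{
    static void Main()
    {
        var doc1 = new TokenCache();
        var doc2 = new TokenCache();
        Console.WriteLine(doc1.Tokenize("div"));      // shared predefined id
        Console.WriteLine(doc1.Tokenize("ad_1byf")); // 65536: first local token
        Console.WriteLine(doc2.Tokenize("xyz"));     // also 65536: tables are independent
    }
}
```

With this shape, two documents no longer share parsed tokens, so a long-running process can't corrupt one document's tokens with another's; the price is the second dictionary lookup mentioned above.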

migajek commented 9 years ago

I don't know the CsQuery internals well enough, but it sounds like a good idea to have a shared static readonly dictionary for the built-in tokens and a per-instance dictionary for the parsed ones. Anyway, you're right: any approach that fixes the problem will do.

AFAIK there's no way to have truly compile-time initialization of a Dictionary.

I've done some benchmarking: the population of noChildElements, voidElements ... hasValueAttribute takes ca. 0.05 ms on my machine (Core i7, 12 GB RAM, Win7).


Full source (sorry for the indentation, it was written in LINQPad): http://pastebin.com/yj98nk4C

Any chance of getting this fixed soon? My product depends on it and I'd like to avoid switching to some other HTML parser :(

jamietre commented 9 years ago

Yeah, I will get it fixed this week.

migajek commented 9 years ago

Hi, it's been a while; any updates here? I'm about to choose an HTML parsing engine for a new project and would love to use CsQuery again, but I'm afraid of being affected by this issue again.