Open migajek opened 9 years ago
Are you saying that only when parsing inside the construct like above does this happen? Or if you just try passing the same HTML into CsQuery does it also produce the same incorrect results?
Passing the same HTML to a CQ instance in a separate program produces correct results; none of the elements are malformed.
The problem happens randomly when feeding CQ a string downloaded via async/await, as in the scenario above. When the debugger stopped on my "trap" (which triggers only when a supposedly malformed element is found), I created a new CQ instance in the Immediate window, fed it the same (correct) HTML, and it produced the same malformed result. Yet that same HTML parses correctly in a separate program, and also on a second run of the original program. That's why I call it a random bug and suspect some internal cache that is not async/await-safe.
I'm trying to create a simple demo reproducing the problem, but no success so far.
Just to be sure it's not something already fixed, please update to the pre-release package: https://www.nuget.org/packages/CsQuery/1.3.5-beta5
If that doesn't change things see if you can create a repro and I will take a look.
Unfortunately, updating to the prerelease didn't fix the problem.
I was able to reproduce the problem with a much-simplified link scraper. The code below is the result of extracting the key async/CQ-related functionality from my project. Please excuse the mess...
First of all, run the collector (option 1) and let it run for a while. Once it starts dumping broken results (on my machine this happens after about "{page:35}" is reached in every "thread"), you can terminate it with Ctrl+C.
Now start the app again and examine any of the dumped results. You'll see the problem.
Here's an example log of the collecting process,
and the source, of course...
// copyright Michal Gajek, any usage except for CsQuery bug fixing is prohibited
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using CsQuery;

namespace ConsoleApplication1
{
    class Collector
    {
        private static readonly IEnumerable<string> _voivodeships = new[]
        {
            "dolnoslaskie", "kujawsko-pomorskie", "lubelskie", "lubuskie",
            "lodzkie", "malopolskie", "mazowieckie", "opolskie",
            "podkarpackie", "podlaskie", "pomorskie", "slaskie",
            "swietokrzyskie", "warminsko-mazurskie", "wielkopolskie",
            "zachodniopomorskie"
        };

        private readonly string _baseUri;
        private readonly CancellationToken cancellationToken;
        private readonly Random rnd = new Random();

        public Collector(string baseUri, CancellationToken cancellationToken)
        {
            _baseUri = baseUri;
            this.cancellationToken = cancellationToken;
        }

        public void Work()
        {
            while (!cancellationToken.IsCancellationRequested)
            {
                // one task per voivodeship, all running simultaneously
                IEnumerable<Task> tasks = _voivodeships.Select(vname =>
                {
                    Console.WriteLine("Starting {0}, back to {1}", vname, DateTime.Now.AddDays(-14));
                    return Task.Run(async () => await ProcessClassifiedsListAsync(_baseUri + "/" + vname))
                        .ContinueWith(task => Console.WriteLine("Done {0}", vname));
                });
                Task.WhenAll(tasks).Wait(cancellationToken);
                Console.WriteLine("Done all voivodeships");

                var wait = rnd.Next(1, 20);
                Console.WriteLine("Waiting {0} minutes", wait);
                // sleep between passes, but wake up as soon as cancellation is requested
                cancellationToken.WaitHandle.WaitOne(TimeSpan.FromMinutes(wait));
            }
        }

        public async Task ProcessClassifiedsListAsync(string uri)
        {
            while (uri != null)
            {
                string currentHtml = null;
                while (currentHtml == null)
                {
                    if (cancellationToken.IsCancellationRequested)
                        break;
                    currentHtml = await DownloadToStringAsync(uri);
                }
                if (cancellationToken.IsCancellationRequested)
                    break;

                CQ dom = currentHtml;
                var nextPrevLink = dom["a.pageNextPrev"];
                uri = null;
                if (nextPrevLink.Any())
                {
                    var nextLink = nextPrevLink.Last();
                    if (nextLink.Text().Contains("następna")) // "next" in Polish
                    {
                        string cls = nextLink.Attr("class");
                        uri = nextLink.Attr("href");
                        // the TRAP: the class of the "next" link always contains "{page:N}",
                        // so anything else means CsQuery returned a malformed element
                        if (cls == null || !Regex.IsMatch(cls, @"\{page:\d+\}"))
                        {
                            Console.WriteLine("PROBLEM FOUND, dumping!");
                            Console.WriteLine("Links count: {0}", nextPrevLink.Length);
                            ProblemFound(currentHtml, uri, cls);
                        }
                        Console.WriteLine(cls);
                    }
                }
            }
        }

        private string GetDir()
        {
            return Path.GetDirectoryName(GetType().Assembly.Location);
        }

        private void ProblemFound(string currentHtml, string uri, string cls)
        {
            // dump the offending class plus the raw HTML for later examination
            var fn = Path.Combine(GetDir(), String.Format("fail_{0}.dump", RandomString(8)));
            File.WriteAllText(fn, cls + Environment.NewLine + currentHtml);
        }

        private string RandomString(int size)
        {
            var builder = new StringBuilder();
            for (int i = 0; i < size; i++)
            {
                // random uppercase letter A-Z
                builder.Append(Convert.ToChar(Convert.ToInt32(Math.Floor(26 * rnd.NextDouble() + 65))));
            }
            return builder.ToString();
        }

        private Task<string> DownloadToStringAsync(string uri)
        {
            var client = new WebClient();
            client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36");
            client.Encoding = Encoding.UTF8;
            return client.DownloadStringTaskAsync(new Uri(uri));
        }

        public IEnumerable<string> ListFailDumps()
        {
            return Directory.EnumerateFiles(GetDir()).Where(fn => Regex.IsMatch(fn, @"fail_.{8}\.dump"));
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var tk = new CancellationTokenSource();
            var c = new Collector("http://olx.pl/nieruchomosci/dzialki/sprzedaz/", tk.Token);
            var cmd = "";
            while (cmd != "q")
            {
                Console.WriteLine("Please select: \n[1] run collector\n[2] examine fail dump\n[q] quit");
                cmd = Console.ReadLine() ?? "q";
                // 1. start collector
                if (cmd.StartsWith("1"))
                {
                    c.Work();
                }
                // 2. list dumps and re-check them: compare the class recorded at
                // collection time against what CsQuery finds when re-parsing now
                else if (cmd.StartsWith("2"))
                {
                    var dumps = c.ListFailDumps().ToList();
                    if (!dumps.Any())
                    {
                        Console.WriteLine("No dumps to examine");
                        continue;
                    }
                    foreach (var dump in dumps.Select((x, i) => new { i, Name = x }))
                        Console.WriteLine("[{0}] {1}", dump.i, Path.GetFileName(dump.Name));

                    int idx;
                    if (int.TryParse(Console.ReadLine(), out idx) && idx >= 0 && idx < dumps.Count)
                    {
                        var dumpLines = File.ReadAllLines(dumps[idx]);
                        var classFound = dumpLines.First();
                        CQ dom = String.Join(Environment.NewLine, dumpLines.Skip(1));
                        var classExpected = dom["a.pageNextPrev"].Last().Attr("class");
                        Console.WriteLine("Found when collecting: {0}", classFound);
                        Console.WriteLine("Expected (found now): {0}", classExpected);
                        if (classExpected != classFound)
                            Console.WriteLine("They're DIFFERENT!");
                        Console.WriteLine("");
                    }
                }
            }
        }
    }
}
Great, I will check it out.
@jamietre I believe I've found the problem. In fact it's not really about async/await, I guess; it just gets exposed by the heavy use in my scenario.
All of this happens because of the static HTML tokens dictionary,
in the method HtmlData.TokenizeImpl.
There's a comment there which says "if for some reason we go over 65,535, will overflow and crash. no need to check".
That assumption turns out to be wrong: overflow checking is disabled by default in C#, so instead of crashing, the token id silently wraps around and starts colliding with ids that were already handed out.
Anyway, my reproduction code above just happens to reach the limit of 65,535 tokens around page 36 of the results.
Is there any easy way to get rid of the static HtmlData tokens in favor of a per-instance scenario?
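For anyone following along, the failure mode can be demonstrated in isolation. This is not CsQuery's code, just a minimal sketch of what happens when an unsigned 16-bit token counter is incremented past 65,535 in an unchecked context (the C# default):

```csharp
using System;

class TokenOverflowDemo
{
    static void Main()
    {
        ushort nextTokenId = 65535; // all 16-bit token ids have been handed out

        // In unchecked arithmetic (the C# default) this does NOT crash:
        // the id silently wraps to 0 and starts colliding with ids
        // already assigned to earlier tokens.
        unchecked { nextTokenId++; }
        Console.WriteLine(nextTokenId); // prints 0

        // Only an explicitly checked context would actually throw:
        try
        {
            ushort again = 65535;
            checked { again++; }
        }
        catch (OverflowException)
        {
            Console.WriteLine("checked arithmetic would have caught it");
        }
    }
}
```

So instead of crashing at the limit, the parser starts reusing token ids, which matches the symptom above: selectors returning class names that belong to some other, earlier-parsed document.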
Well, holy crap. I guess I was as wrong as Bill Gates. I need to look at the source, but I'm kind of shocked this would ever exceed 65k, because I thought it was only tag names. In any event it would probably make more sense to compile with int rather than short tokens. I can't believe I put that in unchecked code; I must have been testing performance optimizations long ago and forgot about it.
Anyway, good catch. I should be able to push an update tomorrow.
[screenshot: https://cloud.githubusercontent.com/assets/581492/5255371/a35246c6-79bb-11e4-9068-8ba6bec20763.png]
It also stores CSS class names (that's how I found the tokens dictionary; I was just following the call stack). Replacing short with int is not really a solution for me. My use case is supposed to run continuously for weeks, and it reaches the 2^16 - 1 limit in under 2 minutes of work. At that rate it's possible to reach 2^32 - 1 in roughly three months, so the limit is reachable in real-world usage. Of course you could use long instead, which would stretch it to thousands of years in my scenario, but don't forget that my scenario's speed is limited by download speeds; other use cases might be much, much faster. Also, once parsed, tokens are kept forever, which is not exactly memory-friendly. Is there any justification for keeping it that way (besides the amount of work needed to make it per-instance)?
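To put numbers on that back-of-envelope estimate (my arithmetic, based on the ~2-minute measurement from my run above):

```csharp
using System;

class TokenBudget
{
    static void Main()
    {
        // Measured above: roughly 65,535 distinct tokens consumed in ~2 minutes.
        double tokensPerMinute = 65535.0 / 2.0;

        // How long a 32-bit token id budget would last at the same rate:
        double minutes = uint.MaxValue / tokensPerMinute;
        Console.WriteLine("{0:F0} days", minutes / 60 / 24); // prints "91 days"
    }
}
```

The exact figure obviously depends on download speed; faster scrapers would exhaust even a 32-bit id space proportionally sooner.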
I see your point. The rationale was performance; I wasn't really thinking about long-term persistence involving millions of pages. Creating the table on a per-instance basis is certainly a straightforward solution. The only caveat is that a number of the tokens are hardcoded. This means that either a single reference table would have to be populated with all the hardcoded tokens each time the entity was created, or I would need to use the existing static dictionary for the predefined tokens and another one for each instance, meaning two lookups would potentially be required for tokenization.
I don't have a good sense of how much this would impact performance. Even though you have to worry about download speeds, I assume you are using lots of concurrency and not just waiting around, so you would still be affected by the actual efficiency of the processing.
In any event this is kind of academic: it's broken, and that matters a lot more than micro-optimizations. I can fix this and worry about optimization strategies later.
I think the most performant basic approach will be a central cache for the predefined tokens and a local cache for everything else. If there were a way to define initial data in a dictionary at compile time, it would be a no-brainer to just put it all in each instance, but I don't know if that's possible; this indicates that collection initializers are still executed at runtime:
http://stackoverflow.com/questions/26925831/will-compiler-optimize-collections-initialization
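A rough sketch of that two-level lookup, for discussion purposes only (this is not CsQuery's actual code; the class name, the id offset, and the sample tokens are all made up):

```csharp
using System;
using System.Collections.Generic;

class TokenTable
{
    // Shared, read-only after static initialization: the hardcoded/predefined
    // tokens. Since it is never written to afterwards, it can safely be read
    // from many threads without locking.
    private static readonly Dictionary<string, int> _predefined =
        new Dictionary<string, int>
        {
            { "div", 1 }, { "span", 2 }, { "a", 3 } // ... and so on
        };

    // Per-instance: tokens discovered while parsing this document only.
    // Dies with the instance, so nothing accumulates forever.
    private readonly Dictionary<string, int> _local = new Dictionary<string, int>();
    private int _nextLocalId = 1000; // hypothetical offset above predefined ids

    public int Tokenize(string name)
    {
        int id;
        if (_predefined.TryGetValue(name, out id)) // first lookup: shared table
            return id;
        if (!_local.TryGetValue(name, out id))     // second lookup: this instance
        {
            id = _nextLocalId++;
            _local[name] = id;
        }
        return id;
    }
}
```

This is the "two lookups per token" cost mentioned above, but only unknown tokens pay for the second dictionary probe, and a fresh instance starts with an empty local table, so class names like "ad_1byf" can no longer leak between documents.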
I don't know CsQuery's internals well enough, but it sounds like a good idea to have a shared static readonly dictionary for the built-in tokens and a per-instance dictionary for parsed tokens. Anyway, you're right: any approach that fixes the problem will do.
AFAIK there's no way to get truly compile-time initialization of a Dictionary.
I've done some benchmarking, and populating noChildElements, voidElements, ..., hasValueAttribute takes ca. 0.05 ms on my machine (Core i7, 12 GB RAM, Win7).
Full source (sorry for the indentation, it was written in LINQPad): http://pastebin.com/yj98nk4C
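The full LINQPad source is in the pastebin link above; for readers who just want the shape of the measurement, it follows the usual Stopwatch pattern (the set contents here are a stand-in workload, not CsQuery's actual tables):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class InitBenchmark
{
    static void Main()
    {
        const int iterations = 10000;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // stand-in for populating noChildElements, voidElements, etc.
            var set = new HashSet<string>
            {
                "area", "base", "br", "col", "embed", "hr",
                "img", "input", "link", "meta", "param", "source"
            };
            if (set.Count == 0) throw new InvalidOperationException();
        }
        sw.Stop();
        Console.WriteLine("{0:F4} ms per initialization",
            sw.Elapsed.TotalMilliseconds / iterations);
    }
}
```

Averaging over many iterations is what makes a sub-0.1 ms cost measurable at all; a single run would be dominated by timer resolution and JIT warm-up.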
Any chance of getting this fixed soon? My product depends on it, and I'd like to avoid switching to some other HTML parser :(
Yeah will get it fixed this week.
Hi, it's been a while; any updates here? I'm about to choose an HTML parsing engine for a new project and would love to use CsQuery again, but I'm afraid of being affected by this issue again.
I'm using CQ with async/await (i.e. an async procedure awaits the downloaded string, then processes it with CQ), with many such calls running simultaneously, like this (pseudocode):
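The original pseudocode block did not survive the formatting here; the pattern being described is roughly the following (a reconstruction, not the original code; the URLs and the selector are made up):

```csharp
using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using CsQuery;

class Scraper
{
    static async Task ProcessPageAsync(string uri)
    {
        // await the download...
        string html = await new WebClient().DownloadStringTaskAsync(new Uri(uri));
        // ...then process the result with CsQuery
        CQ dom = html;
        foreach (var link in dom["a.someClass"]) // selector is hypothetical
            Console.WriteLine(link.GetAttribute("href"));
    }

    static void Main()
    {
        var uris = new[] { "http://example.com/page1", "http://example.com/page2" };
        // many of these run simultaneously
        Task.WhenAll(uris.Select(u => ProcessPageAsync(u))).Wait();
    }
}
```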
It works fine ... in theory.
Sometimes strange bugs happened. When debugging, I found that it randomly happens that the CsQuery selectors return malformed elements (like an element with a random "class" attribute which CERTAINLY is not in the document). To make sure, I inspected the HTML content and confirmed that CsQuery is wrong. An example: the "link" element (see the pseudocode above) comes back with the attribute class="ad_1byf", yet when inspecting the downloaded HTML string there is NO occurrence of "ad_1byf" AT ALL. In the Immediate Window of Visual Studio I tried to instantiate a new CQ object, but running the same selector returned the same malformed result.
Now I suspect it might be related to reliance on ThreadStatic or something similar, but I'm no expert, nor do I know much about CsQuery's internals...
Any suggestions on how to fix this?