lexbor / warc_test

A program for testing the Lexbor library's HTML encoding detection and parsing against WARC files.
Apache License 2.0

Post results? #1

Open alexkreidler opened 4 years ago

alexkreidler commented 4 years ago

I'm really curious about the results of this test.

How many pages were parsed correctly? How many had errors? Any fatal errors?

I think this could be a very valuable tool to measure both the performance and the forgiveness/flexibility of the parser, and a great way to convince potential users of lexbor's value.

However, most users are going to at least want to see the highlights of the results on a website or in a markdown document instead of having to clone and run the benchmark themselves.

Thanks for all your hard work!

lexborisov commented 4 years ago

Hi @alexkreidler

This test is for:

  1. Testing the parser on many HTML pages (200+ million) with ASAN and MSAN.
  2. Testing encoding detection and decoding from the page encoding to UTF-8.

Simply put, this is a test on real pages, close to the real use of a parser.

  1. Get the HTML
  2. Determine the encoding of the HTML
  3. Convert it from that encoding to UTF-8
  4. Parse the HTML both as a single buffer and in chunks (stream parsing)
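
To make the four steps above concrete, here is a rough sketch of the same idea in Python. It is not the warc_test code and it does not use lexbor: it relies only on the standard library's `html.parser`, and every name in it (`detect_encoding`, `EventCollector`, `parse_events`, `check_page`) is hypothetical. The encoding detection is deliberately simplistic (BOM, then a `<meta charset>` scan, then a UTF-8 fallback), which is much less than what a real detector does.

```python
import codecs
import re
from html.parser import HTMLParser

# Assumption: a naive <meta charset=...> scan stands in for real detection.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.I)


def detect_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Step 2: guess the page encoding from a BOM or a meta tag."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    match = META_CHARSET.search(raw[:1024])
    if match:
        label = match.group(1).decode("ascii")
        try:
            return codecs.lookup(label).name  # normalised codec name
        except LookupError:
            pass  # unknown label, fall back to the default
    return default


class EventCollector(HTMLParser):
    """Records parser events so two parsing runs can be compared."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Chunked input can split one text run into several callbacks,
        # so adjacent text events are merged before comparing.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def parse_events(text: str, chunk_size=None):
    """Step 4: parse as one buffer (chunk_size=None) or in chunks."""
    parser = EventCollector()
    if chunk_size is None:
        parser.feed(text)
    else:
        for i in range(0, len(text), chunk_size):
            parser.feed(text[i:i + chunk_size])
    parser.close()
    return parser.events


def check_page(raw: bytes, chunk_size: int = 7) -> bool:
    """Steps 2-4 for one page: both parsing modes should agree."""
    encoding = detect_encoding(raw)
    text = raw.decode(encoding, errors="replace")  # step 3: to Unicode text
    return parse_events(text) == parse_events(text, chunk_size)
```

A page passes when `check_page(raw)` returns `True`, meaning the single-buffer and chunked runs produced the same events. Running a check like this over millions of real pages, with the parser built under sanitizers, is the kind of reliability testing described here; the small chunk size is chosen only to force tags and text to be split across chunk boundaries.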

That is, it is a test of reliability, not speed. I will be doing benchmarks shortly. For now you can look at the Elixir benchmark (Elixir has a binding for the lexbor HTML module).

Tests for correct tree construction and compliance with the specification are in lexbor itself.