lexborisov / myhtml

Fast C/C++ HTML 5 Parser. Using threads.
GNU Lesser General Public License v2.1

SEGFAULT when parsing CDATA in single threading mode. #163

Open Jean-Daniel opened 5 years ago

Jean-Daniel commented 5 years ago

When using the parser in MyHTML_OPTIONS_PARSE_MODE_SINGLE mode, it is initialized in myhtml_init like this:

        case MyHTML_OPTIONS_PARSE_MODE_SINGLE:
            if((status = myhtml_create_stream_and_batch(myhtml, 0, 0)))
                return status;

Because this call requests 0 streams, myhtml->thread_stream is initialized to NULL.

myhtml->thread_stream = NULL;

But then, when parsing CDATA (in myhtml_tokenizer_state_markup_declaration_open()), the parser calls myhtml_tree_wait_for_last_done_token(), which unconditionally accesses tree->myhtml->thread_stream->timespec and crashes, because thread_stream is NULL.
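To illustrate the failure mode, here is a small self-contained sketch of a wait routine that checks for a missing stream before touching it. All `sketch_*` types, fields, and the function are hypothetical stand-ins that I made up for the example; they are not myhtml's real definitions, and this is not necessarily how the library should be fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a worker stream (NOT the real myhtml type). */
typedef struct {
    int poll_budget;                 /* how many times we are willing to poll */
} sketch_stream_t;

/* Hypothetical stand-in for the parser object. */
typedef struct {
    sketch_stream_t *thread_stream;  /* NULL when no stream was created */
} sketch_parser_t;

/* Hypothetical stand-in for the tree. */
typedef struct {
    sketch_parser_t *parser;
    const void *token_last_done;     /* last token that has been processed */
} sketch_tree_t;

/*
 * Returns true once token_for_wait is the last processed token.
 * Returns false if it is not and there is nothing to wait on
 * (no stream), or if the poll budget runs out.
 */
static bool sketch_wait_for_last_done_token(sketch_tree_t *tree,
                                            const void *token_for_wait)
{
    assert(tree != NULL && tree->parser != NULL);

    if (tree->token_last_done == token_for_wait)
        return true;                 /* already done, no waiting needed */

    sketch_stream_t *stream = tree->parser->thread_stream;
    if (stream == NULL)
        return false;                /* guard: never dereference a NULL stream */

    /* With a stream: poll a bounded number of times. Real code would
     * sleep between checks while a worker advances token_last_done. */
    for (int i = 0; i < stream->poll_budget; i++) {
        if (tree->token_last_done == token_for_wait)
            return true;
    }
    return false;
}
```

With `thread_stream == NULL`, the sketch returns `false` for a token that is not yet done instead of dereferencing the pointer, which is the access that crashes in the backtrace below.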

Backtrace:

  myhtml_tree_wait_for_last_done_token(tree=., token_for_wait=.) at tree.c:2457
  myhtml_tokenizer_state_markup_declaration_open(tree=., token_node=., html="…", html_offset=413, html_size=378555) at tokenizer.c:943
  myhtml_tokenizer_chunk_process(tree=., html="…", html_length=378555) at tokenizer.c:88
  myhtml_tokenizer_chunk(tree=., html="…", html_length=378555) at tokenizer.c:104
  myhtml_tokenizer_begin(tree=., html="…", html_length=378555) at tokenizer.c:42
  myhtml_parse_fragment(tree=., encoding=MyENCODING_DEFAULT, html="…") at main.c

lexborisov commented 5 years ago

Hi @Jean-Daniel. In single mode, the tokens will always be equal, so the program will not enter the loop. Do you have an example HTML document for which the program enters this loop in single mode?

I found and fixed another problem. Please try the code from master.

Thanks for the report!

Jean-Daniel commented 5 years ago

Sorry, I didn't give you enough info. I'm actually using the parser to extract some data from HTML fragments (I only have the content), and I don't really need a full tree. So I'm using the 'after token done' callback and disabling tree building with MyHTML_TREE_PARSE_FLAGS_WITHOUT_BUILD_TREE.

A quick test reveals that it is the latter flag that triggers the bug. Without it, the parser works flawlessly, but when I set this flag, it crashes on CDATA.

#include <string.h>
#include <myhtml/api.h>

int main(int argc, char **argv) {
  const char *bytes = "<div><![CDATA[ foo ]]></div>";
  size_t length = strlen(bytes);

  myhtml_t* myhtml = myhtml_create();
  myhtml_init(myhtml, MyHTML_OPTIONS_PARSE_MODE_SINGLE, 1, 0);

  myhtml_tree_t* tree = myhtml_tree_create();
  myhtml_tree_init(tree, myhtml);
  myhtml_tree_parse_flags_set(tree, MyHTML_TREE_PARSE_FLAGS_WITHOUT_BUILD_TREE | MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN);

  // parse html (we only have the body)
  myhtml_parse_fragment(tree, MyENCODING_UTF_8, bytes, length, MyHTML_TAG_BODY, MyHTML_NAMESPACE_HTML);
  myhtml_tree_destroy(tree);

  myhtml_destroy(myhtml);
  return 0;
}