kostya / lexbor

Fast HTML5 Parser with CSS selectors. This is the successor of myhtml and is expected to be faster and use less memory.
MIT License

Looking for custom HTML nodes #38

Closed alexkutsan closed 1 year ago

alexkutsan commented 1 year ago

Hi, thanks for such a wonderful wrapper on top of lexbor.

I have a question about custom HTML tags. It looks like the wrapper does not support searching for custom tags in an HTML document. This code:

require "lexbor"
BODY = "<p>exiting_tag</p><mytag>mytag</mytag>"
puts Lexbor::Parser.new(BODY).nodes("p").size 
puts Lexbor::Parser.new(BODY).nodes("mytag").size 

raises an Unknown tag "mytag" exception from https://github.com/kostya/lexbor/blob/41a929d34b1dc799de2753f3273ff9e26f38c145/src/lexbor/utils/tag_converter.cr#L51

This seems to be because the tags that can be searched for are limited to the enum TagIdT: https://github.com/kostya/lexbor/blob/master/src/lexbor/lib/constants.cr#L4

Unfortunately, adding a new value to this enum does not help: the exception is gone, but the node still does not appear in the results of the nodes function.

Is there another way to iterate through custom nodes, or is this a limitation of the original lexbor C implementation?

I briefly tried the lexbor C implementation, and it looks like it is able to extract custom nodes from HTML:

#include <stdio.h>
#include <stdlib.h>  /* exit, EXIT_FAILURE */
#include <string.h>
#include <lexbor/html/html.h>
#include <lexbor/html/interfaces/document.h>
#include <lexbor/html/interfaces/element.h>

#define FAILED(...)                                                            \
    do {                                                                       \
        fprintf(stderr, __VA_ARGS__);                                          \
        fprintf(stderr, "\n");                                                 \
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
    while (0)

/* Exits with an error unless at least one element named tag_name is found. */
void find_tag(lxb_html_document_t *document, const char *tag_name) {
  /* Search from the document node itself so that <body> is reachable too. */
  lxb_dom_element_t *root = lxb_dom_interface_element(document);
  lxb_dom_collection_t *collection = lxb_dom_collection_make(lxb_dom_interface_document(document), 16);
  if (collection == NULL)  FAILED("Failed to create collection");
  size_t tag_size = strlen(tag_name);
  lxb_status_t status = lxb_dom_elements_by_tag_name(root, collection, (const lxb_char_t *) tag_name, tag_size);
  if (status != LXB_STATUS_OK || lxb_dom_collection_length(collection) == 0)  FAILED("Failed to find tag '%s'", tag_name);
  printf("Found tag '%s'\n", tag_name);
  /* Release the collection object and its internal array. */
  lxb_dom_collection_destroy(collection, true);
}

int main(void) {
    static const lxb_char_t html[] = "<html><body><p>hello world</p><mytag>blabla</mytag></body></html>";
    lxb_html_document_t *document = lxb_html_document_create();
    if (document == NULL) FAILED("Failed to create HTML Document");
    lxb_status_t status = lxb_html_document_parse(document, html, sizeof(html) - 1);
    if (status != LXB_STATUS_OK)  FAILED("Failed to parse HTML");

    printf("%s\n", (const char *) html);
    find_tag(document, "p");
    find_tag(document, "body");
    find_tag(document, "mytag");
    lxb_html_document_destroy(document);
    return 0;
}

$ g++ lexbor_try.c -Ilexbor/source -Llexbor/ -llexbor && LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`pwd`/lexbor ./a.out
<html><body><p>hello world</p><mytag>blabla</mytag></body></html>
Found tag 'p'
Found tag 'body'
Found tag 'mytag'

kostya commented 1 year ago

I think you can try at least two ways (I have not tested them):

puts Lexbor::Parser.new(BODY).css("mytag").size
puts Lexbor::Parser.new(BODY).root!.scope.select { |tag| tag.tag_name_slice == "mytag".to_slice }.size

This is not a bug, just an implementation aspect of the nodes method.

alexkutsan commented 1 year ago

Thanks! Are root! and css considered part of the public API? I mean, they won't be removed in the next minor version?

kostya commented 1 year ago

yes