Looking for custom HTML nodes

alexkutsan commented 1 year ago

Hi, thanks for such a wonderful wrapper on top of lexbor.

I have a question regarding custom HTML tags. It looks like the wrapper does not support searching for custom tags in the HTML document. The code

require "lexbor"
BODY = "<p>exiting_tag</p><mytag>mytag</mytag>"
puts Lexbor::Parser.new(BODY).nodes("p").size 
puts Lexbor::Parser.new(BODY).nodes("mytag").size

raises an Unknown tag "mytag" exception from https://github.com/kostya/lexbor/blob/41a929d34b1dc799de2753f3273ff9e26f38c145/src/lexbor/utils/tag_converter.cr#L51

This seems to be because the tags that nodes can search for are limited to the enum TagIdT: https://github.com/kostya/lexbor/blob/master/src/lexbor/lib/constants.cr#L4

Unfortunately, adding a new value to this enum does not help: the exception is gone, but the node still does not appear in the results of the nodes function.

Is there some other approach to iterating through custom nodes? Or is it a limitation of the original lexbor C implementation?

I gave the lexbor C implementation a quick try, and it looks like it is able to extract custom nodes from HTML:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lexbor/html/html.h>
#include <lexbor/html/interfaces/document.h>
#include <lexbor/html/interfaces/element.h>

#define FAILED(...)                                                            \
    do {                                                                       \
        fprintf(stderr, __VA_ARGS__);                                          \
        fprintf(stderr, "\n");                                                 \
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
    while (0)

// Search the whole document for elements named tag_name; exit if none are found.
void find_tag(lxb_html_document_t *document, const char *tag_name) {
  // The document node itself serves as the root of the search.
  lxb_dom_element_t *element = lxb_dom_interface_element(document);
  auto collection = lxb_dom_collection_make(lxb_dom_interface_document(document), 16);
  if (collection == NULL)  FAILED("Failed to create collection");
  size_t tag_size = strlen(tag_name);
  auto status = lxb_dom_elements_by_tag_name(element, collection, (const lxb_char_t *) tag_name, tag_size);
  if (status != LXB_STATUS_OK || lxb_dom_collection_length(collection) == 0)  FAILED("Failed to find tag '%s'", tag_name);
  printf("Found tag '%s'\n", tag_name);
  // Release the collection once the result has been reported.
  lxb_dom_collection_destroy(collection, true);
}

int main() {
    static const lxb_char_t html[] = "<html><body><p>hello world</p><mytag>blabla</mytag></body></html>";
    lxb_html_document_t * document = lxb_html_document_create();
    if (document == NULL) FAILED("Failed to create HTML Document");
    auto status = lxb_html_document_parse(document, html, sizeof(html) - 1);
    if (status != LXB_STATUS_OK)  FAILED("Failed to parse HTML");

    printf("%s\n", (const char *) html);
    find_tag(document,"p");
    find_tag(document,"body");
    find_tag(document,"mytag");
    // Free the parsed document before exiting.
    lxb_html_document_destroy(document);
    return 0;
}

$ g++ lexbor_try.c -Ilexbor/source -Llexbor/ -llexbor && LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`pwd`/lexbor ./a.out
<html><body><p>hello world</p><mytag>blabla</mytag></body></html>
Found tag 'p'
Found tag 'body'
Found tag 'mytag'

kostya commented 1 year ago

I think you can try at least two ways (untested, by the way):

puts Lexbor::Parser.new(BODY).css("mytag").size

puts Lexbor::Parser.new(BODY).root!.scope.select { |tag| tag.tag_name_slice == "mytag".to_slice }.size

This is not a bug, just an implementation detail of the nodes method.

alexkutsan commented 1 year ago

Thanks! Are root! and css considered public? I mean, they won't be removed in the next minor version?

kostya commented 1 year ago

yes

kostya / lexbor

Looking for custom HTML nodes #38