lexborisov / myhtml

Fast C/C++ HTML 5 Parser. Using threads.
GNU Lesser General Public License v2.1

Cannot grab entire text of <td> node when it includes a <br> tag #193

Open samueldaniel opened 4 months ago

samueldaniel commented 4 months ago

I am trying to parse a <table>:

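  // Print the text of column `col_idx` in row `row_idx` as "<metric_name>: <text>".
  void parse_metric(
      myhtml_tree_t* tree,
      myhtml_collection_t* rows,
      int row_idx,
      const std::string& metric_name,
      int col_idx) {
      // Collect every <td> under the requested row.
      myhtml_collection_t* cols = myhtml_get_nodes_by_tag_id_in_scope(
          tree, nullptr, rows->list[row_idx], MyHTML_TAG_TD, nullptr);
      // Valid indices are 0 .. length - 1.
      if (cols && cols->list && col_idx >= 0 &&
          static_cast<size_t>(col_idx) < cols->length) {
          // myhtml_node_child() returns only the first child of the <td>.
          myhtml_tree_node_t* text_node = myhtml_node_child(cols->list[col_idx]);
          if (text_node) {
              const char* text = myhtml_node_text(text_node, nullptr);
              if (text) {
                  printf("%s: %s\n", metric_name.c_str(), text);
              }
          }
      }
      // Release the collection allocated by the lookup above.
      if (cols) {
          myhtml_collection_destroy(cols);
      }
  }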

There is only one <table> in the whole tree, so I wrote a function that takes the table's rows, a row index, and the index of the column I want from that row.

The <td> in question looks like this: <td >CLEAN TARE COMPLETE <br>( 116.0 mT )</td>

I am expecting the text variable to contain CLEAN TARE COMPLETE <br>( 116.0 mT ) or even just CLEAN TARE COMPLETE ( 116.0 mT ).

But all I'm getting is CLEAN TARE COMPLETE. How can I capture the text after the <br> tag?
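
For reference, here is a minimal sketch of what I think is happening, assuming the parser splits the <td> into three children: a text node, a <br> element, and a second text node. Reading only the first child would then stop before the <br>. The `Node` struct, the "-text" tag name, and the `make_node`, `collect_text`, and `make_sample_td` helpers are all hypothetical stand-ins written for this illustration, not part of myhtml. They only mirror the first-child / next-sibling shape that `myhtml_node_child` and `myhtml_node_next` expose.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a parsed DOM node (NOT a myhtml type).
// It mirrors the first-child / next-sibling layout of a parsed tree.
struct Node {
    std::string tag;              // "td", "br", or "-text" for a text node
    std::string text;             // only meaningful for text nodes
    std::unique_ptr<Node> child;  // first child
    std::unique_ptr<Node> next;   // next sibling
};

// Hypothetical helper: allocate a node with the given tag and text.
std::unique_ptr<Node> make_node(std::string tag, std::string text = "") {
    auto n = std::make_unique<Node>();
    n->tag = std::move(tag);
    n->text = std::move(text);
    return n;
}

// Walk every child of `node` in document order and concatenate the text
// nodes, descending into elements. A <br> has no children, so it adds nothing.
std::string collect_text(const Node& node) {
    std::string out;
    for (const Node* c = node.child.get(); c != nullptr; c = c->next.get()) {
        if (c->tag == "-text") {
            out += c->text;
        } else {
            out += collect_text(*c);
        }
    }
    return out;
}

// Assumed tree for: <td >CLEAN TARE COMPLETE <br>( 116.0 mT )</td>
std::unique_ptr<Node> make_sample_td() {
    auto td = make_node("td");
    td->child = make_node("-text", "CLEAN TARE COMPLETE ");
    td->child->next = make_node("br");
    td->child->next->next = make_node("-text", "( 116.0 mT )");
    return td;
}
```

With this model, `make_sample_td()->child->text` covers only the part before the <br>, while `collect_text(*make_sample_td())` walks all three children and joins both text nodes. If the real tree has the same shape, the equivalent loop in myhtml would presumably start from `myhtml_node_child`, advance with `myhtml_node_next`, and call `myhtml_node_text` on each node whose `myhtml_node_tag_id` is `MyHTML_TAG__TEXT`. I have not confirmed that against the library.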