romansky / dom-to-semantic-markdown

DOM to Semantic-Markdown for use with LLMs
MIT License
664 stars 15 forks source link

Allow hooks for unrecognised elements #4

Closed weyert closed 3 months ago

weyert commented 3 months ago

I would like to suggest an improvement for both markdownASTToString and htmlToMarkdownAST-functions to call a hook or callback for a unrecognised element tag name or element type.

This would allow to cover the case where some pages that heavily rely on XML namespace for custom data by implementing hooks to handle these elements.

For example:

    processUnhandledElement?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[];

    processUnhandledNodeType? (node: SemanticMarkdownAST, indentLevel: number, options: ConversionOptions): string;
romansky commented 3 months ago

Thanks for the feature request!

Added some additional facilities to support overriding as well (same area..), resulting with these new options:

- `overrideElementProcessing?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[] | undefined`: Custom processing for HTML elements.
- `processUnhandledElement?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[] | undefined`: Handler for unknown HTML elements.
- `overrideNodeRenderer?: (node: SemanticMarkdownAST, options: ConversionOptions, indentLevel: number) => string | undefined`: Custom renderer for AST nodes.
- `renderCustomNode?: (node: CustomNode, options: ConversionOptions, indentLevel: number) => string | undefined`: Renderer for custom AST nodes.

Let me know if this is what you had in mind..

weyert commented 3 months ago

Thank you that looks great! I will give a try today.

Maybe these four new options might also help with resolving a weird issue I noticed last night were when you have some content like:

<ul>
            <li>
                <p>Company - Prefix <span class="APIparametervalue">003-</span> i.e. <span class="APIparameterkey">CompanyId</span></p>
            </li>
</ul>

If I run the following:

  const pageContent = `<ul>
            <li>
                <p>Company - Prefix <span class="APIparametervalue">003-</span> i.e. <span class="APIparameterkey">CompanyId</span></p>
            </li>
</ul>`;
  const markdownResults = convertHtmlToMarkdown(pageContent, {
    extractMainContent: true,
    filePath,
    debug: true,
    overrideDOMParser: new new JSDOM().window.DOMParser(),
  });

It generates:

[
  {
    "type": "list",
    "ordered": false,
    "items": [
      {
        "type": "listItem",
        "content": [
          {
            "type": "text",
            "content": "Company - Prefix"
          },
          {
            "type": "text",
            "content": "i.e."
          },
          {
            "type": "text",
            "content": "\n\n"
          }
        ]
      }
    ]
  }
]

reflects in the Markdown as:

- Company - Prefix   i.e.

I think adding this bit of code in htmlToMarkdownAst would solve it but I think I should try if I can solve it with these new additions first.

} else if (elem.tagName.toLowerCase() === 'span') {
     debugLog("Span");
     result.push(...htmlToMarkdownAST(elem, options));
     // Add a new line after the paragraph
     result.push({type: 'text', content: '\n\n'});
romansky commented 3 months ago

Appreciate the feedback! :)

I tried to run the example you provided and it generated this: - Company - Prefix 003- i.e. CompanyId

Looking at the HTML I can't say what exactly I would expect by translating this to Markdown.. thoughts?

weyert commented 3 months ago

Yeah, I would expect something like Company - Prefix 003- i.e. CompanyId so what you are getting looks correct.

Hmm, wondering why it is not working for me. I will upgrade to the latest version. I would like to say your library is really cool. Makes it easier to convert our existing documentation site into Markdown.

romansky commented 3 months ago

That's odd, I can't think of any recent code change that might "fix" this.. but.. this was generated from the current code so 🤷

Thanks for sharing! I'm happy I open sourced it! :)