romansky / dom-to-semantic-markdown

DOM to Semantic-Markdown for use with LLMs
MIT License

Feature request: Option to show page schema from HTML #6

Closed: richardreeze closed this issue 1 month ago

richardreeze commented 1 month ago

It would be great if the tool had an option to include the page's schema (the JSON-LD structured data embedded in the HTML), the same way I can select onlyMainContent, includeHtml, etc.

Example

Here's a simple example of how I currently extract the schema from a page's HTML:

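const cheerio = require('cheerio');

// Returns the first JSON-LD block that parses cleanly, or null if none does.
async function getSchemaData(html) {
    const $ = cheerio.load(html);
    const schemaScripts = $('script[type="application/ld+json"]');
    let schemaData = null;

    schemaScripts.each((index, element) => {
        try {
            const jsonContent = $(element).html();
            schemaData = JSON.parse(jsonContent);
            return false; // stop iterating after the first successful parse
        } catch (error) {
            // Malformed block: log it and move on to the next script tag.
            console.error("Error parsing JSON:", error);
        }
    });

    return schemaData;
}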

It usually looks like this:

{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "@id": "https://mentalmathpro.com",
  "description": "Master mental math and become a human calculator with ease! Our user-friendly website offers 14 and lots of practice to help you.",
  "name": "Mental Math Pro",
  "url": "https://mentalmathpro.com"
}
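
Not every page exposes a single top-level object like the one above. A JSON-LD script may also contain an array of objects, or wrap its nodes in an "@graph" array. Below is a minimal sketch of normalizing those shapes into a flat list; `jsonLdNodes` is a hypothetical helper name, not part of cheerio or of this tool.

```javascript
// Hypothetical helper: turn a parsed JSON-LD value into a flat array of nodes.
// Handles a single object, an array of objects, and an object with "@graph".
function jsonLdNodes(parsed) {
  if (Array.isArray(parsed)) {
    // An array may mix plain nodes and "@graph" wrappers, so recurse.
    return parsed.flatMap(jsonLdNodes);
  }
  if (parsed !== null && typeof parsed === 'object') {
    return Array.isArray(parsed['@graph']) ? parsed['@graph'] : [parsed];
  }
  // Anything else (string, number, null) carries no nodes.
  return [];
}
```

The result of `JSON.parse` in the function above could be passed straight through this helper, so callers always receive an array regardless of how the page structured its data.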

Benefits

This would provide additional information that could be valuable for various use cases (like SEO analysis and structured data extraction).

romansky commented 1 month ago

Thanks for creating the issue, @richardreeze!

This would be a great addition to the tool. The meta-data has semantic value and would definitely benefit some use-cases (like search engines).

Researching this, I think it makes sense to include all the meta-data level parts in the output, like:

Since some of the data (especially social-network tags) might overlap with other meta-data, I'm considering two levels under an "includeMetaData" option:

  1. a basic level that includes meta-tags
  2. an extended level that includes social tags, JSON-LD data, and possibly other formats (like RDFa) down the road
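
The two levels could be sketched roughly as follows. This is only an illustration of the idea, not the tool's implementation: the function name, the level strings 'basic' and 'extended', and the shape of the `collected` object are all assumptions.

```javascript
// Hypothetical sketch of a two-level metadata option.
// `collected` is assumed to hold whatever was already scraped from the page:
//   { title, meta: {...}, social: {...}, jsonLd: {...} }
function selectMetaData(collected, level) {
  // Option not set: emit no metadata block at all.
  if (!level) return null;

  // Basic level: the title plus plain meta tags.
  const result = { title: collected.title, ...collected.meta };

  if (level === 'extended') {
    // Extended level: add social tags and JSON-LD under their own keys,
    // so they stay separate from any overlapping basic meta tags.
    if (collected.social) result.social = collected.social;
    if (collected.jsonLd) result.schema = collected.jsonLd;
  }
  return result;
}
```

Keeping the social and JSON-LD data under separate keys is one way to deal with the overlap mentioned above: nothing is merged or deduplicated, the reader just sees where each value came from.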

As for the output format, I would prefer to avoid pure JSON (though we could add a flag to choose the formatting of the meta-data block). I prefer YAML for this, so something along the lines of:

---
title: "Page Title"
description: "Page description here."
schema:
  "@context": "https://schema.org"
  "@type": "WebPage"
  # ... rest of the schema
---

# H1 tag ...
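
For illustration, a block like the one above could be produced along these lines. This is a rough sketch under stated assumptions: `toYaml` and `withFrontMatter` are hypothetical names, and the serializer only handles nested plain objects with scalar values (no arrays, no empty objects). It relies on JSON double-quoted strings also being valid YAML.

```javascript
// Hypothetical sketch: render a metadata object as YAML front matter.
// Assumes nested, non-empty plain objects with scalar leaves; arrays are not handled.
function toYaml(value, indent = 0) {
  const pad = '  '.repeat(indent);
  return Object.entries(value)
    .map(([key, val]) => {
      // Quote keys such as "@context" that are not plain identifiers.
      const k = /^[A-Za-z_][A-Za-z0-9_-]*$/.test(key) ? key : JSON.stringify(key);
      if (val !== null && typeof val === 'object') {
        return `${pad}${k}:\n${toYaml(val, indent + 1)}`;
      }
      // JSON.stringify gives a double-quoted, escaped scalar.
      return `${pad}${k}: ${JSON.stringify(val)}`;
    })
    .join('\n');
}

// Prepend the front matter block to the converted markdown.
function withFrontMatter(meta, markdown) {
  return `---\n${toYaml(meta)}\n---\n\n${markdown}`;
}
```

A real implementation would more likely use an established YAML library to cover arrays, multi-line strings, and other edge cases.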

Thoughts?

richardreeze commented 1 month ago

This would be great! And yes, much cleaner than what I suggested.

I hope this becomes a feature! 🙂

romansky commented 1 month ago

@richardreeze hot from the compiler:

> npx d2m -u https://example.com -meta standard
---
title: "Example Domain"
---

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.

 [More information...](https://www.iana.org/domains/example)