mysticmind / reversemarkdown-net

ReverseMarkdown.Net is an HTML to Markdown converter library in C#. Conversion is very reliable since the HtmlAgilityPack (HAP) library is used for traversing the HTML DOM.
MIT License

Odd conversion from HTML <a> with embedded <div>s to Markdown #389

Closed kirk-marple closed 4 months ago

kirk-marple commented 6 months ago

Just started using the package, and it's been a great benefit.

I noticed this odd conversion from a complex <a> element to Markdown, which I believe is a bug.

The HTML, from OpenAI's website, has an <img> and <h3> inside <div>s, inside the <a>.

This is found here: https://openai.com/blog/introducing-openai


HTML:

<a href="/research/weak-to-strong-generalization" 
   class="ui-link group relative cursor-pointer" 
   aria-label="Weak-to-strong generalization" 
   id="238" 
   type="research-publications">
    <div class="">
        <div class="">
            <div class="">
                <img src="https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=500&amp;height=500"
                     width="1025" 
                     height="1024" 
                     alt="Weak To Strong Generalization" 
                     loading="lazy" 
                     data-nuxt-img="" 
                     sizes="(max-width: 744px) 100vw, (max-width: 1280px) 50vw, 500px" 
                     srcset="https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=400 400w, 
                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=800 800w, 
                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=1000 1000w, 
                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=1400 1400w, 
                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=2000 2000w, 
                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=2600 2600w, 
                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=3200 3200w"
                     aria-hidden="false" 
                     class="w-full">
            </div>
            <!-- Empty div placeholder -->
        </div>
    </div>
    <div class="">
        <h3 id="post3title" 
            class="f-subhead-2 mt-8 decoration-1 underline-offset-1 underline-transparent group-hover:underline-text-primary">
            Weak-to-strong generalization
        </h3>
        <!-- Empty div placeholder -->
        <!-- Empty div placeholder -->
        <div class="f-body-1 mt-4">
            <span aria-hidden="true">Dec 14, 2023</span>
        </div>
        <!-- Empty div placeholder -->
        <!-- Empty div placeholder -->
        <!-- Empty div placeholder -->
    </div>
</a>

Markdown:

- [!\[Weak To Strong Generalization\](https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=500&amp;height=500)
### Weak-to-strong generalization
Dec 14, 2023](/research/weak-to-strong-generalization)

FWIW, ChatGPT suggested this as the Markdown conversion, which doesn't seem right either, since the image sits above the text in the page.

[**Weak-to-strong generalization**](/research/weak-to-strong-generalization)

![Weak To Strong Generalization](https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg)

_Dec 14, 2023_

mysticmind commented 6 months ago

Acknowledged; I will take a look.

mysticmind commented 6 months ago

This is a somewhat unusual and tricky use case. I will see what can best be done.

mysticmind commented 4 months ago

If you pre-process the HTML with HtmlAgilityPack (which the library also uses) to simplify it, the conversion to Markdown works much better. See the example below.

var html = "<a href=\"/research/weak-to-strong-generalization\" \n   class=\"ui-link group relative cursor-pointer\" \n   aria-label=\"Weak-to-strong generalization\" \n   id=\"238\" \n   type=\"research-publications\">\n    <div class=\"\">\n        <div class=\"\">\n            <div class=\"\">\n                <img src=\"https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=500&amp;height=500\"\n                     width=\"1025\" \n                     height=\"1024\" \n                     alt=\"Weak To Strong Generalization\" \n                     loading=\"lazy\" \n                     data-nuxt-img=\"\" \n                     sizes=\"(max-width: 744px) 100vw, (max-width: 1280px) 50vw, 500px\" \n                     srcset=\"https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=400 400w, \n                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=800 800w, \n                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=1000 1000w, \n                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=1400 1400w, \n                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=2000 2000w, \n                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=2600 2600w, \n                             https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&amp;width=3200 3200w\"\n                     aria-hidden=\"false\" \n                     class=\"w-full\">\n            </div>\n            <!-- Empty div placeholder -->\n        </div>\n    </div>\n    <div class=\"\">\n        <h3 id=\"post3title\" \n            class=\"f-subhead-2 mt-8 decoration-1 underline-offset-1 underline-transparent group-hover:underline-text-primary\">\n            Weak-to-strong generalization\n        </h3>\n        <!-- Empty div placeholder -->\n        <!-- Empty div placeholder -->\n        <div class=\"f-body-1 mt-4\">\n            <span aria-hidden=\"true\">Dec 14, 2023</span>\n        </div>\n        <!-- Empty div placeholder -->\n        <!-- Empty div placeholder -->\n        <!-- Empty div placeholder -->\n    </div>\n</a>";

var config = new Config
{
    GithubFlavored = true,
};

html = Cleaner.PreTidy(html, config.RemoveComments);

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// fetch the anchor node
var anchorNode = doc.DocumentNode.ChildNodes.FindFirst("a");
// get the h3 text
var h3Text = anchorNode.ChildNodes.FindFirst("h3").InnerText.Trim();
// remove all children of anchor tag
anchorNode.RemoveAllChildren();
// set the anchor text as the h3 text
anchorNode.InnerHtml = h3Text;
var massagedHtml = anchorNode.OuterHtml;

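// pass the massaged HTML to the converter
var converter = new ReverseMarkdown.Converter(config);
var markdown = converter.Convert(massagedHtml);

// output: [Weak-to-strong generalization](/research/weak-to-strong-generalization)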

I am closing this with pre-processing as the recommended solution, since there is too much variance in markup like this to handle in a standard conversion routine.
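
For pages with many such "card" links, the same pre-processing idea can be applied to every matching anchor rather than one. The following is only a minimal sketch under stated assumptions: `AnchorFlattener` and `FlattenAnchors` are hypothetical names that are not part of ReverseMarkdown, and the rule "an anchor that wraps a `<div>` or a heading should be reduced to its first heading's text" is an assumption that will not suit every site. It uses only HtmlAgilityPack's XPath selection and `InnerHtml`.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; it is not part of ReverseMarkdown.
public static class AnchorFlattener
{
    // Assumption: "card" anchors are those that wrap a <div> or a heading.
    private const string CardAnchors =
        "//a[.//div or .//h1 or .//h2 or .//h3 or .//h4 or .//h5 or .//h6]";

    // Headings that may supply the link text for a card anchor.
    private const string Headings =
        ".//h1 | .//h2 | .//h3 | .//h4 | .//h5 | .//h6";

    public static string FlattenAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes(CardAnchors);
        if (anchors == null)
        {
            return html;
        }

        foreach (var anchor in anchors.ToList())
        {
            var heading = anchor.SelectSingleNode(Headings);
            if (heading == null)
            {
                // No heading to use as link text; leave this anchor untouched.
                continue;
            }

            // Same steps as the single-anchor example: keep only the heading text.
            var text = heading.InnerText.Trim();
            anchor.RemoveAllChildren();
            anchor.InnerHtml = text;
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The flattened HTML can then be passed to the converter as before, for example `new ReverseMarkdown.Converter(config).Convert(AnchorFlattener.FlattenAnchors(html))`. This drops the image and the date, so a site that needs them would have to extract those nodes separately before flattening.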