Closed kirk-marple closed 4 months ago
Acknowledged; I will take a look.
This is a somewhat unusual and tricky use case. I will see what can best be done.
If you pre-process the HTML with HtmlAgilityPack (which the library also uses), you can convert it to Markdown more effectively. See the example below.
var html = "<a href=\"/research/weak-to-strong-generalization\" \n class=\"ui-link group relative cursor-pointer\" \n aria-label=\"Weak-to-strong generalization\" \n id=\"238\" \n type=\"research-publications\">\n <div class=\"\">\n <div class=\"\">\n <div class=\"\">\n <img src=\"https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=500&height=500\"\n width=\"1025\" \n height=\"1024\" \n alt=\"Weak To Strong Generalization\" \n loading=\"lazy\" \n data-nuxt-img=\"\" \n sizes=\"(max-width: 744px) 100vw, (max-width: 1280px) 50vw, 500px\" \n srcset=\"https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=400 400w, \n https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=800 800w, \n https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=1000 1000w, \n https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=1400 1400w, \n https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=2000 2000w, \n https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=2600 2600w, \n https://images.openai.com/blob/3ae4eaf0-e103-445d-8974-12da0a9934c0/weak-to-strong-generalization.jpg?trim=0,345,0,310&width=3200 3200w\"\n aria-hidden=\"false\" \n class=\"w-full\">\n </div>\n <!-- Empty div placeholder -->\n </div>\n </div>\n <div class=\"\">\n <h3 id=\"post3title\" \n class=\"f-subhead-2 mt-8 decoration-1 underline-offset-1 underline-transparent group-hover:underline-text-primary\">\n Weak-to-strong generalization\n </h3>\n <!-- Empty div placeholder -->\n <!-- Empty div placeholder -->\n <div class=\"f-body-1 mt-4\">\n <span aria-hidden=\"true\">Dec 14, 2023</span>\n </div>\n <!-- Empty div placeholder -->\n <!-- Empty div placeholder -->\n <!-- Empty div placeholder -->\n </div>\n</a>";
var config = new Config
{
    GithubFlavored = true,
};
html = Cleaner.PreTidy(html, config.RemoveComments);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// fetch the anchor node
var anchorNode = doc.DocumentNode.ChildNodes.FindFirst("a");
// get the h3 text
var h3Text = anchorNode.ChildNodes.FindFirst("h3").InnerText.Trim();
// remove all children of anchor tag
anchorNode.RemoveAllChildren();
// set the anchor text as the h3 text
anchorNode.InnerHtml = h3Text;
var massagedHtml = anchorNode.OuterHtml;
// pass the above to converter.
// output: [Weak-to-strong generalization](/research/weak-to-strong-generalization)
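For anyone who needs this on more than one link, here is a rough sketch of the same idea wrapped in a reusable helper, followed by the actual converter call. AnchorFlattener and FlattenAnchors are hypothetical names made up for this illustration, not part of ReverseMarkdown or HtmlAgilityPack, and the rule "use the first heading as the link text" is an assumption that suits this page but may not suit others.

```csharp
using System;
using HtmlAgilityPack;
using ReverseMarkdown;

// Hypothetical helper (not part of either library): for every <a> that
// contains a heading, replace the anchor's children with that heading's text.
public static class AnchorFlattener
{
    public static string FlattenAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes(
            "//a[.//h1 or .//h2 or .//h3 or .//h4 or .//h5 or .//h6]");
        if (anchors == null)
        {
            return html;
        }

        foreach (var anchor in anchors)
        {
            // First heading inside the anchor, in document order.
            var heading = anchor.SelectSingleNode(
                ".//h1|.//h2|.//h3|.//h4|.//h5|.//h6");
            var text = heading.InnerText.Trim();

            // Drop the nested <div>/<img> markup and keep only the heading text.
            anchor.RemoveAllChildren();
            anchor.InnerHtml = text;
        }

        return doc.DocumentNode.OuterHtml;
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened stand-in for the markup in this issue.
        var html = "<a href=\"/research/weak-to-strong-generalization\">"
                 + "<div><img src=\"cover.jpg\" alt=\"cover\"></div>"
                 + "<div><h3>Weak-to-strong generalization</h3></div>"
                 + "</a>";

        var config = new Config { GithubFlavored = true };
        var converter = new Converter(config);

        // Massage the HTML first, then hand it to the converter.
        var markdown = converter.Convert(AnchorFlattener.FlattenAnchors(html));
        Console.WriteLine(markdown);
    }
}
```

Note that this drops the image entirely. If the image matters, it would have to be pulled out of the anchor and emitted separately before flattening.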
I am closing this with pre-processing as the suggested solution, since there is too much variance in markup like this to handle in a standard conversion routine.
Just started using the package, and it's been a great benefit.
I noticed this odd conversion from a complex <a> element to Markdown, which I believe is a bug. The HTML, from OpenAI's website, has an <img> and <h3> inside <div>s, inside the <a>. This is found here: https://openai.com/blog/introducing-openai
HTML:
Markdown:
FWIW, ChatGPT suggested this as the Markdown conversion, which doesn't seem right either, since the image sits above the text in the page.