suntong / html2md

HTML to Markdown converter
MIT License

Fail to extract some images #19

Closed: 097115 closed this issue 7 months ago

097115 commented 8 months ago

(I apologise in advance, this is probably more suitable for a discussion, but since those aren't enabled, I'm posting it here.)

Consider this page, for example: https://www.lrt.lt/en/news-in-english/19/2159940/lithuania-plans-to-deport-kazakh-activist-despite-calls-by-meps

The main content seems to be wrapped in article.article-block, but the images inside the article (not the "header" image) somehow fail to be captured -- the command below returns only the images' credits (the text below the photos), but not the images themselves:

curl -s https://www.lrt.lt/en/news-in-english/19/2159940/lithuania-plans-to-deport-kazakh-activist-despite-calls-by-meps | html2md -s 'article.article-block' -i 

So, I tried adding div.media-block__container, div.media-block__wrapper or img.media-block__image but with no success :)

But maybe you could advise something? Any hint?

Thanks :)

097115 commented 7 months ago

A follow-up: the images in question weren't being picked up because their src attribute wasn't present (the images were lazy-loaded).

Some workarounds are mentioned here: https://github.com/JohannesKaufmann/html-to-markdown/issues/25
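
For anyone landing here with the same problem, one possible pre-processing step (not taken from the linked issue, and not a feature of html2md itself) is to copy the lazy-loading attribute into src before the HTML reaches the converter. The sketch below assumes the real image URL lives in a data-src attribute; that attribute name is a guess and may differ on a given site, so check the page source first. The function name promote_lazy_src is hypothetical.

```python
import re
import sys

# Assumption: lazy-loaded images keep their real URL in `data-src`.
# The attribute name is site-specific; inspect the page source to confirm.
LAZY_ATTR = "data-src"

# Matches a whole <img ...> tag. Limitation: an attribute value that
# contains ">" would cut the match short.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
HAS_SRC = re.compile(r"\ssrc\s*=", re.IGNORECASE)


def promote_lazy_src(html: str, lazy_attr: str = LAZY_ATTR) -> str:
    """Add a src attribute, copied from lazy_attr, to <img> tags lacking one."""
    lazy = re.compile(
        r"\s" + re.escape(lazy_attr) + r"\s*=\s*(\"[^\"]*\"|'[^']*')",
        re.IGNORECASE,
    )

    def fix(match: re.Match) -> str:
        tag = match.group(0)
        if HAS_SRC.search(tag):
            return tag  # already has a real src, leave it alone
        found = lazy.search(tag)
        if not found:
            return tag  # no lazy attribute to promote
        # Insert src right after the leading "<img" (4 characters).
        return tag[:4] + " src=" + found.group(1) + tag[4:]

    return IMG_TAG.sub(fix, html)


if __name__ == "__main__":
    # Filter mode: read HTML on stdin, write the patched HTML to stdout.
    sys.stdout.write(promote_lazy_src(sys.stdin.read()))
```

Saved as, say, promote_lazy_src.py (a hypothetical filename), it could sit between curl and html2md in the original pipeline: curl -s URL | python3 promote_lazy_src.py | html2md -s 'article.article-block' -i. A regex is used only to keep the sketch dependency-free; a proper HTML parser would be more robust on messy markup.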