ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Not all images extracted from an HTML page #206

Open thomasegense opened 5 years ago

thomasegense commented 5 years ago

When an HTML page is parsed, images are extracted into the field "links_images". Only images defined with <img src="" ...> are extracted this way. Many modern web pages also define images in the style attribute of an HTML tag, mostly on div tags. So this just requires more parsing of HTML tags.

Examples:

a tag (with href): style="background:url(img/homeico.png) no-repeat ; width:90px"

div tag: style="background-image:url('https://www.proscenium.dk/wp-content/uploads/2018/12/StatensKunstfond-PR-300x169.jpg');"
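
To make the request concrete, here is a minimal sketch of the extra parsing step, written in Python with only the standard library. It is not the webarchive-discovery implementation. The names `StyleImageExtractor` and `extract_style_images` are hypothetical, and the `href="#"` in the sample is a placeholder. It collects every `url(...)` value found in an inline style attribute and leaves relative URLs unresolved.

```python
import re
from html.parser import HTMLParser

# Matches CSS url(...) values, with optional single or double quotes.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


class StyleImageExtractor(HTMLParser):
    """Hypothetical helper: collects url(...) targets from inline style attributes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "style" and value:
                self.images.extend(m.group(2) for m in CSS_URL.finditer(value))


def extract_style_images(html):
    """Return all url(...) values found in style attributes, in document order."""
    parser = StyleImageExtractor()
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    # href="#" is a placeholder; the style values are the two examples above.
    sample = (
        '<a href="#" style="background:url(img/homeico.png) no-repeat ; width:90px">home</a>'
        "<div style=\"background-image:url('https://www.proscenium.dk/wp-content/"
        "uploads/2018/12/StatensKunstfond-PR-300x169.jpg');\"></div>"
    )
    for url in extract_style_images(sample):
        print(url)
```

A real indexer would still need to resolve relative URLs such as img/homeico.png against the page URL. It might also filter out `url(...)` values that are not images, since the regex does not check the file type.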

Images can also be defined in CSS or