netarchivesuite / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Index content-links #3

Closed tokee closed 7 years ago

tokee commented 7 years ago

Links to JavaScript, CSS and inline images should be indexed on par with outgoing (a href) links. This would allow statistics on resource use to be calculated efficiently.

thomasegense commented 7 years ago

Implementing images in a new Solr field. CSS/JavaScript links are too fuzzy to extract (@import, recursive references, inline CSS/JS in HTML pages).

thomasegense commented 7 years ago

Image links implemented, with a new Solr field and a new config property to activate it.
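
For readers who want a feel for what collecting image links from a page involves, here is a minimal sketch in Python using only the standard library. It is not the project's actual implementation; the names `ImageLinkExtractor` and `extract_image_links` are hypothetical, and it only looks at `<img src>` attributes, leaving out the CSS and JavaScript references that were judged too fuzzy above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkExtractor(HTMLParser):
    """Hypothetical helper: collects absolute URLs from <img src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative references against the page URL.
                self.image_links.append(urljoin(self.base_url, value.strip()))


def extract_image_links(html, base_url):
    """Return the image URLs found in `html`, in document order."""
    parser = ImageLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_links
```

The returned list could then be written to a multi-valued index field next to the outgoing links, which is what makes per-resource statistics cheap to compute at query time.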