ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Extract links_videos (and links_sounds?) #287

Open tokee opened 2 years ago

tokee commented 2 years ago

The links_images field is very useful for reverse image search and for showing thumbnails as part of a search result. Similar links_videos and maybe links_sounds fields would have equal benefits.

Unfortunately, videos are messy to extract, as embedding them has historically been hacked in different ways. Using iframe was popular at one point:

<iframe width="560" height="315" src="http://example.com/43j5jtfrh398" frameborder="0" >
</iframe>

The problem here is that the only indication of whether the iframe contains a video rather than an image, an HTML page or something else is the URL itself, and that is in no way guaranteed to have a usable extension. Some ideas:

  1. Only populate links_videos with "guaranteed" videos, i.e. those with known video extensions
  2. Index all iframe#src and move the resolve logic to the GUI, first extracting all the URLs, then requesting their content_type_norm-field
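
As a rough illustration of idea 1, an extension check could look something like the sketch below. The extension list and the function name `has_video_extension` are assumptions made for this example and are not taken from webarchive-discovery. Note that the iframe URL from the example above would not qualify, which is exactly the limitation of this approach.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of "known" video extensions; an assumption for this sketch.
VIDEO_EXTENSIONS = {".mp4", ".webm", ".ogv", ".mov", ".avi", ".flv", ".mkv"}


def has_video_extension(url: str) -> bool:
    """Return True if the path part of the URL ends in a known video extension."""
    path = urlparse(url).path  # drops query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in VIDEO_EXTENSIONS


if __name__ == "__main__":
    for candidate in ("http://example.com/43j5jtfrh398",
                      "http://example.com/clip.MP4?start=10"):
        print(candidate, has_video_extension(candidate))
```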

If method 2 is used, it might be better to have a field links_resources with all inlined resources (except images). That would also catch sounds and make it possible to e.g. check if a page was iframed from somewhere.
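
A minimal sketch of how such a links_resources list could be collected, using only the Python standard library. The tag-to-attribute map (`iframe`, `embed`, `object`) and the names `ResourceLinkParser` and `extract_links_resources` are assumptions for illustration; the actual indexer may well do this differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed map of inlining tags to the attribute holding the resource URL.
# Images are deliberately left out, since links_images already covers them.
RESOURCE_ATTRS = {"iframe": "src", "embed": "src", "object": "data"}


class ResourceLinkParser(HTMLParser):
    """Collects absolute URLs of inlined, non-image resources."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links_resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative references against the page URL.
            self.links_resources.append(urljoin(self.base_url, value))


def extract_links_resources(html: str, base_url: str) -> list:
    parser = ResourceLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links_resources
```

The resolve step from method 2 (looking up content_type_norm for each collected URL) is left out here, since it depends on how the search front end queries the index.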

thomasegense commented 2 years ago

Also, let's not forget the simple <video> and <audio> tags.

And while we are at it, maybe also add the <iframe> tag.
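
For the <video> and <audio> tags, a rough sketch of the extraction could look like the following. It reads the `src` attribute on the tag itself as well as on nested <source> children, and ignores <source> elements outside of a media element (such as those inside <picture>). The class name `MediaTagParser` is hypothetical, and the attribute names `links_videos` and `links_sounds` simply mirror the field names proposed in this issue.

```python
from html.parser import HTMLParser


class MediaTagParser(HTMLParser):
    """Collects src URLs from <video>/<audio> tags and their <source> children."""

    def __init__(self):
        super().__init__()
        self.links_videos = []
        self.links_sounds = []
        self._open_media = None  # "video" or "audio" while inside such an element

    def _target(self, tag):
        return self.links_videos if tag == "video" else self.links_sounds

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag in ("video", "audio"):
            self._open_media = tag
            if src:
                self._target(tag).append(src)
        elif tag == "source" and self._open_media and src:
            # A <source> child belongs to the enclosing media element.
            self._target(self._open_media).append(src)

    def handle_endtag(self, tag):
        if tag == self._open_media:
            self._open_media = None


if __name__ == "__main__":
    parser = MediaTagParser()
    parser.feed('<video controls><source src="a.webm"></video>'
                '<audio src="b.ogg"></audio>')
    print(parser.links_videos, parser.links_sounds)
```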