Extract links_videos (and links_sounds?)

The links_images field is very useful for reverse image search and for showing thumbnails as part of a search result. Similar links_videos and maybe links_sounds fields would have comparable benefits.

Unfortunately, video is messy to extract, as embedding has historically been hacked together in different ways. Using an iframe was popular at one point:

<iframe width="560" height="315" src="http://example.com/43j5jtfrh398" frameborder="0" >
</iframe>

The problem here is that the only indication of whether the iframe contains a video rather than an image, an HTML page or something else is the URL itself, and that is in no way guaranteed to have a usable extension. Some ideas:

1. Only populate links_videos with "guaranteed" videos, i.e. those with known video extensions
2. Index all iframe#src values and move the resolve logic to the GUI, which first extracts all the URLs and then requests the content_type_norm field for each
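
As a rough illustration of the first idea, here is a minimal Python sketch that collects iframe src values and keeps only those with a known video extension. It is not how webarchive-discovery does or would do it; the extension list and the names IframeSrcCollector, has_video_extension and split_iframe_links are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath

# Assumed, non-exhaustive list of "known" video extensions.
VIDEO_EXTENSIONS = {".mp4", ".webm", ".ogv", ".mov", ".avi", ".flv", ".mkv"}


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every iframe start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def has_video_extension(url):
    """True if the URL path (query string ignored) ends in a known video extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in VIDEO_EXTENSIONS


def split_iframe_links(html):
    """Return (videos, unresolved): iframe URLs with and without a video extension."""
    collector = IframeSrcCollector()
    collector.feed(html)
    videos = [u for u in collector.sources if has_video_extension(u)]
    unresolved = [u for u in collector.sources if not has_video_extension(u)]
    return videos, unresolved
```

With the iframe from the example above, the URL ends up in the unresolved list, which is exactly the weakness of this approach: such URLs would never reach links_videos unless something like the second idea resolves their content type later.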

If method 2 is used, it might be better to have a field links_resources with all inlined resources (except images). That would also catch sounds and make it possible to e.g. check if a page was iframed from somewhere.
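
A links_resources field along those lines could be populated roughly as in the Python sketch below. The set of tags and attributes is an assumption about what counts as an "inlined resource", and InlinedResourceCollector and extract_links_resources are hypothetical names; images are left out on purpose since links_images already covers them.

```python
from html.parser import HTMLParser

# Assumed mapping from embedding tag to the attribute holding the resource URL.
# img is deliberately absent: images already go to links_images.
RESOURCE_ATTRIBUTES = {
    "iframe": "src",
    "video": "src",
    "audio": "src",
    "source": "src",
    "embed": "src",
    "object": "data",
}


class InlinedResourceCollector(HTMLParser):
    """Collects URLs of inlined non-image resources, in document order, without duplicates."""

    def __init__(self):
        super().__init__()
        self.links_resources = []

    def handle_starttag(self, tag, attrs):
        attribute = RESOURCE_ATTRIBUTES.get(tag)
        if attribute is None:
            return
        url = dict(attrs).get(attribute)
        if url and url not in self.links_resources:
            self.links_resources.append(url)


def extract_links_resources(html):
    """Return the list of inlined resource URLs found in the given HTML."""
    collector = InlinedResourceCollector()
    collector.feed(html)
    return collector.links_resources
```

The resulting URLs are unresolved, so telling videos from sounds or plain HTML pages would still require looking up content_type_norm for each URL, as described for method 2.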

ukwa / webarchive-discovery

Extract links_videos (and links_sounds?) #287