Concurrently run Accessibility Extractor & Splash

MRuecklCC commented 2 years ago

Currently, MetaLookup first forwards the received URL to Splash (to get the HAR and HTML content). Depending on the site, this can take 10-30 seconds. Only after this has completed are the extractors called concurrently to extract information from the complete HAR, URL, and HTML content structure.

One of these extractors is the Accessibility extractor, which itself regularly takes more than 20s. However, this extractor does not need the HAR/HTML content, meaning we can run it concurrently with Splash the moment a request comes in. This could cut total response times from ~40s to ~20s in some cases!
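To illustrate the idea, here is a minimal asyncio sketch, not the actual MetaLookup code. All function names (`fetch_splash`, `accessibility`, `content_extractors`, `handle_request`) and the sleep durations are hypothetical stand-ins. The accessibility call only needs the URL, so it is started together with the Splash request instead of after it.

```python
import asyncio
import time

# Hypothetical stand-ins: names and delays are illustrative only,
# they are not MetaLookup's real functions or timings.
async def fetch_splash(url: str) -> dict:
    await asyncio.sleep(0.2)  # stands in for the slow Splash round trip
    return {"url": url, "html": "<html></html>", "har": {}}

async def accessibility(url: str) -> str:
    await asyncio.sleep(0.2)  # stands in for the slow accessibility check
    return f"accessibility({url})"

async def content_extractors(content: dict) -> list:
    await asyncio.sleep(0.05)  # extractors that need the HAR/HTML content
    return ["extractor_a", "extractor_b"]

async def splash_then_extract(url: str) -> list:
    # These two steps stay sequential: the extractors depend on Splash output.
    content = await fetch_splash(url)
    return await content_extractors(content)

async def handle_request(url: str) -> dict:
    # The accessibility check needs only the URL, so it runs alongside Splash.
    accessibility_result, extractor_results = await asyncio.gather(
        accessibility(url),
        splash_then_extract(url),
    )
    return {"accessibility": accessibility_result, "extractors": extractor_results}

def timed_run(url: str):
    start = time.perf_counter()
    result = asyncio.run(handle_request(url))
    return result, time.perf_counter() - start
```

With these made-up delays, a fully sequential run would take about 0.45s, while the concurrent version is bounded by the slower of the two branches (about 0.25s).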

MRuecklCC commented 2 years ago

I did a couple of tests with a prototype implementation and measured response times for the following URLs on the current main branch vs. the prototype:

https://www.br.de/mediathek/podcast/radiowissen/1968-das-ausnahmejahr/467076 from 24s to 18s
https://learningapps.org/788879 from 24s to 14s
https://amazon.de from 500 error (splash timeout) to 31s
https://google.de from 23s to 14.5s
https://spiegel-online.de from 35s to 68s

As one can see, running the two concurrently can cut up to 10s off the response time (google and learningapps). For sites with a very large DOM and many external resources, the prototype implementation was too naive and recomputed some things multiple times (html.lower() etc.), which would explain the slowdown for spiegel-online.de. I assume that even for those cases a non-trivial implementation would achieve a ~10s response time improvement!
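One way to avoid recomputing things like html.lower() is to compute them once per request and share the result. The sketch below assumes a hypothetical `PageContent` wrapper and `count_tags` helper (neither is taken from the MetaLookup code base) and uses `functools.cached_property` so the lowercased HTML is built only on first access.

```python
from functools import cached_property

class PageContent:
    """Hypothetical per-request container shared by all extractors."""

    def __init__(self, html: str):
        self.html = html
        self.lower_calls = 0  # only here to show how often lower() really runs

    @cached_property
    def html_lower(self) -> str:
        # Computed on first access, then stored on the instance and reused.
        self.lower_calls += 1
        return self.html.lower()

def count_tags(content: PageContent, tag: str) -> int:
    # Illustrative extractor step: reads the cached lowercase HTML
    # instead of calling content.html.lower() itself.
    return content.html_lower.count(f"<{tag}")
```

However many extractors call `count_tags` on the same `PageContent`, the potentially large HTML string is lowercased only once.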

MRuecklCC commented 2 years ago

Another, very hacky prototype that avoided the expensive re-evaluations managed to bring the response time for spiegel-online.de from 35s down to 20s.

openeduhub / metalookup

Concurrently run Accessibility Extractor & Splash #149