openeduhub / metalookup

Provide metadata about domains w.r.t. accessibility, licensing, ads, etc.
GNU General Public License v3.0

Concurrently run Accessibility Extractor & Splash #149

Closed MRuecklCC closed 2 years ago

MRuecklCC commented 2 years ago

Currently, MetaLookup first forwards the received URL to Splash (to get the HAR and HTML content). Depending on the site, this can take 10-30 seconds. Only after this has completed are the extractors called concurrently to extract the information from the complete HAR, URL, and HTML content structure.

One of these extractors is the Accessibility extractor, which itself regularly takes more than 20s. However, this extractor does not need the HAR or HTML content, meaning we can run it concurrently with Splash the moment a request comes in. This could cut total response times from ~40s to ~20s in some cases!
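The idea above can be sketched with `asyncio` roughly as follows. This is only an illustration, not the actual MetaLookup code: `fetch_splash`, `run_accessibility`, `run_content_extractors`, and `lookup` are hypothetical names, and the `asyncio.sleep` calls merely stand in for the real network round trips.

```python
import asyncio


async def fetch_splash(url: str) -> dict:
    # Hypothetical stand-in for the Splash request (HAR + HTML content).
    await asyncio.sleep(0.3)
    return {"url": url, "html": "<html></html>", "har": {}}


async def run_accessibility(url: str) -> dict:
    # Hypothetical stand-in for the Accessibility extractor; it only needs the URL.
    await asyncio.sleep(0.3)
    return {"accessibility_score": 0.9}


async def run_content_extractors(splash_result: dict) -> dict:
    # Hypothetical stand-in for the extractors that need the HAR/HTML content.
    return {"html_length": len(splash_result["html"])}


async def splash_pipeline(url: str) -> dict:
    # These two steps stay sequential: the extractors depend on the Splash output.
    splash_result = await fetch_splash(url)
    return await run_content_extractors(splash_result)


async def lookup(url: str) -> dict:
    # Run the Splash pipeline and the Accessibility extractor side by side,
    # so the total time is roughly the slower of the two, not their sum.
    content, accessibility = await asyncio.gather(
        splash_pipeline(url),
        run_accessibility(url),
    )
    return {**content, **accessibility}


if __name__ == "__main__":
    print(asyncio.run(lookup("https://example.org")))
```

With the stand-in delays used here, the sequential variant would wait for both sleeps back to back, while the `gather` variant waits for them in parallel.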

MRuecklCC commented 2 years ago

I did a couple of tests with a prototype implementation and measured response times for the following URLs, comparing the current main branch against the prototype.

As one can see, running the two concurrently can save up to 10s of response time (google and learningapps). For sites with a very large DOM and many external resources, the prototype implementation was too naive and recomputed some things multiple times (html.lower() etc.). I assume that even for those cases a non-trivial implementation would achieve a ~10s response time improvement!
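One possible way to avoid recomputing things like `html.lower()` for every extractor is to cache the derived value on a shared object. The following is a minimal sketch under that assumption; `PageContent`, `mentions_cookies`, and `mentions_paywall` are hypothetical names and not part of the MetaLookup code base.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class PageContent:
    # Hypothetical container passed to all extractors for one request.
    html: str

    @cached_property
    def html_lower(self) -> str:
        # Computed on first access only; later accesses reuse the stored string.
        return self.html.lower()


def mentions_cookies(page: PageContent) -> bool:
    # Hypothetical extractor that needs case-insensitive matching.
    return "cookie" in page.html_lower


def mentions_paywall(page: PageContent) -> bool:
    # A second hypothetical extractor sharing the same cached lowercase copy.
    return "paywall" in page.html_lower
```

For a large DOM this means the lowercase copy is built once per request instead of once per extractor.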

MRuecklCC commented 2 years ago

Another, very hacky prototype that avoided the expensive re-evaluations managed to bring the response time for spiegel-online.de from 35s down to 20s.