I did a couple of tests with a prototype implementation and measured response times for the following URLs with the current main branch vs. the prototype:

- https://www.br.de/mediathek/podcast/radiowissen/1968-das-ausnahmejahr/467076: from 24s to 18s
- https://learningapps.org/788879: from 24s to 14s
- https://amazon.de: from a 500 error (splash timeout) to 31s
- https://google.de: from 23s to 14.5s
- https://spiegel-online.de: from 35s to 68s

As one can see, running this concurrently can save up to 10s of response time (google and learningapps). For sites with a very large DOM and many external resources, the prototype implementation was too naive and recomputed some things multiple times (html.lower() etc.). I assume that even for those cases a non-trivial implementation would achieve a ~10s response time improvement!
A very hacky other prototype that avoided the expensive reevaluations managed to get the response time for spiegel-online.de down from 35s to 20s.
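A minimal sketch of how those reevaluations could be avoided, assuming the extractors share a single content object; the `PageContent` class and its `html_lower` property are hypothetical illustrations, not the actual MetaLookup structures:

```python
from functools import cached_property


class PageContent:
    """Hypothetical container shared by all extractors.

    Expensive derived values are computed once on first access and
    cached, instead of being recomputed by every extractor.
    """

    def __init__(self, url: str, html: str):
        self.url = url
        self.html = html

    @cached_property
    def html_lower(self) -> str:
        # Computed on first access only; later accesses return the cached string.
        return self.html.lower()


content = PageContent("https://example.com", "<HTML>...</HTML>")
assert content.html_lower is content.html_lower  # same cached object both times
```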
Currently, MetaLookup first forwards the received URL to splash (to get the HAR and HTML content). Depending on the site, this can take 10-30 seconds. Only after this has completed are the extractors called concurrently to extract their information from the complete HAR, URL, and HTML-content structure.

One of these extractors is the Accessibility extractor, which itself regularly takes more than 20s. However, this extractor does not need the HAR/HTML content, meaning we can run it concurrently with splash the moment a request comes in. This could cut total response times from ~40s to ~20s in some cases!
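A minimal sketch of what this could look like, assuming asyncio-based request handling; `fetch_via_splash`, `run_accessibility_check`, `run_remaining_extractors`, and `handle_request` are hypothetical placeholders, not the actual MetaLookup API:

```python
import asyncio


async def fetch_via_splash(url: str) -> dict:
    # Placeholder: forwards the URL to splash and returns HAR + HTML.
    await asyncio.sleep(2)  # stands in for the 10-30s splash round trip
    return {"har": {}, "html": "<html></html>"}


async def run_accessibility_check(url: str) -> dict:
    # Placeholder: only needs the URL, not the HAR/HTML content.
    await asyncio.sleep(2)  # stands in for the >20s accessibility check
    return {"score": 1.0}


async def run_remaining_extractors(url: str, splash_result: dict) -> dict:
    # Placeholder: extractors that do require the HAR/HTML content.
    return {"other_extractors": "..."}


async def handle_request(url: str) -> dict:
    # Start the accessibility check the moment the request arrives,
    # so it overlaps with the splash round trip instead of running
    # only after splash has finished.
    accessibility = asyncio.create_task(run_accessibility_check(url))
    splash_result = await fetch_via_splash(url)

    # The remaining extractors still need the splash output.
    results = await run_remaining_extractors(url, splash_result)
    results["accessibility"] = await accessibility
    return results


print(asyncio.run(handle_request("https://example.com")))
```

With this shape, the latency of the two slow steps is roughly max(splash, accessibility) instead of their sum, which is where the ~40s to ~20s estimate comes from.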