scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org

Investigate speeding up `MockServer()` #6255

Open wRAR opened 2 months ago

wRAR commented 2 months ago

On my machine, with coverage enabled and h2 installed (both add significant time), a single call of `tests.mockserver.MockServer()`, at least in `FeedExportTest.run_and_export()`, takes up to 6.5s. This is likely the main reason why our tests are so slow, especially the feed export tests, which start the mockserver many times per test function via `assertExported*()` (in the worst case, `assertExported()`). The slowest test, `tests/test_feedexport.py::FeedExportTest::test_export_feed_export_fields`, actually reaches the 120s limit here because it calls `MockServer()` 2*2*6=24 times.
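To put numbers on this: at up to 6.5s per start, 24 starts can add up to 156s, which is over the 120s limit. The per-start cost can be measured with something like the sketch below. It assumes the server runs as a child process that prints a line once it is ready; `time_until_ready` and the stand-in command are hypothetical names used only for illustration, not part of the test suite.

```python
import subprocess
import sys
import time


def time_until_ready(cmd):
    """Start cmd and return the seconds elapsed until it prints its first line."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        proc.stdout.readline()  # blocks until the child prints a line (or exits)
        return time.perf_counter() - start
    finally:
        proc.kill()
        proc.wait()
        proc.stdout.close()


# Stand-in child that only prints one line. Replace it with the real server
# command to measure the actual start-up time.
STAND_IN = [sys.executable, "-u", "-c", "print('ready')"]

per_start = time_until_ready(STAND_IN)
calls = 2 * 2 * 6  # number of starts in the slowest test, as counted above
print(f"one start: {per_start:.2f}s, {calls} starts: {calls * per_start:.2f}s")
```

Running the same measurement with and without coverage should show how much of the per-start cost comes from each source.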

wRAR commented 2 months ago

Looks like the main problem I experience is that, on this machine, trying to resolve an invalid domain (which we do on every import of `tests`) takes several seconds before the resolution error comes back. It looks specific to systemd-resolved in my case (edit: this was a misconfiguration, not intended behavior), but it may happen in other configurations too. So as one improvement we should only call https://github.com/scrapy/scrapy/blob/532cc8a517b31dca4ca28d0a35d25d1a790c9801/tests/__init__.py#L31 when running tests, or at least not inside the mockserver process.
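One possible shape for that improvement is to move the lookup out of import time and into a function that runs it lazily and caches the result. This is only a sketch: the function name, the `resolver` parameter (added so the check can be exercised without real DNS) and the placeholder host name are assumptions, not the existing code.

```python
import functools
import socket


@functools.lru_cache(maxsize=None)
def non_existing_resolvable(resolver=socket.getaddrinfo):
    """Return True if the resolver answers for a host that should not exist.

    The lookup runs on the first call only, so importing the module that
    defines this function never touches DNS.
    """
    try:
        resolver("non-existing-host", 80)  # placeholder host name
    except socket.gaierror:
        return False
    return True
```

With this approach, a process that imports the package but never asks the question (such as the mockserver subprocess) would not pay for the slow lookup, and tests that need the answer pay for it once.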

wRAR commented 2 months ago

The next step is doing something about the coverage-introduced slowness: it takes about 2 seconds to initialize the mockserver under `coverage run`, while it's almost instant otherwise. I'm not sure anything simple can be done here. I'm also not sure how coverage is being enabled for subprocesses, as we don't seem to do anything for that.
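A small probe can help answer the subprocess question. The sketch below starts a child interpreter and reports whether a trace function is active in it, whether the `coverage` module got imported, and which environment variables starting with `COV` it inherited. The `probe_child` helper is hypothetical, and the `COV` prefix is just a guess at how relevant variables might be named. A tracer installed through a mechanism other than `sys.settrace` would not show up in `trace_active`.

```python
import json
import subprocess
import sys

# Code executed in the child: report signs that coverage measurement is on.
CHILD_CODE = """
import json, os, sys
print(json.dumps({
    "trace_active": sys.gettrace() is not None,
    "coverage_imported": "coverage" in sys.modules,
    "cov_env": sorted(k for k in os.environ if k.startswith("COV")),
}))
"""


def probe_child(env=None):
    """Run a fresh interpreter and return its coverage-related state as a dict."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,  # None means the child inherits the current environment
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Calling `probe_child()` from inside a test run and comparing it with a call from a plain shell should show what, if anything, is switching measurement on in child processes.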

wRAR commented 2 months ago

Without coverage and h2, the mockserver start-up time here is about 0.4s; under `coverage run` but without h2 it's about 1.2s; under `coverage run` with h2 it's about 2.3s. cProfile clearly shows that importing `hpack.huffman_table` takes about 1.1s under coverage (it's instant without it). It doesn't seem possible to tell Twisted not to try importing the HTTP/2 modules, and passing `--source=scrapy` doesn't make coverage ignore that file. Another slow coverage-specific thing is importing `html.entities` (about 0.3s).
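To reproduce the per-module numbers without a full profile, a first import can be timed directly. This is a minimal sketch with a hypothetical `import_seconds` helper: it returns `None` when the module is already loaded, since a repeated import would only measure a cache hit.

```python
import importlib
import sys
import time


def import_seconds(module):
    """Return how long the first import of module takes, or None if already loaded."""
    if module in sys.modules:
        return None  # a repeated import would only measure the sys.modules lookup
    start = time.perf_counter()
    importlib.import_module(module)
    return time.perf_counter() - start


for name in ("html.entities", "hpack.huffman_table"):
    try:
        elapsed = import_seconds(name)
    except ImportError:
        print(f"{name}: not installed")
        continue
    if elapsed is None:
        print(f"{name}: already imported")
    else:
        print(f"{name}: {elapsed:.3f}s")
```

Running the script once with plain `python` and once under `coverage run` should make the difference visible. The interpreter's `-X importtime` option gives a finer per-module breakdown if more detail is needed.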