svanoort / pyresttest

Python Rest Testing
Apache License 2.0

Refactor test lifecycles to support parallel operations #31

Open svanoort opened 9 years ago

svanoort commented 9 years ago

As a pyresttest user I'd like to be able to parallelize the test execution (parallel HTTP calls).

TL;DR Summary of Analysis

  1. Worry about parallelizing network I/O first, then the rest, since it accounts for ~95% of runtime in most cases.
    • The remaining overhead is dominated by JSON parsing on extract/validate and curl object creation (caching plus curl.reset() will solve the latter)
  2. Resttest framework methods need to be refactored to isolate parts
    • (Re)Configure curl: Function to (re)generate Curl objects for given test (reusing existing if possible)
    • Execute curl: curl.perform -- multiplexed by CurlMulti or wrapper on same - gotcha: reading body/header.
    • Analyze curl: gather stats, return appropriate result type
    • Reduce results: Summarize benchmarks, add to pass/fail summaries, etc
    • Control flow: Break from loop if needed.
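The refactoring above can be sketched as four isolated functions plus a driver loop. This is a hypothetical sketch, not pyresttest's actual API: the names (`configure_curl`, `execute_curl`, `analyze_result`, `reduce_results`, `run_testset`) are illustrative, and the network call is stubbed out so only the control flow is shown.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool

def configure_curl(test, handle=None):
    """(Re)configure a curl handle for a test, reusing one if provided."""
    handle = handle if handle is not None else {}  # stand-in for pycurl.Curl()
    handle["configured_for"] = test["name"]
    return handle

def execute_curl(handle):
    """Stand-in for curl.perform(); returns (status_code, body)."""
    return 200, "ok"

def analyze_result(test, status, body):
    """Gather stats / run validators and build a result object."""
    return TestResult(test["name"], passed=(status == test.get("expected_status", 200)))

def reduce_results(results):
    """Summarize pass/fail counts across all results."""
    passed = sum(1 for r in results if r.passed)
    return {"total": len(results), "passed": passed, "failed": len(results) - passed}

def run_testset(tests, fail_fast=False):
    results, handle = [], None
    for test in tests:
        handle = configure_curl(test, handle)   # reuse the handle across tests
        status, body = execute_curl(handle)
        results.append(analyze_result(test, status, body))
        if fail_fast and not results[-1].passed:  # control flow: break if needed
            break
    return reduce_results(results)
```

Once the stages are isolated like this, the execute step can be swapped for a CurlMulti batch without touching setup or analysis.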

Need to start working out code for the above.

Precursor: using curl reset when reusing curl handles.

Look at using CurlMulti, see example: https://github.com/Lispython/pycurl/blob/master/examples/retriever-multi.py
See also: https://github.com/tornadoweb/tornado/blob/master/tornado/curl_httpclient.py

PyCurl Multi Docs: http://pycurl.sourceforge.net/doc/curlmultiobject.html#curlmultiobject
LibCurl: http://curl.haxx.se/libcurl/c/libcurl-multi.html

Using multiprocessing pools for process-parallel execution: http://stackoverflow.com/questions/3842237/parallel-processing-in-python.

Concurrency should be managed at the testset level; reasoning below.

Config syntax:


---
- config:
    concurrency: all   # maximum: one thread per test run
    concurrency: 1     # single thread, always serial
    concurrency: none  # another way to ensure serial execution
    concurrency: -1    # also serial, as is anything <= 1
    concurrency: 4     # up to 4 requests at once
    concurrency: 16    # up to 16 requests at once, if that many tests exist
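Normalizing those values could look like the following. This is a minimal sketch; the helper name `effective_concurrency` is hypothetical, but the semantics follow the proposed syntax above.

```python
def effective_concurrency(setting, num_tests):
    """Map a proposed 'concurrency' config value to a worker count.

    'all' -> one worker per test; None/'none' or any integer <= 1 -> serial (1);
    an integer N > 1 -> min(N, num_tests).
    """
    if setting == "all":
        return max(1, num_tests)
    if setting is None or setting == "none":
        return 1
    n = int(setting)
    return 1 if n <= 1 else min(n, num_tests)
```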

Implementation: all initial parsing runs first, then we decide how to execute (serial or concurrent). For concurrent execution, I see 4 levels of concurrency, each offering greater resource use and performance at the cost of greater complexity:

  1. Serial test setup/analysis, parallel network requests
    • Generate tests, then execute batches in parallel with CurlMulti and analyze results serially before next batch.
    • Execution is done using map(...) calls on functions, very clean.
    • Pros:
      • Fairly easy to do (?) with CurlMulti
      • Provides fixed batch execution methods
      • Avoids Process management
      • No worries about synchronization issues with tests themselves
    • Cons:
      • Network I/O idles while each batch is analyzed serially (batch barrier)
  2. Parallel execution, process does setup/execute/analyze and returns result
    • Each process does a full test/benchmark execution (setup, network call, return)
    • Basically do results = pool.map(run_test, tests)
    • Multiprocessing makes this easy, minimal code changes vs. current
    • Pros:
      • Easy, uses existing methods most effectively
      • Gives a more consistent concurrent load for load testing
      • Fully uses multiple cores
    • Cons:
      • Synchronization issues with generators, etc
      • Error handling & logging become a bit broken
      • Requires ability to gather all results at once before processing
      • Process management and similar headaches.
      • May not use networking as efficiently as CurlMulti does
      • Bottlenecked by serial processing to some extent
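A level 2 sketch, where each worker runs the full setup/execute/analyze cycle and returns a finished result. To keep the example runnable anywhere, it uses the thread-backed `multiprocessing.dummy.Pool` (same `map` API); a real implementation would likely use `multiprocessing.Pool` for true process parallelism, with the pickling and synchronization issues noted above. The `run_test` body is a stub.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

def run_test(test):
    # setup -> network call -> analyze, all inside the worker (stubbed here)
    status = 200
    return {"name": test["name"], "passed": status == test.get("expected_status", 200)}

def run_parallel(tests, workers=4):
    with Pool(workers) as pool:
        results = pool.map(run_test, tests)   # gather all results at once
    return results
```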
  3. Controller process, in parallel with a concurrent network I/O process
    • The controller process generates tests and feeds them to a concurrent network-request process, which executes them continuously and returns results asynchronously; the main thread analyzes the results as they arrive.
    • Network I/O uses CurlMulti, single thread does processing
    • Pros:
      • Gives a more consistent concurrent load for load testing
      • Network side fully decoupled from test overheads
    • Cons:
      • More complex than above two (combines them)
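The level 3 producer/consumer shape can be sketched with queues. Here `threading` and `queue` stand in for separate processes (and the worker body stands in for a CurlMulti loop) so the sketch is self-contained; the structure is the point, not the primitives.

```python
import queue
import threading

def network_worker(inbox, outbox):
    """Dedicated network loop: pull tests, execute, push results (stubbed I/O)."""
    while True:
        test = inbox.get()
        if test is None:                              # sentinel: shut down
            break
        outbox.put({"name": test, "status": 200})     # stand-in for CurlMulti execution

def controller(tests):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=network_worker, args=(inbox, outbox))
    worker.start()
    for t in tests:                                   # generate and feed tests
        inbox.put(t)
    inbox.put(None)
    results = [outbox.get() for _ in tests]           # analyze results as they return
    worker.join()
    return results
```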
  4. Controller process, parallel create/analyze processes, parallel network I/O process
    • One controller thread for orchestration which mostly does setup/cleanup
    • Tests are generated and analyzed by process pool
    • A network I/O execution pool receives curl objects to execute and runs callbacks when they complete so they can be processed.
    • Pros:
      • Very efficient
      • Maximum resource use
      • Allows tuning network and CPU bound concurrency separately
      • Very amenable to networked execution, just talk to controller
    • Cons:
      • Very complex
      • Needs to be able to continuously feed in work to analysis process pool (orchestrated by controller)
      • Needs

Analysis:

Test overhead:

Decision Point:

svanoort commented 8 years ago

Simple code example: https://fragmentsofcode.wordpress.com/2011/01/22/pycurl-curlmulti-example/

No need to worry overmuch about true full parallelism. A single executor thread with CurlMulti (concurrent networking) + batch analysis --> final result.

Use a work queue and worker processes if needed to handle/interpret responses (combined again by main thread into results).

AndrewFarley commented 6 years ago

I know this issue is now years old, but I'd like to throw a +1 into the ring. With the current trend in APIs being microservices, there's a growing need to parallelize testing of those APIs. Currently, a traditional "setup, do test, tear down" approach to testing larger microservice APIs can take a (relatively) long time. PyRestTest helps in that regard by allowing objects/artifacts/values to pass from one test to the next, which simplifies shared dependencies, but it (currently) runs tests non-concurrently. With a bit of effort, each test could be pre-loaded and evaluated for its dependencies to build a dependency graph, then the run could be optimized by performing the tests with the most dependents first, so that dependent tests fire as soon as their prerequisites complete. This would allow full parallelization of the tests, and in the microservice world could allow a full barrage of tests to complete in seconds, not minutes or hours.
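One way to sketch that dependency-graph idea: pre-load each test's dependencies, then repeatedly collect every test whose dependencies have completed into a "wave" that could run in parallel. The helper name `schedule_waves` is hypothetical, and a real implementation would execute each wave concurrently rather than just grouping names.

```python
def schedule_waves(deps):
    """deps maps test name -> set of test names it depends on.

    Returns tests grouped into waves; every test in a wave could run in
    parallel because all of its dependencies finished in earlier waves.
    """
    done, waves = set(), []
    remaining = dict(deps)
    while remaining:
        ready = [t for t, d in remaining.items() if d <= done]
        if not ready:
            raise ValueError("dependency cycle among: %s" % sorted(remaining))
        waves.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves
```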

Imagine the velocity gained by this approach. I've spent a few days googling for an API testing framework (or a testing framework in general) that can do this, and I haven't found any. But this framework comes awfully close, by allowing artifacts/values from previous steps to cascade into subsequent tests, and by supporting saving outputs from steps so that we can do a cleanup afterwards based on those outputs.

I guess I don't see a ton of movement on this project lately, but if it's still active and if anyone's interested, this might be something I would approach doing a PoC of with this framework since I really see the "future" in microservices will require test parallelization.