opentripplanner / OpenTripPlanner

An open source multi-modal trip planner
http://www.opentripplanner.org

Integration test rig for 1.x, 2.0-rc, and the future #2624

Closed: drewda closed this issue 2 years ago

drewda commented 6 years ago

Compare 1.4 against 2.0-rc in order to keep all parties’ work in sync.


Timing: Alongside or after inventory of 1.x API usage (#2621)

t2gran commented 5 years ago

@drewda Hi, you have mentioned a few times that Interline has a test rig that you use to compare OTP and Valhalla. Is it an open-source project? Can I get access to it?

We also have OTPQA, but I am not aware of its status. Personally, I would like a tool written in a Java-like language, so it would be easy for Java OTP developers to change.

At Entur we have two tools developed in-house. We have the SpeedTest in R5, which uses CSV files to run a set of samples and compare the results. The comparison has a good diff function which allows checking whether the actual result is worse or better than the expected one. The other tool is our integration tool, which just looks at success and response times.
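(For illustration, a minimal sketch of what such a CSV-driven comparison could look like, assuming a hypothetical `TestSample` CSV format and a simple duration-based diff; the real SpeedTest lives in the R5/OTP2 code base and is not reproduced here.)

```kotlin
import java.io.File

// Hypothetical sample: one planning request plus the expected best duration.
data class TestSample(
    val id: String,
    val fromLat: Double, val fromLon: Double,
    val toLat: Double, val toLon: Double,
    val expectedDurationSec: Int
)

// Outcome of comparing an actual result against the stored expectation.
enum class Diff { BETTER, SAME, WORSE, NOT_FOUND }

fun readSamples(csv: File): List<TestSample> =
    csv.readLines()
        .drop(1)                      // skip the header line
        .map { it.split(",") }
        .map {
            TestSample(
                id = it[0],
                fromLat = it[1].toDouble(), fromLon = it[2].toDouble(),
                toLat = it[3].toDouble(), toLon = it[4].toDouble(),
                expectedDurationSec = it[5].toInt()
            )
        }

// Compare the best duration the planner returned with the expected one.
fun diff(sample: TestSample, actualDurationSec: Int?): Diff = when {
    actualDurationSec == null -> Diff.NOT_FOUND
    actualDurationSec < sample.expectedDurationSec -> Diff.BETTER
    actualDurationSec > sample.expectedDurationSec -> Diff.WORSE
    else -> Diff.SAME
}
```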

t2gran commented 5 years ago

Goals

Primary users and scenarios

Comparing Trip plans

sheldonabrown commented 5 years ago

@t2gran we are coincidentally looking to do exactly the same thing. How far have you gotten? Any way we could split this up / work together?

t2gran commented 5 years ago

@sheldonabrown We have not started yet, and we plan to focus on getting full Transmodel/NeTEx OTP2 up and running (beta version) before we can work on this. If you want to take this on, I am more than happy to assist in any way I can: design review, code review - we could even have a phone meeting to kick-start this. During the development of the OTP2 Raptor (in R5) I made a test rig for comparing a stored "benchmark" result with my running code. There is a lot to learn from it, and even a lot of code which can be taken from it. This is probably the most difficult part of creating a tool like this, so I will be happy to contribute that code.

t2gran commented 5 years ago

I have started on a "prototype": https://github.com/t2gran/TripPlannerQA. I plan to make the backend part, but not any visualizer.

I have used Kotlin as the main language, and Spock and Groovy for unit testing. To store test cases and test results I have used MongoDB. These technologies should be familiar to most Java developers; they reduce the amount of boilerplate and allow for fast development. A relational database would probably be better for statistical analysis, but it should be easy to migrate at a later point in time.
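(For illustration, a minimal sketch of how a test case might be stored with the MongoDB Java driver from Kotlin; the database, collection, and field names are assumptions, not the actual TripPlannerQA schema.)

```kotlin
import com.mongodb.client.MongoClients
import org.bson.Document

fun main() {
    // Connect to a local MongoDB instance (assumed default port).
    val client = MongoClients.create("mongodb://localhost:27017")
    val testCases = client.getDatabase("tripPlannerQA").getCollection("testCases")

    // A test case is just a document: the planning request plus some metadata.
    val testCase = Document("id", "oslo-central-to-airport")
        .append("fromLat", 59.911).append("fromLon", 10.753)
        .append("toLat", 60.193).append("toLon", 11.100)
        .append("tags", listOf("airport", "rail"))

    testCases.insertOne(testCase)
    client.close()
}
```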

barbeau commented 5 years ago

For the record, here's the test rig we built for the OTP deployment at USF - it's a fork of some of the early test rig work that TriMet did for v1.x: https://github.com/CUTR-at-USF/test

I'd love to see a more robust test rig under the official OTP repo that multiple contributors could use - we'd be interested in moving to a new testing rig at USF as effort allows.

t2gran commented 5 years ago

@barbeau Thank you for the update, I will have a look at CUTR-at-USF/test.

I created the Trakpi project last night - my plan is to develop it during our testing phase of OTP2, so it is WIP for now.

t2gran commented 5 years ago

Just to make sure we are on the same page: the Trakpi project is for testing trip planning requests and measuring response quality. It is not an integration test tool; we do not test every endpoint in OTP, just the planning endpoint. The https://github.com/CUTR-at-USF/test project tests many endpoints with assertions - that would be a good complement to the new tool. The idea behind Trakpi is not to do assertions on each test case; instead we want to track and compare a large set of test cases over time, across planners and versions.

DerekEdwards commented 4 years ago

@t2gran I work with @sheldonabrown (he commented above) and we are looking to build something very similar to Trakpi. Do you know if anyone has suggested metrics to help identify what makes a 'good' trip? Defining those metrics sounds like a large question all by itself.

t2gran commented 4 years ago

The idea behind TrackPi is to be very flexible when it comes to the metric used to measure what a 'good' result is. So the strategy I want to use is to retrieve a large set of results for each test case - possibly using several planning requests with slightly different search parameters and without filtering the results. In OTP2 we can get a lot of results from just one search. In OTP1 each result is produced with a separate search (banning previously found trips).

If you manage to get most of the possible journeys, then you can use a pareto-set and a spread function to filter out the results you want, based on the criteria you want to optimize for. I want to create several metric functions (performance indicators) - some as simple as response time, while others can use the above combination of pareto-set and spread function, and yet another can be a function of other performance indicators.

Then, say you have the best 20 results for each test case - you can run your test cases again to see which config finds the best match based on a scoring function.
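(A minimal sketch of the pareto-set idea under some assumed criteria - duration, transfers, and walk distance, all minimized; the `Itinerary` type, field names, and scoring weights are illustrative, not the TrackPi API.)

```kotlin
// Illustrative itinerary with three criteria, all of which we want to minimize.
data class Itinerary(val durationSec: Int, val transfers: Int, val walkMeters: Int)

// a dominates b if it is at least as good on every criterion and strictly better on one.
fun dominates(a: Itinerary, b: Itinerary): Boolean {
    val noWorse = a.durationSec <= b.durationSec &&
            a.transfers <= b.transfers &&
            a.walkMeters <= b.walkMeters
    val strictlyBetter = a.durationSec < b.durationSec ||
            a.transfers < b.transfers ||
            a.walkMeters < b.walkMeters
    return noWorse && strictlyBetter
}

// Keep only the itineraries that no other itinerary dominates (the pareto-set).
fun paretoSet(all: List<Itinerary>): List<Itinerary> =
    all.filter { candidate -> all.none { other -> dominates(other, candidate) } }

// A simple scoring function used to rank the remaining itineraries.
fun score(i: Itinerary): Double =
    i.durationSec + 300.0 * i.transfers + 0.5 * i.walkMeters

// E.g. pick the 20 best results for a test case.
fun bestN(all: List<Itinerary>, n: Int): List<Itinerary> =
    paretoSet(all).sortedBy(::score).take(n)
```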

DerekEdwards commented 4 years ago

That sounds like a great approach.

One thing that I want to take into account is expected results for specific trip requests.

As we tune the parameters to give better and better results, how do we ensure that certain trips still give the expected results? For example, if I have a set of trips with a matching set of expected outcomes, how can I ensure that those expected outcomes are still met while continuing to improve the overall performance of the results?

barbeau commented 4 years ago

We used our testing rig for two primary cases at USF in the context of trip plans:

  1. To guard against breaking major trips - We planned out some frequently traveled trips in the USF area with an expected number of transfers and specific routes included (or things not included, like driving on pedestrian paths), and then made sure that these all passed as expected before doing a new release of a bundle or OTP code. This made sure that someone didn't make major changes to GTFS, OSM, or OTP code that broke trips traveled by a lot of people in the area.
  2. To document and test solutions for issues we discovered, and prevent regressions - When we found a trip that didn't behave as expected, we'd add a test case for that trip to make sure we didn't regress based on someone changing the OSM data again, etc. Some examples were U-turns not respected, pedestrian areas not used, shortcuts through parking lots, etc.

Here are the test cases we used in a Google Sheet, if you're curious: https://docs.google.com/spreadsheets/d/1f_CTDgQfey5mY1eMO03D7UZ8855D-mxHsfYfsA3c4Zw/edit?usp=sharing

This part of the project was put on hold shortly after we implemented it, so unfortunately we didn't fill out the entire test suite we originally had planned.
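(For illustration only, an assertion-style test case of the kind described in the list above might look roughly like this in Kotlin; the names and fields are hypothetical, and the actual CUTR-at-USF rig uses its own format.)

```kotlin
// Hypothetical assertion-style test case: guard a frequently traveled trip.
data class TripAssertion(
    val name: String,
    val from: Pair<Double, Double>,
    val to: Pair<Double, Double>,
    val maxTransfers: Int,
    val mustUseRoutes: Set<String>,     // e.g. a specific bus route that should appear
    val mustNotUseModes: Set<String>    // e.g. no driving on pedestrian paths
)

// Simplified view of a returned itinerary, used only to check the assertions.
data class PlannedTrip(val transfers: Int, val routes: Set<String>, val modes: Set<String>)

fun passes(case: TripAssertion, trip: PlannedTrip): Boolean =
    trip.transfers <= case.maxTransfers &&
            trip.routes.containsAll(case.mustUseRoutes) &&
            trip.modes.none { it in case.mustNotUseModes }
```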

t2gran commented 4 years ago

> As we tune the parameters to give better and better results, how do we ensure that certain trips still give the expected results? (@DerekEdwards)

The SpeedTest included in the #2860 PR does this. It uses text files for test cases and expected results, and also dumps the actual results so you can copy parts over into the expected results. It even tries to match "almost identical trips", for example when the walk distance for a leg changed. I used this to develop the Raptor algorithm together with performance metrics. The problem is that in a dynamic world with realtime data and possible changes to the map data and static transit data, it becomes hard to maintain over time.

In TrackPi we could implement something similar very easily. For example, we could have a key indicator on tagged itineraries: when running the test you would get a report saying that 95% of your tagged itineraries were found, and you should be able to navigate to the result and see which itineraries you did not find.
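(A rough sketch of such a key indicator, assuming itineraries can be matched by a tolerant equality check; the types, field names, and matching rule are illustrative only.)

```kotlin
// Illustrative expected itinerary: a tag plus the legs we expect to see.
data class ExpectedItinerary(val tag: String, val routes: List<String>, val walkMeters: Int)
data class ActualItinerary(val routes: List<String>, val walkMeters: Int)

// "Almost identical": same sequence of routes, walk distance within a tolerance.
fun matches(expected: ExpectedItinerary, actual: ActualItinerary, toleranceMeters: Int = 100): Boolean =
    expected.routes == actual.routes &&
            kotlin.math.abs(expected.walkMeters - actual.walkMeters) <= toleranceMeters

// Key indicator: share of tagged itineraries found, plus the tags that were missed.
fun taggedItinerariesFound(
    expected: List<ExpectedItinerary>,
    actual: List<ActualItinerary>
): Pair<Double, List<String>> {
    if (expected.isEmpty()) return 1.0 to emptyList()
    val missing = expected.filter { e -> actual.none { a -> matches(e, a) } }.map { it.tag }
    val foundShare = (expected.size - missing.size).toDouble() / expected.size
    return foundShare to missing
}
```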

If you would like to work on this together I think we should have a meeting to kick-start the development of TrackPi. The people at HSL have done some interesting research on testing OTP and I hope that they also would like to contribute.

DerekEdwards commented 4 years ago

@t2gran Yes, let's have a meeting. Your goals for TrackPi and my goals for OTP have a lot of overlap.

I want to:

  1. Develop a way to measure and continuously improve the quality of OTP trips.
  2. Develop a set of regression tests to ensure that, as changes and improvements are made to OTP, certain expected itineraries do not change.

Item 2 is the high-priority, immediate need, but item 1 will be needed soon after.

It seems like TrakPi is intended to handle both of these cases. Let's schedule a meeting when you have some time. We can take that conversation offline. My email is: dedwards@camsys.com

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days

t2gran commented 2 years ago

The Performance Test we have set up works as an integration test, so I will close this issue.