iRail / stations

A list of all the Belgian stations and their properties used within the iRail project
http://irail.be/stations/NMBS

Use in-memory store for station lookups. #111

Closed Bertware closed 6 years ago

Bertware commented 7 years ago

The problem: Too long execution time for station lookups

At this moment, both getStations and getStationFromID loop over the entire stations list. This consumes valuable execution time. getStations also does quite a lot of text replacement to match different spellings of the same station.

Given that a single liveboard can easily contain 25 entries, every 10ms reduction in execution time could add up to a reduction of 250ms. This means even small gains are worth it. As a possible side effect, the PHP code might run more efficiently, offloading the server a little.

Option 1: SQLite

The entire matching could be moved to an SQLite database. Here, we would have a table containing the standard name and the alternative name. A station could have multiple entries in it, one for every special spelling of that station. If a station doesn't have any special/different spellings, it's not in this table.

1) getStations($name) would perform a query on this table to see if we're handling a special spelling. If so, the corresponding standard name would be used. If not, we continue working with the original name parameter.

2) We now need to look up the station with its data. For this purpose, we could load the JSON-LD into SQLite too. We could set the id as primary key and index the station name. Searching would be done on the name parameter, using LIKE $name%, loading the full rows, parsing them and returning the result. This is expected to be a performance boost compared to looping over the entire file.

3) Searching by id would happen on the same table mentioned in step 2, only now we'd search on id. Again, we load the entire row, parse it and return it. (A rough sketch of these three lookups follows below.)
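
For illustration, a minimal sketch of these three lookups, assuming an illustrative stations.sqlite with an `alternatives` table (alternative_name, standard_name) and a `stations` table (id, name, data); none of these names exist in the project yet:

```php
<?php
// Sketch only: assumes a stations.sqlite file with two illustrative tables,
// `alternatives` (alternative_name, standard_name) and
// `stations` (id PRIMARY KEY, name, data), where `data` holds the JSON-LD row.
$db = new SQLite3('stations.sqlite', SQLITE3_OPEN_READONLY);

function getStations(SQLite3 $db, $name)
{
    // Step 1: check whether we were handed a special spelling.
    $stmt = $db->prepare('SELECT standard_name FROM alternatives WHERE alternative_name = :name');
    $stmt->bindValue(':name', $name, SQLITE3_TEXT);
    $row = $stmt->execute()->fetchArray(SQLITE3_ASSOC);
    if ($row !== false) {
        $name = $row['standard_name'];
    }

    // Step 2: prefix search on the indexed name column.
    $stmt = $db->prepare('SELECT * FROM stations WHERE name LIKE :name');
    $stmt->bindValue(':name', $name . '%', SQLITE3_TEXT);
    $result = $stmt->execute();

    $stations = array();
    while (($station = $result->fetchArray(SQLITE3_ASSOC)) !== false) {
        $stations[] = $station;
    }
    return $stations;
}

function getStationFromID(SQLite3 $db, $id)
{
    // Step 3: direct lookup on the primary key.
    $stmt = $db->prepare('SELECT * FROM stations WHERE id = :id');
    $stmt->bindValue(':id', $id, SQLITE3_TEXT);
    $row = $stmt->execute()->fetchArray(SQLITE3_ASSOC);
    return $row === false ? null : $row;
}
```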

We could load the stations into an SQLite database (containing both tables) using a PHP script, which could be run manually after the build script, or called from the build script itself.

To load the special spellings into SQLite, we could use the same PHP script and a station-alternatives.csv file, which would contain the same data as the table (original name | special spelling). This way, we can easily add names and run the script to regenerate the database table. An alternative would be to allow an unlimited number of special names on one line, so you'd have the original name followed by all its special spellings.
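
A sketch of what such a loader script might look like, assuming the JSON-LD exposes a @graph array with @id and name keys, and a pipe-separated station-alternatives.csv as proposed above (file names and schema are assumptions):

```php
<?php
// Sketch: build stations.sqlite from the JSON-LD station list and a
// hypothetical pipe-separated station-alternatives.csv
// (original name | special spelling, one pair per line).
$db = new SQLite3('stations.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS stations (id TEXT PRIMARY KEY, name TEXT, data TEXT)');
$db->exec('CREATE INDEX IF NOT EXISTS idx_stations_name ON stations (name)');
$db->exec('CREATE TABLE IF NOT EXISTS alternatives (alternative_name TEXT PRIMARY KEY, standard_name TEXT)');

// Load the JSON-LD station list (assumes a @graph array with @id and name keys).
$graph = json_decode(file_get_contents('stations.jsonld'), true);
$insert = $db->prepare('INSERT OR REPLACE INTO stations (id, name, data) VALUES (:id, :name, :data)');
foreach ($graph['@graph'] as $station) {
    $insert->bindValue(':id', $station['@id'], SQLITE3_TEXT);
    $insert->bindValue(':name', $station['name'], SQLITE3_TEXT);
    $insert->bindValue(':data', json_encode($station), SQLITE3_TEXT);
    $insert->execute();
    $insert->reset();
}

// Load the alternative spellings.
$insert = $db->prepare('INSERT OR REPLACE INTO alternatives (alternative_name, standard_name) VALUES (:alt, :std)');
$handle = fopen('station-alternatives.csv', 'r');
while (($row = fgetcsv($handle, 0, '|')) !== false) {
    list($standard, $alternative) = array_map('trim', $row);
    $insert->bindValue(':alt', $alternative, SQLITE3_TEXT);
    $insert->bindValue(':std', $standard, SQLITE3_TEXT);
    $insert->execute();
    $insert->reset();
}
fclose($handle);
```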

This would also clean up the code a little, since all matching code would be replaced by a single query.

Option 2: In-memory store

We could consider keeping the results in a memory cache. This cache could be filled by a cron job, so we would always have a cache hit. This seems an easier, but less flexible solution. It is also more server- and configuration-dependent, compared to the guaranteed, but likely lower, performance of SQLite.
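
Sketched with the Memcached extension; getStationsUncached() is a placeholder for the current lookup code, and the key naming and TTL are arbitrary:

```php
<?php
// Sketch: keep station lookups in memcached. getStationsUncached() is a
// placeholder for the existing code that loops over the full station list;
// a cron job could call this for common queries to keep the cache warm.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function getStationsCached(Memcached $memcached, $name)
{
    $key = 'stations_' . md5($name);
    $stations = $memcached->get($key);
    if ($stations !== false) {
        return $stations; // cache hit: no lookup work at all
    }
    $stations = getStationsUncached($name); // placeholder for the current slow path
    $memcached->set($key, $stations, 3600); // cache for an hour
    return $stations;
}
```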

Option 3: Combine

The best of both worlds: we could use SQLite backed by an in-memory store. Every cache miss would still get good performance, and the memory store would give a little extra boost.
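
Tying the two sketches above together, the combined approach could look roughly like this:

```php
<?php
// Sketch of option 3: check the in-memory store first, fall back to the
// SQLite lookup from the option 1 sketch on a cache miss.
function getStationsCombined(Memcached $memcached, SQLite3 $db, $name)
{
    $key = 'stations_' . md5($name);
    $stations = $memcached->get($key);
    if ($stations !== false) {
        return $stations;                   // in-memory hit: fastest path
    }
    $stations = getStations($db, $name);    // cache miss: still an indexed SQLite lookup
    $memcached->set($key, $stations, 3600); // warm the cache for the next request
    return $stations;
}
```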


These performance improvements might help #88 a little. (Not a duplicate, as this is one specific approach for a specific speedup, whereas #88 is a lot broader).

Also related to https://github.com/iRail/iRail/issues/265

I could implement the SQLite version in a fork to compare performance, after which we could evaluate whether to switch. Input and discussion wanted. @pietercolpaert

Bertware commented 7 years ago

Another advantage of SQLite is that we can run queries on other fields too, like stations near another station, or stations within x km of a given set of coordinates. This might make the code a little more future-proof.

First tests with SQLite and memcached.

Tests are run as 50 station queries taken from a Brussels-South liveboard. They are run a single time to simulate one station request (test 1), and 100 times for a server average across multiple clients (test 2). Tests are run inside a VirtualBox instance. For the memcached queries, a third test was executed to see how it scaled to 500 × 50 queries. The database is loaded and stored in memory (static) on every test run.
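
For clarity, the timing loop behind the "single" and "average over 100" columns below is roughly this sketch; the station names and lookupStations() are placeholders for whichever lookup method is under test:

```php
<?php
// Sketch of the benchmark: time 50 station lookups per batch, and average
// over a number of batches. lookupStations() stands in for the lookup
// method under test (normal, sqlite, memcached, ...).
$liveboardStations = array('Gent-Sint-Pieters', 'Antwerpen-Centraal', 'Schaarbeek' /* ... and the other liveboard entries */);

function benchmark(array $names, $runs)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        foreach ($names as $name) {
            lookupStations($name);
        }
    }
    return (microtime(true) - $start) * 1000 / $runs; // average ms per batch of lookups
}

printf("single: %.2f ms\n", benchmark($liveboardStations, 1));
printf("average over 100: %.2f ms\n", benchmark($liveboardStations, 100));
```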

| method | single (ms) | average over 100 (ms) | average over 500 (ms) |
|---|---|---|---|
| normal | 102.40 | 96.00 | |
| normal w/ memcached | 94.75 | 43.86 | 42.86 |
| sqlite | 113.56 | 117.72 | |
| sqlite w/ memcached | 122.66 | 63.10 | 67.68 |

The VirtualBox instance, and differences in processing power, might skew these results, but the relative differences should be about the same.

| method | single | average over 100 | average over 500 |
|---|---|---|---|
| normal | 100% | 100% | |
| normal w/ memcached | 92% | 45% | 44% |
| sqlite | 110% | 135% | |
| sqlite w/ memcached | 119% | 65% | 70% |

Conclusion

This was tested with a basic prototype implementation of SQLite3, but it seems the data set is still too small to see a real performance difference here, due to the loading overhead. However, memcached proved we can get a 55% speed improvement by using an in-memory cache.

Edit: MySQL and APC

Since the overhead of SQLite is likely caused by loading the file rather than by the query itself, I decided to give it a run on SQL (MariaDB). Results are below, using the same testing method. To prevent any outside influence on the comparison, the normal tests were run again and used as the reference (this prevents processor load from other applications from distorting the results).

| method | single (ms) | average over 100 (ms) | average over 500 (ms) |
|---|---|---|---|
| normal | 92.57 | 90.80 | |
| normal w/ memcached | 83.30 | 36.70 | 45.40 |
| normal w/ APC | 79.00 | 34.70 | 33.65 |
| SQL | 76.70 | 54.23 | |
| SQL w/ memcached | 65.87 | 28.63 | 27.12 |
| SQL w/ APC | 56.80 | 21.33 | 23.73 |

| method | single | average over 100 | average over 500 |
|---|---|---|---|
| normal | 100% | 100% | |
| normal w/ memcached | 90% | 40% | 50% |
| normal w/ APC | 85% | 38% | 37% |
| SQL | 82% | 59% | |
| SQL w/ memcached | 71% | 32% | 30% |
| SQL w/ APC | 61% | 23% | 26% |

As we can see, SQL removes the speed limit imposed by the SQLite loading. It also comes relatively close to the normal approach combined with memcached. A possible disadvantage is that this requires an SQL server.

When combined with APC, we see between 5 and 10% speed improvement over memcached. The final results are significant: when using SQL and APC combined, a benchmark starting from an empty cache is handled 40% faster than it is now, and when running the benchmark multiple times, the average single run is handled up to 77% faster than a request at this moment. Without moving to SQL, we get a 5% improvement on an empty cache and a 60% improvement on a cache hit. The 5% improvement for the single-run APC test should be disregarded.

How can a single benchmark starting from an empty cache still improve performance?

The caching mechanism (APC or memcached) stores the stations one by one, keyed by their query. When querying a list of liveboard stations with an empty cache, the cache gets filled along the way. For testing, the 8am liveboard of Brussels-South was used, which contains multiple entries for Ghent, Antwerp and Schaarbeek. Since we're caching at station-query level, only the first "Schaarbeek" query is actually executed; all the others result in a cache hit and quickly return the earlier calculated value.
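
In other words, caching at station-query level behaves like the sketch below (shown with the legacy apc_* functions; APCu uses apcu_* instead, and getStationsUncached() stands in for the slow lookup):

```php
<?php
// Sketch: caching at station-query level means duplicate station names in a
// single liveboard only trigger the expensive lookup once.
function getStationsApc($name)
{
    $key = 'stations_' . md5($name);
    $success = false;
    $stations = apc_fetch($key, $success);
    if ($success) {
        return $stations; // every repeated "Schaarbeek" entry lands here
    }
    $stations = getStationsUncached($name); // placeholder for the slow lookup
    apc_store($key, $stations, 3600);
    return $stations;
}

// A liveboard with duplicates: only three slow lookups instead of five.
foreach (array('Schaarbeek', 'Gent-Sint-Pieters', 'Schaarbeek', 'Antwerpen-Centraal', 'Schaarbeek') as $station) {
    getStationsApc($station);
}
```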

Conclusion

APC should be easy to install on the server, and is extremely easy to implement as well. Therefore, we should add it without hesitation. MySQL might add another 10-30% performance gain, at the cost of requiring an SQL server. This can be discussed.

Bertware commented 7 years ago

The final implementation reduces a liveboard benchmark to a value between 0.75 and 5 ms, a reduction of 95 to 99.17%. Therefore, there's no need for SQL here. However, the benchmarks above give some idea of the response times of the different technologies, and might be useful for other (iRail) projects too.

Final results:

```
php5 ~/.composer/vendor/bin/phpunit
Testing 100 liveboards took an average of 1747.603430748 ms for 1 liveboard, with APC disabled.

php5 -d apc.enable_cli=1 ~/.composer/vendor/bin/phpunit
Testing 100 liveboards took an average of 10.28352022171 ms for 1 liveboard, with APC enabled.

php7 ~/.composer/vendor/bin/phpunit
Testing 100 liveboards took an average of 92.941551208496 ms for 1 liveboard, with APC disabled.

php7 -d apc.enable_cli=1 ~/.composer/vendor/bin/phpunit
Testing 100 liveboards took an average of 0.79075813293457 ms for 1 liveboard, with APC enabled.
```

pietercolpaert commented 7 years ago

Great work @Bertware! For some reason I supposed that PHP would optimize getting files from disk, yet your tests clearly indicate this is not the case.

I have left some comments in pull request #112.

Bertware commented 7 years ago

I agree on abstracting this using PSR-6 and on keeping things running while no cache is available. Currently it already works without APC available; I only overlooked the clear-cache statement in the tests (all code is already wrapped in if statements that check whether APC is available).
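
For reference, such a guard boils down to something like this sketch (not the actual stations code):

```php
<?php
// Sketch: only touch APC when it is actually available, so the library keeps
// working on servers (and CLI test runs) without the extension.
function cacheGet($key)
{
    if (!function_exists('apc_fetch')) {
        return false; // no cache available, caller falls back to the slow path
    }
    return apc_fetch($key);
}

function cacheSet($key, $value, $ttl = 0)
{
    if (function_exists('apc_store')) {
        apc_store($key, $value, $ttl);
    }
}
```

A PSR-6 adapter would hide these checks behind the standard CacheItemPoolInterface.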

For the iRail servers, I would not use memcached. The additional improvements from APC are way too big to ignore. Note that the current code achieves far better performance than mentioned in the tests above (I'd say APC easily halves the time required by memcached). Using memcached might lower the priority of using APC ("we're already using caching, so APC is not top priority"). Therefore, I would do it right the first time and just install APC.

As I discussed in #88, we also need to move to PHP7. The install of APC could be combined with upgrading to PHP7. PHP7 is the way to go anyhow, and shouldn't break too much code (if it breaks anything at all). That 2000x speedup should also allow serving more requests from the same hardware, which in turn should lower costs. (I know that stations is only part of iRail, but any improvement is good, and I might look into improving the iRail API as well.)

pietercolpaert commented 6 years ago

I believe this issue can be closed!