mie-lab / trackintel

trackintel is a framework for spatio-temporal analysis of movement trajectory and mobility data.
MIT License

Profiling of trackintel - continuous integration #183

Closed abcnishant007 closed 2 years ago

abcnishant007 commented 3 years ago

Based on our discussion, I think it is better to have a memory profiler for trackintel. It will help us ensure that new pull requests do not significantly worsen the memory requirements. Such a profiler will become more relevant as we introduce multithreaded options for trackintel in general.

abcnishant007 commented 3 years ago

A direct application of mprof to an end-to-end demo script for trackintel gives the plot shown below:

[Figure: mprof memory-usage plot of the end-to-end trackintel demo script]

Now, how do we incorporate this as a feature in trackintel? @hong2223 Should we expose this as an option in individual functions, e.g. `log_RAM_usage=True/False`?
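To make the idea concrete, here is a minimal sketch of such a per-function option, implemented as a decorator. It uses the standard-library `tracemalloc` instead of mprof, and the decorator name `log_ram_usage` and the example function are hypothetical, not part of the trackintel API:

```python
import tracemalloc
from functools import wraps

def log_ram_usage(func):
    """Hypothetical decorator: report peak memory allocated while func runs."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            result = func(*args, **kwargs)
        finally:
            # get_traced_memory() returns (current, peak) in bytes
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
        print(f"{func.__name__}: peak allocation {peak / 1024:.1f} KiB")
        return result
    return wrapper

@log_ram_usage
def build_list(n):
    return list(range(n))

data = build_list(100_000)
```

A boolean keyword argument could then simply toggle whether the decorated path is taken. Note that `tracemalloc` tracks Python-level allocations only, whereas mprof samples process RSS, so the numbers are not directly comparable.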

henrymartin1 commented 3 years ago

Some inspiration: https://pandas.pydata.org/docs/development/contributing.html#running-the-performance-test-suite
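For reference, asv (which pandas uses for the suite linked above) discovers benchmarks by method-name prefix: `time_*` for runtime, `mem_*` for the size of a returned object, and `peakmem_*` for peak process memory. A minimal sketch of a benchmark module, with a stand-in workload rather than a real trackintel call:

```python
# Sketch of an asv benchmark module (e.g. benchmarks/bench_example.py).
# asv collects classes and runs setup() before each benchmark method.
# The sorting workload below is a placeholder; a real suite would call
# trackintel functions instead.

class SortSuite:
    def setup(self):
        # Build the input data fresh for each benchmark repetition.
        self.data = list(range(10_000, 0, -1))

    def time_sorted(self):
        # asv times this method's execution across commits.
        sorted(self.data)

    def peakmem_sorted(self):
        # asv records peak memory while this method runs.
        sorted(self.data)
```

`asv run` would then execute these benchmarks for each commit in the configured range and store the results for `asv publish`.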

abcnishant007 commented 3 years ago

@hong2223 may I request that this issue be reopened?

hongyeehh commented 3 years ago

Sure! We closed this and added it to the roadmap because we thought we would not work on it in the near future. Would you like to start working on this?

abcnishant007 commented 3 years ago

Yes, I was exploring asv-based continuous integration last weekend and it seemed doable, hence the previous comment. You can reopen now, or we can wait until I have pushed more commits to make sure it works; then we can open the issue as well as the pull request :) Whatever you suggest.

abcnishant007 commented 2 years ago

This is what I have so far. It appears to be working fine apart from some issues, which I will summarise as bullet points in the next comment. Attached are some screenshots for reference. We can discuss them in one of our meetings.

Speed benchmark, listed per function:

[Screenshot 2021-10-01 at 17:25:43]

Speed benchmark with plots, per function (x axis: commits; y axis: performance):

[Screenshot 2021-10-01 at 17:25:33]

Zoomed view of the above, with a pull request selected that shows a significant reduction in runtime. This is where we changed the for loop to a groupby. (da94120)

[Screenshot 2021-10-01 at 17:15:17]
henrymartin1 commented 2 years ago

@abcnishant007 What is the status of this issue? Reading through it, it looks almost finished. @bifbof is working on some nice performance increases; it would be great if we could track them.

abcnishant007 commented 2 years ago

This commit contains some snippets that came in handy in making this work. For the status of this issue, please refer to this comment below instead.

  1. `git log | grep -B 1 "Merge" | grep "commit" | sed 's/commit //g' | cut -c1-7 | head -n 10 > commits.txt` — collects the last 10 merge-commit hashes. Better to run this on master, so that no unnecessary merges are picked up.
  2. `asv run HASHFILE:commits.txt`
  3. `asv publish`
  4. Using this gist, we need to force GitHub Pages to see this directory.
  5. Quoting the asv docs: "This will put a tree of files in the html directory. This website can not be viewed directly from the local filesystem, since web browsers do not support AJAX requests to the local filesystem. To share the website on the open internet, simply put these files on any webserver that can serve static content. Github Pages works quite well, for example. If using Github Pages, asv includes the convenience command asv gh-pages to automatically publish the results to the gh-pages branch."
abcnishant007 commented 2 years ago

@henrymartin1 / @hongy A working solution is now live at https://abcnishant007.github.io/trackintel/, covering the last 20 merges into master and their corresponding performance improvements, measured on around 10 MB of Geolife data. Excellent improvements noted!

I chose this function based on @bifbof's comment on the pull request that his improvements target the `generate_staypoints` function. Over time, I will add more memory and time benchmark classes (on demand) for different functions. Creating new benchmark classes is almost a no-brainer, as I can simply copy and adapt the unit tests for specific functions. The pytest fixtures serve as examples for the setup functions inside the asv benchmarks.
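Translating a unit test into a benchmark along these lines is mostly mechanical: the fixture body becomes the `setup` method and the test body becomes a `time_*`/`peakmem_*` method. A schematic sketch of the pattern, where the data loader and the workload are stand-ins (a real benchmark would load a Geolife sample and call the trackintel API):

```python
# pytest style: data construction lives in a fixture, e.g.
#
#     @pytest.fixture
#     def example_positionfixes():
#         return load_small_geolife_sample()   # hypothetical helper
#
#     def test_generate_staypoints(example_positionfixes):
#         example_positionfixes.as_positionfixes.generate_staypoints()
#
# asv style: the same construction moves into setup(), and the test
# body becomes the benchmark methods.

class GenerateStaypointsSuite:
    def setup(self):
        # Stand-in for loading a small positionfix sample:
        # (user_id, latitude, longitude) tuples.
        self.pfs = [(i, i * 0.001, i * 0.001) for i in range(1_000)]

    def time_generate_staypoints(self):
        # Stand-in workload; a real benchmark would call generate_staypoints.
        max(self.pfs, key=lambda p: p[1])

    def peakmem_generate_staypoints(self):
        max(self.pfs, key=lambda p: p[1])
```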

Pending items are: ~1. Finally moving this to mielab/trackintel.github.io ... instead of abcnishant007~

What does not seem feasible (at least based on my current findings):

abcnishant007 commented 2 years ago

When viewing the benchmarks panel, it is best to rescale the browser window so that the different groups of tests (mem, time, and peakmem) are arranged into columns, as shown below:

[Screenshot 2022-07-06 at 18:13:57: benchmark groups arranged into columns]
abcnishant007 commented 2 years ago

This branch, https://github.com/abcnishant007/trackintel/tree/benchmark-files, contains the files used for benchmarking as well as the HTML files for the hosted GitHub Pages site.