Closed Tasignotas closed 9 years ago
Where should we put Benchmark_example.ipynb? It can't go at the top level of Topographica, of course. By default .ipynb files go in doc/Tutorials, though this one isn't really a tutorial at the moment. Maybe it's better to have it in topo/tests/?
There are various files in the pull request that have only changes in whitespace. It would be good to revert those files to the master version to simplify the pull request.
I have implemented all of the changes I wanted to implement.
The performance improvements I was planning to make will be described as separate issues once this pull request is merged.
@philippjfr, @jlstevens: are you happy with the implementation?
Yep, I'm happy for it to be merged immediately. Only one minor comment/question, do we really need all that benchmarking output to be committed?
Looks great!
That said, I agree with Philipp and definitely don't want to see Benchmarking.ipynb
and all the Lancet info/log files committed to master - the benchmarks will quickly go stale as the code improves and the hardware changes. I think it is fine if that stuff remains in your repository and I'm sure Jim will have a suitable copy!
Thanks for making those fixes! These files all look ready to merge, to me::
examples/gcal_sparse.ty topo/gpu/init.py (though I don't much like empty init.py files) topo/gpu/cuda_profiler.conf topo/gpu/projection.py topo/gpu/sheet.py topo/tests/data_traintests/gcal_sparse_gpu.ty_DATA topo/tests/gcal_sparse_gpu.ty topo/tests/gcal_sparse.ty topo/tests/runtests.py
We shouldn't have any of the benchmark output files in Topographica itself, but we should keep a copy in your project directory in the CSNG filesystem::
topo/tests/selected benchmarks/
I can't tell what we need this for::
topo/tests/timing_script.sh
It's referred to by the various benchmark outputs, but I don't see it listed in the remaining code, so I can't see where it would be called. If it's called somewhere, then I guess it's ready to merge too; otherwise it should be omitted.
That just leaves::
topo/tests/Benchmarking.ipynb
This file should have the output cleared, but then I think it is useful to have in the tests/ directory; it's short and shows how to do the benchmarks. Ideally it would move into examples/, with explanations, but at least it's something. It does need a paragraph of text at the top of it explaining what it's for and what it does.
After these changes, I think it's best to make a new pull request with a clean set of changes, so that we don't get all that benchmarking output into the git history. Thanks!
I'm not sure a cleared Benchmarking.ipynb
is going to be much use. Either it isn't cleared in order to show the results (not recommended) or requires the benchmark data which we don't want to see committed.
As you say, it does show how to do benchmarks which I do agree is a valuable thing and certainly worth keeping somewhere. I'm just not sure committing it to master (namely, an untested notebook that will go stale) is the appropriate solution.
Maybe keep it around as a gist? Or maybe commit it to the ioam.github.com repository somewhere?
What do you mean, "requires the benchmark data"? Doesn't it generate the benchmark data? That's certainly what I would want of such an example.
I mean that to reproduce the particular set of benchmarks shown in the uncleared version of the notebook you would need the corresponding benchmark data (generated using Lancet).
Of course, the notebook should be able to generate and plot new benchmarks if necessary. The value of the notebook could be to generate new benchmarks but then I would say that for development something like airspeed velocity would be more useful (i.e. benchmarks as new commits are made). This makes sense as I am sure we are convinced by now that GPUs can certainly give us a worthwhile speed up!
Well, the notebook itself shows how to generate the benchmark data using Lancet, so there is no separate step needed, so I don't think we need to focus on the specific set of benchmark data available so far.
Yes, airspeed velocity would be great to set up, but that's a different use case -- testing different versions over time. This example is about how to piece apart the various main components of the runtime so that we can see how they will scale.
Overall, what I would most like (given that no one is currently volunteering to set up airspeed velocity for us) is to eventually have this .ipynb file as a runnable example (which requires it to be tested, of course) in examples/, set up ready to generate benchmarking data for any list of .ty files or parameter values, etc. Since it's not currently set up that way, I'm slightly in favor of putting it in topo/tests for now anyway so that it's at least available as such an example, even if not fully ready to go for new work, but it's true that it could be a gist instead. It's not clear how anyone would find the gist, but that's a little bit true of topo/tests as well.
In general, I just want to make sure that Ignotas' hard work in developing the .ipynb file is available to the next person who wants to do benchmarking. If the next person to do benchmarking is Jean-Luc or Philipp, it's probably best if the approach is generalized and put somewhere more central so that we can use it regularly, and tests will need to be written for it. If so, wherever you guys want to put it is fine. If not, let's keep it in topo/tests for our reference; yes it will go stale, but that's still better than our existing (i.e. non-existent) benchmarking code.
Ok. I am happy for it to go in topo/tests
along with the timing script.
The reason why I wanted to include some sample benchmarks is that it takes a long time to run them. Also, in order to be able to draw the first couple performance graphs, different Topographica versions would have to be checked out and benchmarked, which would take at least a few hours. But as requested, I'll remove the sample benchmarks that I have added.
I'll also improve the description of the benchmarking procedure in Benchmarking.ipynb
Just because we would prefer not to include the benchmarks on master doesn't mean we are throwing them away!
We definitely want to keep a record of your results but that doesn't necessarily require us to commit it to the main repository. I'm sure we will find a suitable place to store these benchamarks...
Anyway, looking forward to seeing this merged!
Just to be clear, why would different Topographica versions be required? Once this is merged, won't everything be in the same branch? I'd certainly want an example that doesn't required different checkouts, and I can't see why those would be needed anyway.
It's just the way I set my benchmarking procedure - I would distinguish between different versions of Topographica by adding a tag to each of them and then running the benchmark. For example, when I wanted to benchmark a particular old version of GPU Topographica, I just went back to a certain commit I made a couple of months ago, tagged it and benchmarked it. I could then group all benchmarks that have the same tag and graph it. Also, everyone else can just checkout the same tag and reproduce my results.
Ok. Please just make the .ipynb file use the latest version, without trying to go back in history; that way it will make sense on master. And then we're ready to merge!
Yep, that's done automatically in the timing_script.sh
by calling git describe --tags
.
Ok, that should be it.
After these changes, I think it's best to make a new pull request with a clean set of changes, so that we don't get all that benchmarking output into the git history. Thanks!
Do we still care about this or are we okay having all this output in the history?
Looks great! I'm happy for the final result of this to be merged, as it's all clean now. But I don't want the history to go into Topographica's master -- I don't think we have any reason to go back in history for this project, and it's much simpler without it. So, Philipp, can you please tell Ignotas how to achieve that? Thanks!
This should be fairly straightforward but make sure you have a backup copy of the current branch so these benchmarking files don't get lost. What you'll want to do is first to branch from the latest commit this PR is based off. Once you have this branch you can apply git filter-branch
to it to delete specific files from the entire history. This Stackoverflow post explains in detail how to go about it.
The command will look something like:
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch topo/tests/selected_benchmarks" -- --all
But I don't just want to delete specific files, I want to delete all history. In the worst case, can't he simply get these changes as a single diff, apply the diff to the master, and make a new branch out of that?
I think it's pretty bad practice to squash the entire history of his changes into a single commit. The approach above will retain all commits that aren't concerned with the files we want to delete. If you really just want to squash everything into a single commit, then your approach is fine. If I was Ignotas though I'd be pretty annoyed if several months of my contributions to a project were squashed into a single commit.
I completely agree with Philipp and I think squashing all of Ignotas' work to a single commit defeats the point of version control.
Phillipp's suggestion of running filter-branch
seems most reasonable to me and I wouldn't even mind merging everything now - the Lancet output for the benchmarks is all just text (i.e small and easily compressed by git).
On the other hand, if the benchmark notebook was ever committed without being cleared, then a filter-branch for that (followed by a re-commit) would be a good idea.
The point of version control for Ignotas' project is entirely separate from the point of version control for Topographica. Each of the individual commits can be preserved in this fork; there's no sense in which they must disappear! This fork can happily act as the record of that project. But as far as Topographica is concerned, I want the result of that project, and in this case I strongly believe that the history of commits is not useful for people reviewing the history of Topographica development.
In the final PR, there are 11 files covered, including 9 brand-new files by my count, 1 existing file deleted, and 1 existing file changed. But if you look back at the 90 different commits in the history, there were tons of changes to Topographica's existing files. Putting those transitory changes into the history of those existing files would needlessly complicate the job of anyone who needed to follow those files over time. I fought hard to end up with a simple, clear set of changes that can be applied to Topographica, and including the 90 commits that it took to whittle those changes down to a clean, meaningful set makes that work moot. Please don't do it!
But if you look back at the 90 different commits in the history, there were tons of changes to Topographica's existing files.
If Ignotas does a filter-branch to remove everything except the 11 files in the final PR from the history, then there won't be any commits associated with any other files that may have been touched. In other words, there will only be a history for the files we actually want changed after merging.
That is exactly what filter-branch
is useful for: Philipp's suggestion is entirely appropriate whereas squashing everything down to a single diff is simply not the right thing to do.
But in that case the history will be nearly useless, because the actual development of those files took place on the corresponding existing files (e.g. examples/gcal_oo_or.ty), not the final locations of the new files (which are sparse or gpu specific). The partial history that would be retained after that operation seems very unlikely to be of value to anyone.
Basically, if the history of these files really showed the process of their development, without messing with lots of other unrelated Topographica files, then yes, we'd want the history. But that's not true in this case -- there is very little history associated with the new files, and the history associated with the old ones is just going to be confusing rather than helpful, except in the context of following the progress of Ignotas's project chronologically (which is not relevant to Topographica users per se).
Ok. I think I might agree in the end but only because (from the sound of it) the files we are interested in were moved to their present location at the very end of the project. If the files had been in the right place all a long I would not want to erase their history!
One approach to this is an 'interactive rebase' which would allow Ignotas to discard inappropriate commits, squash multiple commits together etc.
But why bother? All this means is that a filter-branch
based on the current file locations will result in a very short history based on where the files are now (which is what you want).
Sounds like I agree in the end too, simply because it sounds like a giant pain to make clean commits with any amount of history. I think a filter-branch leaving only changes to the files touched in the final commit would still produce sensible results. So would a interactive rebase, but that might be more effort. The easiest thing might be just as you suggest, make a clean clone of topographica and then apply the final diff as a number of sensible chunks grouped into just a few commits.
Exactly; they were developed in place, without regard to their final locations, and only moved at the end. Indeed, had they been separate from the start there would be useful history to retain! That's why I personally think a single diff/patch approach will be more transparent: this makes it clear there is no history, and the commit message (or more than one, if there are meaningful groups to break into multiple commits) can say where to go to find the history.
I made a new branch with all of the commits tidied up: https://github.com/ioam/topographica/pull/621
These are the changes I've been working on for my bachelor's project - a GPU-based version of Topographica that with sparse projections, making the simulations several times faster.
A detailed description of the architecture of the GPU Topographica together with design justifications, implementation issues and benchmarking results can be found on http://homepages.inf.ed.ac.uk/s1137931/thesis.pdf.
Before this can be merged into master, I am planning to refactor this version and implement a couple of performance improvements I missed initially.