Closed: sandybrownlee closed this issue 7 years ago.
Yes, that's intentional IIRC. Do you think it shouldn't? As it's local search, I was thinking it's useful to be able to backtrack... consider that execution times are noisy.
I guess that's basically asking for a re-evaluation of the original, unaltered source? I was seeing empty Patches taking the "best" run time when I suspect they shouldn't have - i.e. after some other patches had been found that did offer an improvement. I will check this in the morning.
I think that's probably due to noisy execution times. This is a general problem with non-functional optimisation. I think doing something rigorous here would be great, e.g. taking a similar approach to "racing" algorithms as in irace.
Got it. The JUnit tests were failing, so all program variants were running in about 1ms (I thought that the run was a bit fast!). Due to noise, this meant that the fastest patch varied.
The reason the JUnit tests were failing was down to a copy+paste error in the example test class I wrote. I'll raise a separate issue for that.
btw - agree wrt racing for handling noise.
I have used irace before, but maybe my understanding is not 100% correct. To apply racing in our setting, we’d need to consider sets of patches as well as sets of test cases, right? Then, different configurations (patches) can race each other across different test cases. Note that irace does not re-evaluate config-instance-tuples, if I remember correctly, and also it does not necessarily run all configs on all problems. Do we want to go there right away? Note that we tend to be in the multi-objective space, or at least in the lexicographic space, so for irace to work, we'd need to post-process the test results into a single number, as otherwise I am not sure how the rank-based Friedman test can be applied.
A general question: what are the objectives, and do we want to have different modes for them? For example:
Mode 1) Minimise the number of failed test cases.
Mode 2) Minimise the number of failed test cases, while also minimising runtime (lexicographic single-objective optimisation).
Mode 3) Minimise the number of failed test cases, while also minimising runtime (multi-objective optimisation).
...
Apologies if I have missed these details in other discussions.
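Just to make Mode 2 concrete, a minimal sketch of a lexicographic comparison in Java; `TestResult` and its fields are invented for illustration and are not gin's actual types:

```java
import java.util.Comparator;

// Hypothetical container for one evaluation of a patch; names are invented.
class TestResult {
    final int failedTests;    // primary objective: minimise failed test cases
    final double runtimeMs;   // secondary objective: minimise runtime

    TestResult(int failedTests, double runtimeMs) {
        this.failedTests = failedTests;
        this.runtimeMs = runtimeMs;
    }
}

class LexicographicFitness {
    // Mode 2: compare on failed tests first, break ties on runtime.
    static final Comparator<TestResult> MODE_2 =
            Comparator.comparingInt((TestResult r) -> r.failedTests)
                      .thenComparingDouble(r -> r.runtimeMs);
}
```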
Thanks for chiming in, Markus.
I can see a few ways of implementing a racing approach:
- Racing two (or n) patches across a set of test cases (this fits well with local search or tournament selection)
- Repeatedly sampling the same set of tests with two patches until we can determine a difference between them with a certain statistical confidence.
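To make the second option concrete, a rough sketch of such a race, assuming a hypothetical `timeRun` measurement hook and a placeholder `Patch` type, and using Apache Commons Math's Mann-Whitney U test as one possible rank-based test (none of this is gin's current API):

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PatchRace {
    interface Patch { }            // stand-in for gin's patch representation
    static final double ALPHA = 0.05;
    static final int MAX_SAMPLES = 50;

    /** Returns the faster patch, or null if no significant difference emerges within budget. */
    static Patch race(Patch a, Patch b) {
        List<Double> timesA = new ArrayList<>();
        List<Double> timesB = new ArrayList<>();
        MannWhitneyUTest test = new MannWhitneyUTest();
        for (int i = 0; i < MAX_SAMPLES; i++) {
            // Interleave measurements (ABAB...) so machine-state drift hits both patches equally.
            timesA.add(timeRun(a));
            timesB.add(timeRun(b));
            if (i >= 4) {  // collect a few samples before testing
                double p = test.mannWhitneyUTest(toArray(timesA), toArray(timesB));
                if (p < ALPHA) {
                    return median(timesA) <= median(timesB) ? a : b;
                }
            }
        }
        return null;  // budget exhausted, no clear winner
    }

    static double[] toArray(List<Double> xs) {
        return xs.stream().mapToDouble(Double::doubleValue).toArray();
    }

    static double median(List<Double> xs) {
        List<Double> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2)
                          : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Hypothetical hook: run the test suite under the patch once, return wall-clock time in ms.
    static double timeRun(Patch p) { throw new UnsupportedOperationException("stub"); }
}
```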
Do we want to go there right away?
Not necessarily, but I think we need to work out some way of solving the execution time comparison problem. Alternatives include counting the number of bytecodes executed, or using a model to estimate expected execution time, for instance.
The immediate question is: how is this solved in other software? I know that some work takes the 3rd quartile as a conservative estimate of the execution time, for example.
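For reference, the 3rd-quartile idea over repeated measurements is tiny to implement; a sketch in plain Java:

```java
import java.util.Arrays;

class ConservativeTiming {
    /** Upper quartile (Q3) of repeated runtime measurements, using the nearest-rank method. */
    static double thirdQuartile(double[] runtimesMs) {
        double[] sorted = runtimesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.75 * sorted.length);  // 1-based nearest rank
        return sorted[Math.max(0, rank - 1)];
    }
}
```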
Counting bytecodes does give the same number each time, which makes comparison easier.
The library I used for this (bycounter - https://sdqweb.ipd.kit.edu/wiki/ByCounter) doesn't appear to have been updated since 2015. The library isn't designed for parallel execution in multiple threads in the same JVM, and I'm unsure if it works for Java 7+.
Adding modelling info might capture more of the nuances of the execution time of the JVM, though it might be a reasonable amount of work in itself.
In other work, I picked the best time from repeated (3) runs. This worked fine where performance improvements were relatively obvious. I also used a post-analysis phase to get a more accurate quantification of time savings. The more repeats the better, but it is expensive.
As you mention, David, maybe a phased/racing approach? Where any program passing all tests is instrumented for counting bytecodes, and the fastest of those are then subjected to repeated timing runs to get a clear ordering?
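A sketch of how that phased pipeline might hang together; `passesAllTests`, `countBytecodes` and `timeRun` are purely hypothetical hooks, not existing gin methods:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class PhasedEvaluation {
    interface Patch { }  // stand-in for gin's patch representation

    static Patch selectBest(List<Patch> variants, int shortlistSize, int repeats) {
        // Phase 1: keep only variants that still pass the whole test suite.
        // Phase 2: rank survivors by deterministic bytecode count, keep the cheapest few.
        List<Patch> shortlist = variants.stream()
                .filter(PhasedEvaluation::passesAllTests)
                .sorted(Comparator.comparingLong(PhasedEvaluation::countBytecodes))
                .limit(shortlistSize)
                .collect(Collectors.toList());

        // Phase 3: time the shortlist repeatedly and pick the best observed runtime.
        return shortlist.stream()
                .min(Comparator.comparingDouble(p -> bestOf(p, repeats)))
                .orElse(null);
    }

    static double bestOf(Patch p, int repeats) {
        double best = Double.MAX_VALUE;
        for (int i = 0; i < repeats; i++) {
            best = Math.min(best, timeRun(p));
        }
        return best;
    }

    // Hypothetical hooks into the evaluation machinery:
    static boolean passesAllTests(Patch p) { throw new UnsupportedOperationException("stub"); }
    static long countBytecodes(Patch p)    { throw new UnsupportedOperationException("stub"); }
    static double timeRun(Patch p)         { throw new UnsupportedOperationException("stub"); }
}
```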
(by the way: tell me if this is a discussion for the slack channel instead)
Alright, let me see how well this email-to-comment conversion gets along with in-text commenting as well as embedded images...
On 11 Nov 2017, at 5:13 am, Brendan Cody-Kenny <notifications@github.com> wrote:
Counting bytecodes does give the same number each time which makes comparison easier.
This is true.
I know you are offering a proxy for runtime… how do we deal with duplicated function calls or even system calls, which increase the number of bytecodes only a little, but increase the runtime by a big delta? Maybe this is what the modelling you are mentioning is good for. If so: is this for gin 1.0 or for a gin-hack?
The library I used for this (bycounter - https://sdqweb.ipd.kit.edu/wiki/ByCounter) doesn't appear to have been updated since 2015. The library isn't designed for parallel execution in multiple threads in the same JVM, and I'm unsure if it works for Java 7+.
Despite the 2015 timestamp, I have only found a 2013 version: https://sdqweb.ipd.kit.edu/eclipse/ByCounter/releases/
Adding modelling info might capture more of the nuances of the execution time of the JVM, though it might be a reasonable amount of work in itself.
Agreed.
In other work, I picked the best time from repeated (3) runs. Worked fine where performance improvements were relatively obvious.
May I ask: what was the order of magnitude of the measurements, and the overall distribution? Great if the changes really were obvious. We have seen the worst of all worlds with our energy measurements (read: lots of obvious and not-so-obvious system- and time- and …-dependencies).
I know that some process has to be chosen… the more thoroughly constructed (with a tiny fraction of proper justification), the better.
I also used a post analysis phase to get a more accurate quantification of time savings.
This is definitely something that the user needs to do.
Well, aside from the following easy post-processing that I have seen in irace: a final testing phase can be run in which all "good" configurations are run and then their performance is printed. In my current research, I rerun the algorithm 10k times to get stable means. Yes, 10k might not be suitable - this is up to the user to configure. —> This points us at the following: ideally, gin supports a manual mode, e.g. "apply these patches and run them 10k times". I know this is just a wrapper, but I as a researcher would appreciate it.
Slightly off-topic: I can dig out a reference for a statement like the following: "it is better to run more test cases once than to run a few of them many times" (there is a proper mathematical justification for this).
The more repeats the better but it is expensive.
Agreed.
Maybe a phased approach? Where any program passing all tests is instrumented for counting bytecodes, and the fastest of those are then subject to repeats to get a clear ordering?
As I have no data to support a strong correlation of bytecode counts and times, I am a bit hesitant, but I am open to it. As an algorithms person, I like pure approaches ;)
A brief digression on "counting the number of instructions executed" follows… Stack Overflow: Java count individual bytecode instructions executed: https://stackoverflow.com/questions/42215583/java-count-individual-bytecode-instructions-executed —> With respect to jawa, I see the following comment on the Stack Overflow page:
==> Counting the executed statements seems to be supported via a "trivial JVM" (?).
jawa: my python configuration is completely messed up, but I should be able to run it on an optlogserver if necessary.
Alternatives (from the Stack Overflow page): (1) ByCounter (as mentioned before) seems to be available without problems: https://sdqweb.ipd.kit.edu/eclipse/ByCounter/releases/2.0/ (development appears to have stopped in 2013, which is not necessarily a bad thing). (2) ASM (still developed in 2016): http://asm.ow2.org/ ==> I could not find how to count executed statements with either tool (a rough sketch of the general ASM approach follows at the end of this digression).
Funky alternatives: (3) HotSpot JVM, http://openjdk.java.net/groups/hotspot/, with command-line flags from https://stackoverflow.com/questions/13811082/how-to-count-number-of-bytecodes-executed-in-java This comes with the flag -XX:+CountBytecodes, which sounds nice - it might just count globally, not on a method level, which could be achievable with jawa, ByCounter, ASM, ... ==> This requires building a "debug version" of the JVM (which is not the standard JVM I had on my system) :-O I have followed the instructions on http://jcdav.is/2015/08/26/building-a-debug-jvm/ to the extent possible. Summary of the last three hours: no success with compilation on my Mac. The situation might be quite different on your machines.
Details on instructions: —> I have followed a whole bunch of examples: (1) http://jcdav.is/2015/08/26/building-a-debug-jvm/ ==> does not compile
Future references: JVM options (which might only be available in debug mode). Lists (super many!): http://pingtimeout.github.io/jvm-options/ and http://stas-blogspot.blogspot.com.au/2011/07/most-complete-list-of-xx-options-for.html
-XX:+CountBytecodes —> https://stackoverflow.com/questions/25114261/best-method-to-count-byte-codes-executed-for-a-java-code
-XX:+CountBytecodes and -XX:-UseCompiler (???) —> https://stackoverflow.com/questions/13811082/how-to-count-number-of-bytecodes-executed-in-java
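As referenced above, a heavily simplified sketch of the general ASM recipe for counting executed instructions: rewrite every method so that each visited instruction also bumps a counter. This is presumably similar in spirit to what ByCounter does, but it is not its actual code; only two of the visitXxxInsn methods are shown, and the `BytecodeCounter` class name is invented:

```java
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Global counter that the instrumented code increments; name and default package are illustrative.
class BytecodeCounter {
    public static long executed = 0;
    public static void inc() { executed++; }
}

// Wraps a ClassWriter (or another visitor) and rewrites every method so each visited
// instruction is preceded by a call to BytecodeCounter.inc().
class CountingClassVisitor extends ClassVisitor {
    CountingClassVisitor(ClassVisitor next) { super(Opcodes.ASM5, next); }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
        return new MethodVisitor(Opcodes.ASM5, mv) {
            private void count() {
                super.visitMethodInsn(Opcodes.INVOKESTATIC, "BytecodeCounter",
                                      "inc", "()V", false);
            }
            @Override public void visitInsn(int opcode) { count(); super.visitInsn(opcode); }
            @Override public void visitVarInsn(int opcode, int var) { count(); super.visitVarInsn(opcode, var); }
            // The remaining visitXxxInsn methods need the same treatment; note also that
            // JIT compilation and inlining can make raw counts a poor proxy for wall-clock time.
        };
    }
}
```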
On 10 Nov 2017, at 10:22 pm, David R. White <notifications@github.com> wrote:
Thanks for chiming in, Markus.
I can see a few ways of implementing a racing approach:
Racing two (or n) patches across a set of test cases (this fits well with local search or tournament selection)
An interesting idea. Normally, races are run on incrementally increasing sets of test cases. So your suggestion is to repeatedly run (up to a limit) until a statistically significant difference is established, potentially allowing the set of tests to grow over time, and potentially allowing duplicate runs on the same tests?
Mind if I reach out to a colleague who works on irace, for a casual chat?
Note: I am super wary of sub-second runtimes and high variances on modern multi-core platforms.
Repeatedly sampling the same set of tests with two patches until we can determine a difference between them with a certain statistical confidence. [...while considering a maximum number of repetitions…]
Okay. A rank-based test on the runtimes should be fine.
Fun fact: I'd love to run both tests (or all of them) alternatingly, as effects can indeed build up over time on machines. I mean: better ABABABABAB than AAAAABBBBB. Okay? Yay for flexible frameworks...
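Something like the following scheduling, with `timeRun` again just a placeholder for whatever measurement we settle on:

```java
// ABAB scheduling: alternate measurements of the two variants so that drift in machine
// state (thermal throttling, background load, JIT warm-up) affects both equally.
class InterleavedTiming {
    interface Patch { }  // stand-in for gin's patch representation

    static double[][] measureAlternating(Patch a, Patch b, int repeats) {
        double[] timesA = new double[repeats];
        double[] timesB = new double[repeats];
        for (int i = 0; i < repeats; i++) {
            timesA[i] = timeRun(a);   // A
            timesB[i] = timeRun(b);   // B
        }
        return new double[][] { timesA, timesB };
    }

    // Hypothetical measurement hook: run the test suite under the patch, return wall-clock time.
    static double timeRun(Patch p) { throw new UnsupportedOperationException("stub"); }
}
```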
Do we want to go there right away?
Not necessarily, but I think we need to work out some way of solving the execution time comparison problem.
Agreed!
Alternatives include counting the number of bytecodes executed, and using a model to estimate expected execution time, for instance.
Surrogates/proxies are nice. Do we know anything about the portability of such models across various machines and operating systems? I know, "a start" is better than "no start"; I am just pointing out that there are dragons ahead. Definitely an opportunity for a future gin hack. If we could provide a framework to create comprehensive models for a machine, that would be an awesome contribution by itself - Google did not help me much here just now.
Immediate question is - how is this solved in other software? I know that some work takes the 3rd quartile as a conservative estimate of the execution time, for example.
Citation, please? :)
This is delving into territory that I'm less familiar with, but we ought to be careful with counting bytecodes because of JIT. The approach that Nathan Burles developed for measuring energy using bytecode tracing ran into issues when JIT was enabled - aside from producing stochastic results, I guess it would probably also vary per CPU and OS.
Certainly as a proxy it would be okay, particularly if the differences in run time are relatively big, but it would possibly need to be supplemented with something else (though I'm not sure what!).
Some super-useful comments here, although we've departed a long way from the original bug report!
@sandybrownlee - please could you confirm whether this was a bug? Otherwise we should close this!
@markuswagnergithub @sandybrownlee @codykenb - I'm going to start a channel on Slack for this conversation and summarise the above.
We confirmed that size-zero patches are possible, and intended, so if anything it was a behaviour I didn't expect rather than a bug. I think we can safely close this.
LocalSearch.neighbour() can generate a patch with no Edits in it, by deleting an edit when patch size == 1. Not sure if this is intentional or not!
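For context, a minimal sketch of the kind of neighbour operation that produces this behaviour (simplified and illustrative, not gin's actual LocalSearch code): when the patch holds exactly one edit and the delete move is chosen, the result is an empty patch, i.e. a re-evaluation of the unmodified program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class NeighbourSketch {
    interface Edit { }  // stand-in for gin's edit representation

    static List<Edit> neighbour(List<Edit> patch, Random rng) {
        List<Edit> copy = new ArrayList<>(patch);
        // Either delete a random existing edit or append a random new one.
        if (!copy.isEmpty() && rng.nextBoolean()) {
            copy.remove(rng.nextInt(copy.size()));  // size 1 -> empty patch, i.e. the original program
        } else {
            copy.add(randomEdit(rng));
        }
        return copy;
    }

    // Hypothetical: draw a random edit from whatever operators are available.
    static Edit randomEdit(Random rng) { throw new UnsupportedOperationException("stub"); }
}
```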