hanabi1224 / Programming-Language-Benchmarks

Yet another implementation of computer language benchmarks game
https://programming-language-benchmarks.vercel.app/
MIT License

Will programs be required to use the same algorithm as other programs? #219

Open JZerf opened 2 years ago

JZerf commented 2 years ago

One of the requirements for the Computer Language Benchmarks Game is that programs are supposed to be using the same (rough) algorithm. Are these benchmarks going to have a similar requirement?

This is a noble goal, but in practice it seems to be pretty hard to enforce. There have been many occasions on the Computer Language Benchmarks Game where a program was accepted, only for it to be found much later that the program was actually using a considerably different algorithm, requiring it (and possibly other programs based on it) to be removed. People need to manually check the programs to see whether they use different algorithms; this is hard and time consuming, so it isn't done as much as we would like.

Additionally, some programming languages, such as functional languages like Haskell, can require considerably different algorithms to solve problems. There are also grey areas where it isn't so clear whether something counts as a different algorithm. This might include things like rewriting algorithms to use multithreading or SIMD.

Because of these issues, I personally would suggest giving program authors flexibility with the algorithms they use. Over the long run, people will determine optimal algorithms to solve the problems and those optimal algorithms will be adopted by most of the programs anyway.

hanabi1224 commented 2 years ago

This might include things like rewriting algorithms to use multithreading or SIMD.

I'm also thinking about this. I would not ban them but probably mark these on the website in some notable way.

I personally would suggest giving program authors flexibility with the algorithms they use.

It depends on whether a problem is well designed for all langs in terms of what it tends to measure. Since different langs may have different capabilities, this may result in comparing apples to oranges with no possible way to fix it.

bpecsek commented 2 years ago

This might include things like rewriting algorithms to use multithreading or SIMD.

I'm also thinking about this. I would not ban them but probably mark these on the website in some notable way.

Please note that languages doing auto-vectorization might use SIMD in the assembly/executable that cannot be seen in the original code at all. It would be difficult to track without actually looking at the generated assembly code, and a compiler might add auto-vectorization at any time.
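One way to check this after the fact is to disassemble the compiled binary and scan for vector instructions. Below is a rough sketch of that idea, assuming an x86-64 Linux binary and the `objdump` tool; the instruction patterns are an illustrative heuristic, not a complete list:

```python
import re
import subprocess
import sys

# Heuristic: packed SSE/AVX math instructions, FMA, or use of the wide
# ymm/zmm registers. (Plain xmm registers are skipped because x86-64 also
# uses them for ordinary scalar floating point.)
SIMD_PATTERN = re.compile(r"\bv?(?:add|sub|mul|div)p[sd]\b|\bvfmadd\w*|%[yz]mm\d+")

def uses_simd(binary_path: str) -> bool:
    """Disassemble a binary with objdump and scan for SIMD mnemonics."""
    asm = subprocess.run(
        ["objdump", "-d", binary_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(SIMD_PATTERN.search(asm))

if __name__ == "__main__":
    print("SIMD suspected" if uses_simd(sys.argv[1]) else "no SIMD found")
```

A check like this could run in CI to flag, rather than ban, auto-vectorized builds, matching the idea of marking them on the website in some notable way.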

I personally would suggest giving program authors flexibility with the algorithms they use.

It depends on whether a problem is well designed for all langs in terms of what it tends to measure. Since different langs may have different capabilities, this may result in comparing apples to oranges with no possible way to fix it.

Code flexibility should be vital, in my opinion. We can even learn from each other and amend our code accordingly when it benefits a specific language.

HenryJk commented 2 years ago

It is possible to allow any algorithm and even make the code validation process almost automatic. However, a stronger test needs to be created, since many problems can easily be cheated if "anything and everything" is allowed.

For example, fannkuch-redux currently only validates via a checksum and max_flips, both of which can easily be pre-computed since the input size is quite small. A developer could then submit only a lookup table and it would still pass the automated test.
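To make the concern concrete, a cheating submission could look something like the following hypothetical sketch; the table values are placeholders, not real results:

```python
import sys

# Hypothetical cheat: the answers for every input size the benchmark uses
# are precomputed offline, so no permutation flipping happens at run time.
PRECOMPUTED = {
    10: ("<checksum>", "<max_flips>"),  # placeholder values
    11: ("<checksum>", "<max_flips>"),
    12: ("<checksum>", "<max_flips>"),
}

n = int(sys.argv[1])
checksum, max_flips = PRECOMPUTED[n]
print(checksum)
print(f"Pfannkuchen({n}) = {max_flips}")
```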

Since this project is still quite young, I think it is better to enforce this earlier rather than later, before porting code submissions starts becoming a bigger issue.

JZerf commented 2 years ago

It is possible to allow any algorithm and even make the code validation process almost automatic. However, a stronger test needs to be created, since many problems can easily be cheated if "anything and everything" is allowed.

That's a good point and probably something I should have brought up in the original issue.

I think it would be a good idea to try to design the problems to be more resilient to some obvious speedups/cheating. Some of the current problems, like fasta and mandelbrot, could have small tweaks made to them that make them more resilient to certain algorithmic shortcuts, and it would probably be worthwhile to do; see my footnote for some examples [^1].

I didn't mention it in my original issue, but there would still need to be at least a few rules, such as not using different programming languages (things like inline assembly) and not just including a lookup table to look up results, like you mentioned. These are things that we would still have to be on the lookout for, but they should be fairly obvious and easy to find. I think this would be easier to deal with than the alternative of:

[^1]: The sequences generated by the fasta problem are repetitive and could simply be copied. Increasing the value of the IM constant would resolve that issue by making the sequences less repetitive. The mandelbrot problem generates an image that is symmetrical across the x-axis, which could allow one half of the image to just be copied and flipped. Changing a couple of the values in the mandelbrot program so that it generates an image of a different, non-symmetrical part of the fractal would resolve that issue.
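For context on the IM suggestion: fasta's pseudo-random generator is a linear congruential generator whose sequence repeats with a period of at most IM, which is why increasing IM makes the output less repetitive. A minimal sketch of the generator, using the constants from the CLBG specification:

```python
# Constants from the CLBG fasta specification.
IM = 139968
IA = 3877
IC = 29573

seed = 42

def gen_random(max_value: float) -> float:
    """Linear congruential generator; the sequence of seeds repeats with a
    period of at most IM, so a larger IM means a less repetitive sequence."""
    global seed
    seed = (seed * IA + IC) % IM
    return max_value * seed / IM
```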

HenryJk commented 2 years ago

I didn't mention it in my original issue but there would still need to be at least a few rules such as not using different programming languages (things like inline assembly) and not just including a lookup table to lookup results like you mentioned.

I think making an effort towards fully automated or near-automated testing is viable to reduce the maintenance burden.

IMO, we can design the problems so that a lookup table doesn't matter. I'm only familiar with fannkuch-redux and nbody, so I'll give those two as examples.

For fannkuch-redux, we can ask the contributor to simply compute and print the number of flips for a single permutation of 10000 numbers. However, the permutation is randomly chosen and given to the program via an argument or stdin. During the test, there is a reference implementation, simple but correct, to compare against. There is going to be some overhead due to I/O and string-to-int conversion, but the majority of the program's time is still spent flipping numbers.
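A minimal sketch of what such a submission's core could look like, assuming (as an illustration) that the permutation arrives as whitespace-separated integers on stdin:

```python
import sys

def count_flips(perm: list[int]) -> int:
    """Repeatedly reverse the first k elements, where k is the current
    first element, until a 1 is at the front; return the reversal count."""
    flips = 0
    while perm[0] != 1:
        k = perm[0]
        perm[:k] = reversed(perm[:k])
        flips += 1
    return flips

if __name__ == "__main__":
    perm = [int(tok) for tok in sys.stdin.read().split()]
    print(count_flips(perm))
```

The test harness would generate a random permutation, feed it to both the submission and the reference implementation, and compare the two printed counts.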

For nbody, arbitrary starting positions, velocities, and a number of steps would be given to the program via an argument or stdin. Again, the input would be randomly generated, and the program would have to print not just the total energy but also the final positions of the bodies.
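A sketch of a submission under that scheme, assuming a made-up input format of `n_steps dt` on the first line followed by one `mass x y z vx vy vz` line per body; the integrator mirrors the symplectic Euler scheme of the existing nbody benchmark:

```python
import sys

def advance(mass, pos, vel, dt):
    """One step of the symplectic Euler integrator used by the nbody task."""
    n = len(mass)
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            dist2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            mag = dt / (dist2 ** 1.5)
            for k in range(3):
                vel[i][k] -= d[k] * mass[j] * mag
                vel[j][k] += d[k] * mass[i] * mag
    for i in range(n):
        for k in range(3):
            pos[i][k] += dt * vel[i][k]

def energy(mass, pos, vel):
    """Total energy: kinetic terms plus pairwise gravitational potential."""
    e = 0.0
    for i in range(len(mass)):
        e += 0.5 * mass[i] * sum(v * v for v in vel[i])
        for j in range(i + 1, len(mass)):
            dist = sum((pos[i][k] - pos[j][k]) ** 2 for k in range(3)) ** 0.5
            e -= mass[i] * mass[j] / dist
    return e

if __name__ == "__main__":
    # Hypothetical format: "n_steps dt", then "mass x y z vx vy vz" per body.
    header = sys.stdin.readline().split()
    n_steps, dt = int(header[0]), float(header[1])
    mass, pos, vel = [], [], []
    for line in sys.stdin:
        vals = [float(tok) for tok in line.split()]
        if vals:
            mass.append(vals[0])
            pos.append(vals[1:4])
            vel.append(vals[4:7])
    for _ in range(n_steps):
        advance(mass, pos, vel, dt)
    print(f"{energy(mass, pos, vel):.9f}")
    for p in pos:
        print(f"{p[0]:.9f} {p[1]:.9f} {p[2]:.9f}")
```

Since floating-point results vary slightly across languages, the checker would presumably compare the printed energy and positions against the reference within a tolerance rather than with an exact diff.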

Those are my ideas to "defeat" lookup tables.

For inline assembly, it is possible to use compiler flags to disable it or to ban the "asm" keyword entirely from contributed source code.
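A naive sketch of such a source check; the file extensions and keyword list are assumptions, and a plain keyword scan can be evaded (or tripped by `asm` appearing in a comment), so it is a first line of defense rather than proof:

```python
import pathlib
import re
import sys

# Tokens that suggest inline assembly in C/C++/Rust sources (illustrative list).
BANNED = re.compile(r"\b(?:asm|__asm__)\b")
EXTENSIONS = {".c", ".h", ".cpp", ".hpp", ".rs"}

def scan(root: str) -> list[pathlib.Path]:
    """Return every source file under root that contains a banned token."""
    return [
        path
        for path in pathlib.Path(root).rglob("*")
        if path.is_file()
        and path.suffix in EXTENSIONS
        and BANNED.search(path.read_text(errors="ignore"))
    ]

if __name__ == "__main__":
    offenders = scan(sys.argv[1])
    for path in offenders:
        print(f"possible inline assembly: {path}")
    sys.exit(1 if offenders else 0)
```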

bpecsek commented 2 years ago

Hi Guys,

However, before resolving all the potential issues meant to stop programmers/contributors from cheating, how can the site maintainer be stopped from doing what he's been doing with me and the code I have contributed?

How can it be guaranteed that valid code submitted in a PR, meeting all the rules, is actually benchmarked, rather than much slower code picked/modified by the maintainer being listed?

Just look at https://github.com/hanabi1224/Programming-Language-Benchmarks/pull/228, just sitting there with much faster Common Lisp, Rust, C++, and C codes, untouched for a month (the original, https://github.com/hanabi1224/Programming-Language-Benchmarks/pull/221, was closed by me and reissued).

Please, anyone who has not done so yet, read through it, form an opinion, and try to help resolve it. A particular compiler and a contributor were accused of doing something without the slightest proof, just because the code ran too fast compared to another one and because another two languages were doing something not allowed by the rules. Those facts were enough to disqualify and remove perfectly valid code, and then the contributor was basically banned from the site for being right from the beginning. I have not received a single reply to any of my messages for a month, not even a straight yes or no, and none of my commits have been accepted.

Sorry for sounding rather angry, frustrated, and infuriated, but that is because I actually am.

Kindest Regards, Bela


On Jan 28, 2022, at 1:33, Jeremy Zerfas wrote:

having programs being wrongly disqualified which happened in #144

HenryJk commented 2 years ago

It seems like the root of your problem is the binary-trees allocation issue and/or SBCL? Maybe remove the controversial parts from your PR and re-open with only the uncontroversial parts?

But your problem is adjacent to what was discussed in this thread, which is the vague criteria for what programs should be accepted. I'm personally fine with the competitive-programming approach: input file goes in, output file comes out; if diff says OK, then your program is accepted. Stdlib only, no external libraries, no inline asm. Other than that, do anything you like, even things like lookup tables and whatnot.
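That acceptance rule is straightforward to automate; a minimal sketch of such a check, with the command and file paths purely illustrative:

```python
import subprocess
import sys

def validate(cmd: list[str], input_path: str, expected_path: str) -> bool:
    """Run a candidate program on an input file and byte-compare its stdout
    against the expected output file, i.e. an automated `diff`."""
    with open(input_path, "rb") as stdin_file:
        result = subprocess.run(cmd, stdin=stdin_file, capture_output=True)
    with open(expected_path, "rb") as expected_file:
        expected = expected_file.read()
    return result.returncode == 0 and result.stdout == expected

if __name__ == "__main__":
    # e.g. python validate.py ./program input.txt expected_output.txt
    ok = validate([sys.argv[1]], sys.argv[2], sys.argv[3])
    print("accepted" if ok else "rejected")
    sys.exit(0 if ok else 1)
```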

bpecsek commented 2 years ago

Hi Henry,

You must not have read the thread thoroughly.

There has been no allocation issue in any of the code in my PR, as has been proven by multiple sources and multiple methods. The site maintainer accused SBCL of foul play on the basis that it was too fast, without the slightest actual proof, and when he was proven wrong he went silent. No reply to any of my questions or requests has been given for a good month. Not a single word. Not even a yes or no.

Please read Jeremy's messages above ("having programs being wrongly disqualified which happened in #144") and his messages in https://github.com/hanabi1224/Programming-Language-Benchmarks/issues/144 proving SBCL's correctness, as well as the SBCL language maintainer's multiple confirmations, these profiling reports https://github.com/hanabi1224/Programming-Language-Benchmarks/issues/223, etc.

Please feel free to check any of the code here.

hanabi1224 eventually did accept Jeremy's proof and relisted one old binary-trees program and one of his own that is very slow, and nothing else. All the much faster yet perfectly correct ones are sitting in my PR, unaccepted.

The only one that was violating the rules (the old 1.cl) was flagged by me, and I asked for it to be removed.

I reiterate that all the code in my PR plays according to the set rules, including the binary-trees ones, the merkle-trees ones, the nsieve ones, and all the others.

I must say that this problem is not at all "adjacent to what was discussed in this thread", since ultimately the site maintainer sets the rules and controls this site, as this case clearly proves.

To further prove my point, a very fast nsieve program using a different algorithm was also removed. A CL version and the equivalent C code are idling in my PR as well.

Therefore, to make this site a success, both sides - contributors and maintainers alike - must play fairly, abiding by the clearly preset and communicated rules.

Kindest Regards, Bela

bpecsek commented 2 years ago

I am not getting any reply from @hanabi1224, so I am closing my PR. As a last resort I might do what you recommended, @HenryJk.

bpecsek commented 2 years ago

Dear @HenryJk, as you recommended, I am going to send a new PR later today with all the latest CL codes as my last attempt. All the file names are updated to the latest naming convention. I am keeping only the two updated binary-trees codes that are currently listed, and I am not including the much faster cons-based codes for benchmarking, though all the updated files are there for reference and analysis if anyone wants to see them. I am not sending any of the C, C++, or Rust codes. If the site maintainer wants to include them, they are there in the closed PR; he is free to do anything with them.

bpecsek commented 2 years ago

Hi @HenryJk,

As you can see, I did what you recommended.

I closed my old PR and issued a new one 4 days ago; however, nothing has changed whatsoever as far as the site maintainer's (@hanabi1224) behavior is concerned.

He is neither accepting my PR nor replying to me at all, and he keeps benchmarking and listing the old and much slower binary-trees and nsieve Common Lisp codes.

Therefore, I'd like to say goodbye to you and wish you a better experience here. (Just accept whatever the site maintainer says or does and you'll be fine.)

I shall write a detailed statement about this shameful case, with all the relevant links/references, and send it everywhere I have promoted this site in the past, so that everyone can see what is going on here and form an opinion before deciding to contribute to or promote this site.

As a reference, I shall also share the link to my Vercel site, generated from my fork, where the benchmark results for those much faster but unaccepted codes are also shown.

Good luck to you and happy coding.

Kindest Regards, Bela

igouy commented 2 years ago

@JZerf Over the long run, people will determine optimal algorithms to solve the problems and those optimal algorithms will be adopted by most of the programs anyway.

With respect, the benchmarks game shows that only 3 or 4 language implementations continue the endless spiral of optimization — most of the language implementations have been left-behind (and "the long run" is already over a decade).

ssvb commented 2 years ago

@igouy Something similar is also happening in programming competitions. Only a few of the most popular/pragmatic programming languages are actively used by contest participants, and C++ is more popular than everything else combined.

For example, we can look at the programming language usage statistics from the recent AtCoder Beginner Contest 238:

For comparison, these are a few examples of unpopular programming languages:

Is one of the goals of the Programming Language Benchmarks Game project to promote diversity and showcase a much wider variety of programming languages? I like the idea of giving the underdog programming languages a chance.

igouy commented 2 years ago

one of the goals

@hanabi1224 ?

A decade ago, one of the goals of the benchmarks game was — To show working programs written in less familiar programming languages.

And that became too much.

Alternatively, shrink the number of tasks to one and shrink the size of the task, and increase the number of different language implementations.

JZerf commented 2 years ago

@JZerf Over the long run, people will determine optimal algorithms to solve the problems and those optimal algorithms will be adopted by most of the programs anyway.

With respect, the benchmarks game shows that only 3 or 4 language implementations continue the endless spiral of optimization — most of the language implementations have been left-behind (and "the long run" is already over a decade).

I see your point, but I don't think things are quite as bad as you say, and I think with some changes they could be made better. The more popular languages generally see more new contributions, so those languages will seem to perform better until other languages catch up. This process goes more slowly on the CLBG than one would like, but it still happens. The first multi-threaded fasta program was added to the CLBG in June 2014; about a year later there were five programming languages with multi-threaded programs, and after about two and a half years that grew to nine. There's also the time when someone made a regex-dna program that broke the rules, and over time the same thing got copied to over two dozen other programs that all had to be updated/removed, which again shows that this process does happen.

There are some things that could be done to help close the performance gap between programs for popular languages, which are updated frequently, and programs for less popular languages, which aren't updated as quickly. You can try slowing down the rate at which programs for popular languages improve by forcing programs to use the same algorithms, try to prevent the use of multi-threading, try to prevent the use of SIMD, etc., but I don't think this is a good approach. Forcing programs to use the same algorithm has a high maintenance burden, it might not be clear whether something is using a different algorithm, and it may be unfair to some languages (like functional languages) that need to do things in a different way. Trying to prevent the use of multi-threading and SIMD ignores the reality that CPU cores are no longer increasing in frequency/performance the way they used to, so CPU makers have been adding more cores and SIMD to compensate.

I think it would be better to try to increase the rate at which programs for less popular languages improve. This project's use of pull requests, continuous integration/continuous delivery (CI/CD), and running the benchmarks on a more readily available system is an improvement over the CLBG and will help make it easier and more likely for people to contribute. As long as somebody makes a correctly functioning pull request that doesn't violate any of the project's rules, it shouldn't take much effort for it to be quickly accepted.

That process could go even faster and require less effort if the project's rules were made simpler, which is why I think it would be a good idea to give authors flexibility with algorithms and to avoid heavy use of libraries; both would reduce the time/effort needed for reviewing programs. I also think that having fewer rules would encourage more contributions. There are quite a few occasions where someone ended up breaking one of the project's rules, so their program either wasn't accepted or was removed later, and I imagine some people get frustrated by that and stop contributing. From personal experience I also know that it can be annoying trying to follow the rules: instead of using whatever techniques/algorithms I want, I often spend time reviewing multiple programs to see if any are already doing the same thing, so that I have some justification for doing it too. It can also be frustrating not to be allowed to do something because of a rule that seems kind of dumb.

Finally, one other thing that could help get people to contribute more is to better show that some of the programs might be old/obsolete. It might be a good idea to keep track of contribution dates and include a column on the results pages showing how old programs are; if people see that a program is old and possibly obsolete, they might be more likely to review it and contribute a better one.

igouy commented 2 years ago

Happily, this isn't an argument I need to win; I'll just provide a counterpoint.

@JZerf … after about two and a half years…

If that was — after about two and a half months — we might agree.

@JZerf You seem not to be concerned with the experience of someone looking at measurements, trying to understand whether performance differences were mostly due to the different language implementations or mostly due to the skill and effort of the programmer.

hanabi1224 commented 2 years ago

after about two and a half years

This may also mean that introducing a new algorithm for a particular language will make its numbers incomparable to others until they catch up in two and a half years lol

You can try slowing down the rate at which programs for popular languages improve by forcing programs to use the same algorithms, try to prevent the use of multi-threading, try to prevent the use of SIMD, etc...

Sorry but these cleanups are actually what I was doing, slowly. My realistic expectation is that most of the programs will still be maintained by myself. I believe most contributions still go to the CLBG, which is way more popular; this is just a fork maintained by myself that only shows my selection of programs, which makes it much easier for me to simply remove the parallelized programs and those using different algorithms.

JZerf commented 2 years ago

This may also mean that introducing a new algorithm for a particular language will make its numbers incomparable to others until they catch up in two and a half years lol

It wouldn't be great to compare programs that use better algorithms against programs using worse ones, and it would take a while for most programs to adopt the better algorithms, but for the most part there is a limit to how much this can happen. Generally, for any problem there is a limit to how optimal an algorithm can be. Once that optimal algorithm is determined and most programs have had time to adopt it, the problem you mention will no longer exist and will never recur.

The alternative is to have people carefully review...every...single...program to make sure it is using the same algorithm, and also to deal with cases where it isn't clear whether something counts as a different algorithm. This requires quite a bit of effort, and I have my doubts that it will be done well. I think that extra effort makes this project a bit less viable and might eventually contribute to the reviewers burning out (somewhat similarly, I wonder if Isaac gets tired of reviewing and rejecting binary-trees programs over and over again). Based on past experience, I also imagine that these code reviews will miss many cases where a program actually uses a better algorithm that isn't recognized, so programs using better algorithms will end up being compared against programs using worse algorithms anyway. To make things worse, this can quite possibly happen repeatedly.

You can try slowing down the rate at which programs for popular languages improve by forcing programs to use the same algorithms, ...

Sorry but these cleanups are actually what I was doing, slowly.

It seems like this answers my original question about whether programs will be required to use the same algorithm as other programs: the stance of this project is that programs WILL be required to use the same algorithms. I've already explained why I don't think this is a good idea, and I don't think I have anything else to add. @hanabi1224: if you don't have anything else to add, I think it's OK for you to close this issue now.

igouy commented 2 years ago

@JZerf > I wonder if Isaac gets tired of reviewing and rejecting binary-trees programs over and over again

There's something about this that I think is worth mentioning — with this kind of project, how do you know when to call the project finished?

@JZerf > Once that optimal algorithm is determined and most programs have had time to adopt that algorithm, then the problem you mentioned will no longer exist and will never reoccur.

No, not until each of the language implementations actually has programs that use that optimal algorithm.

Until then, a programmer coding contest.