lichess-org / chess-openings

An aggregated data set of chess opening names
Creative Commons Zero v1.0 Universal

What makes a unique opening (variation)? #55

Closed SandroMartens closed 2 years ago

SandroMartens commented 2 years ago

I'm currently using this data set to analyze how openings transpose into each other. I realized there are some entries with the same name but different move orders, positions, and ECO codes (e.g. Queen's Pawn Game has five entries in A40, A41 and D00; the Caro-Kann has six entries in B10, B12 and B15). So if neither the name nor the ECO code is unique, what could you use to identify one variation?
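To make the observation concrete, here is a minimal sketch of how such repeated names could be listed. It assumes each entry is available as an (eco, name, pgn) tuple; the rows below are illustrative placeholders, not copied from the data set, and `names_with_multiple_entries` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative rows only (assumed shape: eco, name, pgn); not taken from the data set.
ROWS = [
    ("B10", "Caro-Kann Defense", "1. e4 c6"),
    ("B12", "Caro-Kann Defense", "1. e4 c6 2. d4 d5"),
    ("B15", "Caro-Kann Defense", "1. e4 c6 2. d4 d5 3. Nc3"),
    ("C20", "King's Pawn Game", "1. e4 e5"),
]


def names_with_multiple_entries(rows):
    """Map each name that occurs more than once to its (eco, pgn) entries."""
    by_name = defaultdict(list)
    for eco, name, pgn in rows:
        by_name[name].append((eco, pgn))
    return {name: entries for name, entries in by_name.items() if len(entries) > 1}


if __name__ == "__main__":
    for name, entries in names_with_multiple_entries(ROWS).items():
        ecos = sorted({eco for eco, _pgn in entries})
        print(f"{name}: {len(entries)} entries across {', '.join(ecos)}")
```

Grouping by name alone shows why neither the name nor the ECO code can serve as a key: one name can span several codes and several move sequences.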

Related: Wikipedia and Wikibooks refer to all 1. d4 d5 openings as Closed Game or Double Queen's Pawn Game, so maybe we could differentiate these from other 1. d4 positions?

niklasf commented 2 years ago

I believe the full name column should be unique for the shortest sequence of moves. The name up to the ":" and the ECO code are definitely not unique.

There are currently a few violations of that rule: https://github.com/lichess-org/chess-openings/actions/runs/2760626117. Let's see if we can fix them, or if some duplicates make sense.
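One possible reading of that rule, sketched below: for every full name, exactly one line should be the shortest. This is not the repository's actual check, which is not shown in this thread. The (eco, name, pgn) row shape, the sample rows, and the helper names `ply_count` and `ambiguous_shortest` are all assumptions for illustration.

```python
import re
from collections import defaultdict

# Illustrative rows only (assumed shape: eco, name, pgn); names are made up.
SAMPLE = [
    ("X00", "Example Opening: Line A", "1. d4 d5 2. c4"),
    ("X01", "Example Opening: Line A", "1. c4 d5 2. d4"),
    ("X02", "Example Opening: Line B", "1. e4 c6"),
    ("X03", "Example Opening: Line B", "1. e4 c6 2. d4 d5"),
]


def ply_count(pgn):
    """Count half-moves, ignoring move-number tokens such as '1.' or '1...'."""
    return len([tok for tok in pgn.split() if not re.fullmatch(r"\d+\.+", tok)])


def ambiguous_shortest(rows):
    """Return names whose shortest line is not unique, with the tied lines."""
    lines_by_name = defaultdict(set)
    for _eco, name, pgn in rows:
        lines_by_name[name].add(pgn)
    violations = {}
    for name, lines in lines_by_name.items():
        shortest = min(ply_count(p) for p in lines)
        tied = sorted(p for p in lines if ply_count(p) == shortest)
        if len(tied) > 1:
            violations[name] = tied
    return violations


if __name__ == "__main__":
    for name, tied in ambiguous_shortest(SAMPLE).items():
        print(f"warning: {name!r} has {len(tied)} equally short lines: {tied}")
```

In this sample, "Line A" is flagged because two different three-ply lines share the name, while "Line B" passes because its two-ply line is the only shortest one.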

allanjoseph98 commented 2 years ago

I think it's reasonable to require the name to be unique for the shortest sequence of moves, since there are only 10 violations. I've noticed that most problems stem from copying exotic, rare variations from chesstempo, for which it is extremely difficult to find other sources. I've made a study (https://lichess.org/study/N6fyQ5tj) to discuss these problems and additional problems I encountered while researching them. Every variation in a chapter PGN is a variation in the db. I've put sources as comments on every variation.

niklasf commented 2 years ago

Great work! I think we should follow all your recommendations, eliminating almost all issues.

If someone wants to help apply the suggestions to the CSV, that would be awesome. I'll get to it eventually if not.

allanjoseph98 commented 2 years ago

Will make changes and submit a PR

Dboingue commented 2 years ago

Could the result of dis-cerning and just plain -cerning this issue (still looking for the English of that French) be made into a curation standard one day? Or, more humbly, be put in the README for users, as opposed to contributors only?

I know open source code is supposed to be its only documentation (no tomatoes please, my screen has been wiped last month). And readme files are minimal crumbs for the contributors.. and in between, a curious savvy but not dev user would need to go back to school and become another person in some parallel dimension and come back here.. and figure how to read the code as documentation...

but perhaps a folder in the repository for the scientific or critical usage of the database per its ground assumptions of curation and definitions would help promote the chess content of it to more people, somewhere between dev background and only chess playing awareness....

I would expect programmers to need standards, since machines are really dumb and need things spelled out for automation, so reproducibility should in theory include interpretation of content: one should be able to reproduce the database from its input. Do open source code and open data have boundaries on the extent of reproducibility?

Sorry if this is too philosophical for an issue. But I find the title appropriate for my 2-cents.. and long vacillating patience with everything chess database (and past traditions).

I also suggest starting to think of a graph as the underlying consistent mathematical support, which might untangle a bunch of transposition issues (the coding technology exists; go to virustotal for an example). The tree representation as immutable ground support for such a database of chess openings is a constraint from the past; it should not justify its own perpetuation.
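As a rough illustration of the graph idea, the sketch below merges lines that reach the same position into a single node, so a transposition shows up as a node with more than one predecessor. Everything here is an assumption for illustration. In particular, `crude_key` is only a stand-in for real position identity: it compares the sets of moves each side has played, which is wrong in general (captures, repeated moves, castling and en passant rights), so a real implementation would derive the key from an actual board state.

```python
from collections import defaultdict


def moves(pgn):
    """Split a PGN move text into half-moves, dropping move numbers like '1.'."""
    return [tok for tok in pgn.split() if not tok.rstrip(".").isdigit()]


def crude_key(played):
    """Stand-in position key (an assumption): the sets of white and black moves."""
    return (frozenset(played[0::2]), frozenset(played[1::2]))


def build_graph(lines, key=crude_key):
    """Return (edges, parents) for a position graph built from move lines."""
    edges = defaultdict(set)    # position -> {(move, next position)}
    parents = defaultdict(set)  # position -> {predecessor positions}
    for pgn in lines:
        played = moves(pgn)
        for i, move in enumerate(played):
            src, dst = key(played[:i]), key(played[: i + 1])
            edges[src].add((move, dst))
            parents[dst].add(src)
    return edges, parents


def transposition_points(parents):
    """Positions reachable from more than one distinct predecessor."""
    return [pos for pos, preds in parents.items() if len(preds) > 1]


if __name__ == "__main__":
    _edges, parents = build_graph(["1. d4 d5 2. c4", "1. c4 d5 2. d4"])
    for white, black in transposition_points(parents):
        print("reached by several move orders:", sorted(white), sorted(black))
```

With a tree, the two lines above would end in two separate leaves; in the graph they converge on one node, which is the kind of untangling the comment is hinting at.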

Edit: this is as much an issue about any opening "theory" data nomenclature as ECO and previous addressing system for known or named opening sequences, as it is for this repo.... just that the title is kind of hitting the right nail..

niklasf commented 2 years ago

Sure. I'll add a note on the README when this is done. All violations are already automatically detected, currently as warnings, and in the future as errors, blocking changes breaking the rules.

niklasf commented 2 years ago

All warnings resolved. Thanks @allanjoseph98!

@ornicar Going forward it's valid to assume that all openings have a unique shortest variation.