project-slippi / Ishiiruka

GNU General Public License v2.0

Generating SLP files should include a unique ID #71

Open vinceau opened 4 years ago

vinceau commented 4 years ago

Whenever Dolphin creates a new SLP file, the SLP file should have a unique ID. As far as I can tell, IDs are not included in SLP files (please correct me if I'm wrong, and close and disregard this issue if that's the case).

Right now, the only ways to uniquely identify an SLP file are its filename (which is not a good way to check for duplicate SLP files) or the actual parsed JSON contents. Having a unique key for each SLP file would be much better than this and would help pave the way for a cache of sorts for the Desktop app, among other potentially useful applications.

We could naively generate some random base64-encoded string as an ID. However, two Dolphins playing the same game together online would then generate two different IDs, which is not ideal. Instead, we could use a seeded random, seeded for example with the time, stage, player characters, and player tags if specified, or use a hash of those things to generate the ID. This would allow the two Dolphins to generate the same ID.

NikhilNarayana commented 4 years ago

I guess the host could decide on the ID before the game and pass it onto the client. Is there a good case for this though? Trying to think what this would be useful for.

vinceau commented 4 years ago

Being able to identify an SLP by an ID has many uses but a canonical example is caching in the desktop app (or even Clippi). If we process an SLP file and store some data associated with it, we need to be able to associate the data with the SLP file. We can't use the filename as the unique ID because the filename can change and filenames can have conflicts.

I don't think the host code would be unique enough, and it also only exists in netplay, not on console. Maybe the unique ID could be something like: hash(timestamp, hostcode (if netplay), stage, player tags (if applicable), character ids and colors), and we could probably store the ID in the metadata.
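
For illustration, a minimal Python sketch of that `hash(...)` idea. The function name, the field names (`port`, `character_id`, `color`, `tag`) and the JSON canonicalization are assumptions made up for this example, not anything defined by the SLP format, and it only works if both sides agree on every input value, including the start time.

```python
import hashlib
import json


def game_id(start_time, stage_id, players, host_code=None):
    """Derive a deterministic hex ID from attributes both sides should share.

    All parameter and field names here are hypothetical, chosen for the sketch.
    `players` is a list of dicts with "port", "character_id", "color", "tag".
    """
    preimage = {
        "start_time": start_time,
        "host_code": host_code,  # None when not playing netplay
        "stage_id": stage_id,
        # Sort by port so the order players are listed in does not matter.
        "players": sorted(players, key=lambda p: p["port"]),
    }
    # Canonical serialization: sorted keys and fixed separators, so equal
    # inputs always produce byte-identical pre-images.
    canonical = json.dumps(preimage, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Serializing to one canonical byte string before hashing is the important part: any difference in key order or whitespace between the two Dolphins would otherwise yield different IDs for the same game.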

ehandal commented 4 years ago

Would it be possible to use the hash of the whole .slp for caching schemes?
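
A whole-file hash could look roughly like the Python sketch below (the function name and chunk size are arbitrary choices for the example). It identifies a particular file rather than a particular game, so two recordings of the same game that differ in any byte would hash differently.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object,
        # so large replays are never loaded into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```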

vinceau commented 4 years ago

> Would it be possible to use the hash of the whole .slp for caching schemes?

I don't think it's necessary to hash the entire SLP file. I think just hashing the previously mentioned attributes would be sufficient for ensuring uniqueness. Possibly including the final frame number in the pre-image may also be worth considering.

NikhilNarayana commented 4 years ago

I like vince's idea or maybe letting the host (decider) generate a UUID.

eigenform commented 4 years ago

My 2 cents: generating some UUID independent of the actual data seems fine, but only if whatever caching you're doing doesn't totally rely on being able to have a perfect, unambiguous description of a particular game of Melee. It'd be trivial for users to just write over UUIDs in the metadata and turn one game into another.

vinceau commented 4 years ago

> My 2 cents: generating some UUID independent of the actual data seems fine, but only if whatever caching you're doing doesn't totally rely on being able to have a perfect, unambiguous description of a particular game of Melee. It'd be trivial for users to just write over UUIDs in the metadata and turn one game into another.

I think there's something nice about having the unique ID represent the game itself rather than the SLP file, such that two different Dolphins playing the same game would generate the same ID in a deterministic way (perhaps a SHA-2 hash of a set of attributes). This is really useful if you were, say, to build a database where you store user replay files. Two players would upload two different SLP files of the same game, and if you wanted to store only a single copy, having both files share the same ID would assist with deduplication.
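
As a rough sketch of that deduplication step in Python, assuming each upload already arrives paired with its game ID (the function name and the `(game_id, payload)` shape are made up for the example):

```python
def deduplicate(uploads):
    """Keep only the first upload seen for each game ID.

    `uploads` is an iterable of (game_id, payload) pairs; the shape is an
    assumption for this sketch, not an existing API.
    """
    stored = {}
    for gid, payload in uploads:
        # setdefault only stores the payload if this ID has not been seen yet.
        stored.setdefault(gid, payload)
    return stored
```

With a per-game ID, the second player's upload of the same game simply hits an existing key and is dropped.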

I doubt users would have any need to modify a game's ID though, but if they did, making the unique ID based on attributes such as start time, characters, end frame etc, would as a byproduct allow the ID to also act as sort of a checksum for game data validity.

In a distant future, rather than referring to a particular game as "Game 4 from the Losers quarters set between Mango and aMSa at The Big House 12" you could potentially refer to the game by the ID like "Game 7b0143b" or if we had human-readable ID mappings, you could refer to them as like "GiantPinkFlamingo" etc. Using the hash of said data, you would have a guaranteed way of ensuring a unique ID will only ever refer to that particular game.

CharlesAMiller commented 4 years ago

There's been some discussion of this issue on Discord. Attached are some excerpts that outline some concerns and considerations.

[Attachments: four images of Discord excerpts (hash1, hash2, hash3, hash4)]

cnkeats commented 2 years ago

I wanted to bring some attention back to this issue since I have an interest in using UUIDs for a project I am working on that involves uploading replays for storage and analysis.

I agree with @eigenform in that the ID does not need to be a representation of the entire game's data. This would mean that in the event of a netplay desync, two files will contain info about two "different" games. I actually view this as a positive since it lets you match up desyncs easily, so I think this is the way to go.

The biggest problem that I see is having a guaranteed way to generate UUIDs across different platforms - console, local dolphin, and slippi netplay.

There are some immutable things about games that we can take advantage of in creating a hash, namely the entire Game Start block. This will cover the vast majority of games since not only will the block differ in several places for most games, it also includes the Random Seed for the game, which is a uint32. If a hardware ID that differs for each console and computer can be accessed on each of the platforms, adding it to the hash generation would make the hashes never clash.

Using the start block to generate the hash also has the advantage in that it can be generated at the very start of the game; no need for post-game analysis of the file or inserting into metadata.
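
A small Python sketch of what that could look like, assuming the raw Game Start block has already been extracted as bytes. The function name and the `hardware_id` parameter are hypothetical, and how a hardware ID would actually be obtained on each platform is left open.

```python
import hashlib


def game_start_id(game_start_block: bytes, hardware_id: bytes = b"") -> str:
    """Hash the raw Game Start block, optionally mixed with a hardware ID.

    Both inputs are assumed to be available as bytes; this sketch does not
    parse the SLP file itself.
    """
    digest = hashlib.sha256()
    for field in (game_start_block, hardware_id):
        # Length-prefix each field so that shifting bytes from one field to
        # the other always changes the hashed input.
        digest.update(len(field).to_bytes(4, "big"))
        digest.update(field)
    return digest.hexdigest()
```

One design consequence: mixing in a per-machine hardware ID means the two sides of a netplay game produce different IDs, which trades off against the deduplication goal raised earlier in the thread.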

JLaferri commented 2 years ago

> I wanted to bring some attention back to this issue since I have an interest in using UUIDs for a project I am working on that involves uploading replays for storage and analysis.
>
> I agree with @eigenform in that the ID does not need to be a representation of the entire game's data. This would mean that in the event of a netplay desync, two files will contain info about two "different" games. I actually view this as a positive since it lets you match up desyncs easily, so I think this is the way to go.
>
> The biggest problem that I see is having a guaranteed way to generate UUIDs across different platforms - console, local dolphin, and slippi netplay.
>
> There are some immutable things about games that we can take advantage of in creating a hash, namely the entire Game Start block. This will cover the vast majority of games since not only will the block differ in several places for most games, it also includes the Random Seed for the game, which is a uint32. If a hardware ID that differs for each console and computer can be accessed on each of the platforms, adding it to the hash generation would make the hashes never clash.
>
> Using the start block to generate the hash also has the advantage in that it can be generated at the very start of the game; no need for post-game analysis of the file or inserting into metadata.

You are bound to get a duplicate eventually doing this.

blasphemetheus commented 2 years ago

TLDR: Thoughts on the utility of having unique IDs meant to identify games

Ok, so correct me if I'm wrong, but there's nothing guaranteeing that the start block won't duplicate. The Random Seed will also duplicate periodically, so if both happen on the same hardware IDs, then a duplicate GameStartID will appear.

The trick is to find what will never duplicate under normal conditions, not to add a bunch of things that probably won't. Not sure what that could be. @cnkeats's thought is that the differences between local Dolphin, console, and Slippi netplay are a source of complications. Could you explain why? Couldn't you just include the platform as a prefix in the ID, or use it to make the ID if it's a hash?

Having a unique id available seems intuitively useful:

* Hell, you could train agent-bots that attempt to find desyncs if you reduce identifying desyncs to a function. IDs seem like a prerequisite to doing that.
  * Or skip to the last frame from the GameStartID, look for a GameEndID, and compare those IDs: if they're not the same, start the raw data comparison, and if they are the same in both games, assume there were no desyncs? Not clear to me if desyncs ever happen and then randomly resync.
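
Reduced to a function, that check might look like the Python sketch below. `start_id` and `end_id` are hypothetical fields standing in for the GameStartID and GameEndID ideas; nothing like them exists in SLP files today as far as this thread establishes.

```python
def needs_raw_comparison(replay_a, replay_b):
    """Decide whether two recordings of one game need a raw data comparison.

    Each replay is a dict with hypothetical "start_id" and "end_id" keys.
    """
    if replay_a["start_id"] != replay_b["start_id"]:
        # Different start IDs: these are not recordings of the same game.
        raise ValueError("replays are not from the same game")
    # Same start but different end: a possible desync worth inspecting.
    return replay_a["end_id"] != replay_b["end_id"]
```

Matching end IDs would only suggest the recordings finished in the same state; a desync that later resynced would not be caught by this check.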

The convention to look at LastFrame and pull information useful for stats out of there seems roughshod, but it works, so shrug. I'm not sure GameEndID is necessary tbh, but if the LastFrame peeking convention is ever formalized into something else, that idea might go along with having a GameEndID.

beaudrychase commented 11 months ago

It seems that there are irreconcilable requirements for this proposed ID that make it impossible to exist. As specified, it needs to fulfill the following:

  1. IDs must encode significantly less information than their source, i.e., they are fixed-length hashes.
  2. IDs must be unique.

These specs cannot both be satisfied, by basic math/information theory. For the IDs to be unique you need a one-to-one (injective) mapping between slippi files and IDs; otherwise you'd have two IDs for a single slippi file or two slippi files sharing the same ID. Injective functions have the property that they can be undone, so in this case we would need to be able to recover a slippi file from the generated ID, and this is impossible unless the ID carries enough information to do so.
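
The counting argument can be seen in a toy Python sketch that truncates SHA-256 to a deliberately tiny 2-byte ID (the function names and the truncation are choices made for the demonstration only). With 65,536 possible IDs, any 65,537 distinct inputs must contain two that share an ID.

```python
import hashlib


def short_id(data: bytes, id_bytes: int = 2) -> bytes:
    """Toy fixed-length ID: the first id_bytes bytes of SHA-256."""
    return hashlib.sha256(data).digest()[:id_bytes]


def find_collision(id_bytes: int = 2):
    """Return two distinct inputs that share the same toy ID, plus the ID."""
    seen = {}
    # One more input than there are possible IDs, so a repeat is guaranteed.
    for i in range(2 ** (8 * id_bytes) + 1):
        data = i.to_bytes(8, "big")
        sid = short_id(data, id_bytes)
        if sid in seen:
            return seen[sid], data, sid
        seen[sid] = data
    raise AssertionError("unreachable: more inputs than possible IDs")
```

In practice the search stops long before the guaranteed bound, which is the same effect that makes collisions in small ID spaces show up sooner than intuition suggests.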

There are two ways for the specs to be loosened:

  1. The first constraint is loosened so slippi files are mapped one-to-one to IDs. In this case the ID wouldn't be a hash; it would be a lossless compression of the file. The output could not be fixed length, and it wouldn't serve the original purpose of this issue very well.
  2. The second constraint is loosened allowing for duplicate IDs. This is how IDs are typically generated and collisions are just a fact of life.
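
How often collisions are a fact of life depends heavily on ID length. A rough Python sketch using the standard birthday approximation, assuming IDs behave like uniformly random values (real inputs such as the game's random seed may not be uniform):

```python
import math


def collision_probability(n_items: int, id_bits: int) -> float:
    """Approximate chance that at least two of n_items random IDs collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)) and
    assumes each ID is drawn uniformly from 2**id_bits values.
    """
    pairs = n_items * (n_items - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2 ** id_bits)
```

Under these assumptions, a 32-bit value on its own reaches roughly a 50% chance of a repeat at around 77,000 games, while a 256-bit hash stays negligible for any realistic number of replays.
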

blasphemetheus commented 11 months ago

Ah! Been a sec. Uh. Yes. Collisions are the thing that must be accepted.

This is probably an example of excitability when thinking about things on my part :)