rafaelcp / smbdataset

Super Mario Bros. (NES) gameplay dataset for machine learning.
Creative Commons Attribution 4.0 International

New-line encoding in RAM data #4

Open ellyhae opened 1 month ago

ellyhae commented 1 month ago

I'm using Python to extract values from the RAM snapshots. Because I couldn't get PIL to load the RAM chunk, I decided to extract it myself. However, I noticed that the size of the RAM chunk varied between frames and the values I wanted to extract kept shifting positions.

After some digging, I think the RAM values were somehow saved incorrectly: every byte with the value 13, which is also the encoding of the carriage-return character '\r', was replaced by the two bytes '\r\n', the Windows-style line ending. Each such replacement grows the RAM chunk by one byte, which caused my issues with addressing specific values.

My solution is to replace all occurrences of '\r\n' with '\r' in the range between tEXtRAM and tEXtBP1. This carries the risk of overwriting legitimate '\r\n' sequences, but so far it seems to work fine.

I don't think my code for loading the data is at fault here, but if you have used the RAM data in python before and have not had issues please let me know.

For reference, here is how I now load the RAM data from a file:

with open(file, 'rb') as f:
    content = f.read()

# extract the chunk containing the RAM data
# start is marked by the label tEXtRAM and a single zero-value
# end is marked by the next chunk, tEXtBP1. in between the two chunks are the 4-byte CRC of the RAM chunk and the 4 bytes specifying the size of the next chunk
ram = content[content.find(b'tEXtRAM')+len(b'tEXtRAM')+1:content.find(b'tEXtBP1')-4-4]

# it seems the data was incorrectly converted/saved, which caused the line endings \r to change to the more common \r\n, adding an additional byte each time
# replacing all \r\n in the data with \r seems to fix the issue, though there is the risk of overwriting actual data
ram = ram.replace(b'\r\n', b'\r')

# check to make sure the RAM data is the correct size. if it is smaller than anticipated, it's probably because the above line removed actual data
assert len(ram) == 2048
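
For what it's worth, instead of searching for the chunk labels directly, the tEXt payloads could also be pulled out with a small chunk-level walk over the PNG structure (each chunk is length, type, data, CRC). This is just a sketch under the assumption that the snapshots are standard PNGs; `read_text_chunks` is a hypothetical helper, and CRCs are skipped rather than verified:

```python
import struct

def read_text_chunks(path):
    """Return the payloads of all tEXt chunks, keyed by keyword
    (e.g. 'RAM', 'BP1'). Assumes the file starts with the standard
    8-byte PNG signature."""
    chunks = {}
    with open(path, 'rb') as f:
        f.read(8)  # skip PNG signature
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack('>I4s', header)
            data = f.read(length)
            f.read(4)  # skip CRC
            if ctype == b'tEXt':
                # tEXt payload = keyword, null separator, text
                keyword, _, text = data.partition(b'\x00')
                chunks[keyword.decode('latin-1')] = text
    return chunks
```

The 'RAM' payload returned this way would still need the same '\r\n' to '\r' replacement as above.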

rafaelcp commented 1 month ago

Nice catch, again! I'm sorry for your trouble. As I'm not able to take a deeper look right now, I tried OpenAI o1-preview and the result is clarifying: https://poe.com/s/1SnRmRxdzUdOGFjRBHqz It seems your fix is the best you can do right now. (In fact, I never tried to read the RAM data back; that should be tested inside the data collector itself to make sure.)

Unfortunately, I'll only be able to check and fix this issue by the end of November. Feel free to ping me if I forget.

Btw, as soon as you have some interesting results to show, I'm more than interested in seeing them!

ellyhae commented 1 month ago

I thought about this some more and I believe replacing all '\r\n' with '\r' actually always returns the correct data, rather than having the potential to overwrite intended data: in the wrongly saved data, every '\r' is immediately followed by an inserted '\n', so the left-to-right replacement consumes exactly those inserted bytes and nothing else.

Some examples of RAM data, the wrongly saved version, and the result of replacing as described:

- \r -> \r\n -> \r
- \r\n -> \r\n\n -> \r\n
- \r\n\r -> \r\n\n\r\n -> \r\n\r
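
To illustrate, here is a small sketch that models the suspected corruption (`corrupt` is a hypothetical stand-in for the faulty save step, not the actual collector code) and checks that the replacement inverts it on the examples above:

```python
def corrupt(data: bytes) -> bytes:
    # model of the suspected bug: every 0x0D byte written as b'\r\n'
    return data.replace(b'\r', b'\r\n')

def fix(data: bytes) -> bytes:
    # the proposed repair: collapse every b'\r\n' back to b'\r'
    return data.replace(b'\r\n', b'\r')

# the examples above round-trip correctly
for original in (b'\r', b'\r\n', b'\r\n\r'):
    assert fix(corrupt(original)) == original
```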

As for interesting results, we'll have to see. My team and I are looking for time-series data for a university assignment, and we have not yet 100% decided to go with this one, though it seems likely. The goal will be to use different projection algorithms, like t-SNE and UMAP, to plot the dataset in 2D and then investigate interesting patterns in the plots. The current idea is to use some features derived from the pixel data as the projection input and the information from the RAM as metadata. To see if this even makes sense, I already created some initial plots and they seem promising:

[four visualization images attached]

Note, this uses the row-wise and column-wise averages of the pixels as features. Also, to save on compute, it only includes every fifth frame of the first two levels.
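
For reference, the row-wise and column-wise averages are roughly this (a sketch, assuming each frame is already a 2-D grayscale array; `frame_features` is just a name I made up):

```python
import numpy as np

def frame_features(frame: np.ndarray) -> np.ndarray:
    """Concatenate row-wise and column-wise pixel averages.

    For a frame of shape (H, W) this yields H + W features,
    e.g. 240 + 256 = 496 for a full NES frame."""
    return np.concatenate([frame.mean(axis=1), frame.mean(axis=0)])

# subsampling every fifth frame, as described above
# (hypothetical `frames` array of shape (N, H, W)):
# features = np.stack([frame_features(f) for f in frames[::5]])
```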

rafaelcp commented 1 month ago

Hahaha, very interesting! If you don't mind me suggesting, visualizing success vs. failure paths or coloring different actions could be interesting too. Also, did you try PCA or DCT as features? Or maybe even random projections (see the Johnson-Lindenstrauss lemma) could work great.
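
A Gaussian random projection in the Johnson-Lindenstrauss spirit is cheap to sketch with plain NumPy (the function name, seeding, and 1/sqrt(k) scaling are my choices, not anything from the dataset):

```python
import numpy as np

def random_projection(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Project X of shape (n_samples, n_features) down to k dimensions.

    Entries of the projection matrix are drawn i.i.d. from N(0, 1/k),
    so squared pairwise distances are preserved in expectation."""
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(X.shape[1], k))
    return X @ R
```

scikit-learn's random_projection module offers the same idea with a fitted-estimator interface, if a dependency is acceptable.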

ellyhae commented 1 month ago

Thank you for the suggestions!

Honestly, we haven't tried a lot yet. I only did a quick-and-dirty check to see if this dataset could be used with projection methods and has some interesting metadata before bringing it up in the group for further consideration.

That's also where these kinda weird averages-as-features stem from: I was only using the data of the first level and didn't want my number of features to be significantly higher than the number of data points, so I just went with the first solution that came to mind.

The first part of the assignment will be to try out some projection methods, so we will think more deeply about our feature selection at that point.