Open KemptonM opened 2 years ago
Reddit's data is incredibly mangled (the final snapshot is wrong and the CSV data is out of order). I suggest using this torrent instead, which is a minified and reordered version of the CSV data, from Scaevolus on the Place Atlas Discord:
[magnet link removed due to issues with data, see Edit 2]
It's still complete, in the correct order, and ~1/5th the size
Edit: BOTH the official dataset and this minified version have incorrect admin rect data, see here
Edit 2: Disregard the torrent; we're still working out the kinks. Regardless, Reddit's dataset seems incorrect. I will keep you updated.
I might be misunderstanding: is the data in this torrent still based on Reddit's official CSV data, or does it come from scrapers?
I agree that the project should probably use the joint torrent data. There are a few scrapers out there that could be used to collect this data automatically, rather than having the user download the dataset.
See here for an example
However, I'd suggest that this should probably be a forked project. I don't have the time to manage this repo, as I have my finals in a couple of weeks. I'd also be happy to add collaborators to the project to manage pull requests, as long as it is handled appropriately.
@KemptonM it comes from Reddit's official CSV data, but it's going to take us some time to repair it, as Reddit's official CSV data is incredibly mangled. Right now we're working on combining the r/Place timeline data (which has snapshots spaced 1 second apart) with the CSV data to try to determine the actual order in which pixels were placed and repair the data.
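To make the idea concrete, here is a rough sketch of how one snapshot interval could be reconciled against the CSV rows that fall inside it. This is not the actual repair code: the function names `repair_interval` and `replay`, the `(x, y, color)` event tuples, and the dict-based snapshot are all assumptions made for illustration. The only rule it applies is that, for each pixel, the placement whose color matches the next snapshot must have happened last within that interval.

```python
from collections import defaultdict


def repair_interval(events, snapshot_after):
    """Reorder the events of one snapshot interval (hypothetical helper).

    events: list of (x, y, color) tuples in possibly wrong CSV order.
    snapshot_after: dict mapping (x, y) -> color at the end of the interval.
    Returns a new list in which, for every pixel, the placement that matches
    the snapshot color is replayed last. Pixels with no matching placement
    keep their original relative order.
    """
    # Group the positions and colors of all placements by pixel coordinate.
    by_pixel = defaultdict(list)
    for index, (x, y, color) in enumerate(events):
        by_pixel[(x, y)].append((index, color))

    # For each pixel, pick the placement that agrees with the snapshot.
    final_index = {}
    for coord, placements in by_pixel.items():
        expected = snapshot_after.get(coord)
        matching = [i for i, c in placements if c == expected]
        if matching:
            final_index[coord] = matching[-1]

    # Keep everything else in its original order, then append the
    # snapshot-matching placements so they win when replayed.
    head = [e for i, e in enumerate(events) if final_index.get((e[0], e[1])) != i]
    tail = [e for i, e in enumerate(events) if final_index.get((e[0], e[1])) == i]
    return head + tail


def replay(events):
    """Apply events in order and return the resulting (x, y) -> color map."""
    canvas = {}
    for x, y, color in events:
        canvas[(x, y)] = color
    return canvas


if __name__ == "__main__":
    events = [(0, 0, "red"), (0, 0, "blue"), (1, 1, "green")]
    snapshot = {(0, 0): "red", (1, 1): "green"}
    print(replay(repair_interval(events, snapshot)) == snapshot)
```

This only constrains which placement came last for each pixel within an interval; the relative order of the earlier placements in that interval can't be recovered from snapshots alone.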
@lloydowen8 I'd be fine taking over the project if you'd like. I just started a new quarter at UCSD, so I'm fairly busy myself, but I've been making time to work on a lot of r/Place-related projects in the last few days, and I think I've been the most active person in this repo anyway. I can fork it or you can transfer ownership to me; your choice.
This has been addressed in the standalone script. However, the notebook should also be adapted to use this data.
The current dataset excludes some snapshots at the very beginning and just before the whiteout. Reddit's data is complete from beginning to end.