Closed klop closed 7 years ago
We'll probably do another dump in March.
Is there any way to get a dump of 6.86 matches only? All I could find were the 500k Dec 2015 and 3.5M dumps.
Maybe with skill data this time!
With skill data would be awesome.
Do we want to make this a quarterly or semiannual thing?
Pushing back because we're doing import right now.
Posting to say this would be good quarterly (unless you get the BigQuery thing updating live). Will you post a blog post when the next dump happens?
If it were up to me I'd probably do semiannual but if @albertcui wants to do it quarterly I won't say no (he's the one having to export/upload the data anyway).
Regarding future dumps: I think at some point after we complete the import we will do a massive pg_dump (this would produce a PostgreSQL-specific dump) with every match ever played (~1.2 billion matches, mostly unparsed). This will also aid us in doing a data migration if we need to move our match data somewhere else (possibly because of Google getting too expensive). Then we can do periodic "addendum" dumps to keep updated records exported. It is up to @albertcui if he wants to continue doing the more generic JSON dumps as well.
We could possibly also get away with not keeping snapshots in Google (that would save nearly $100 a month).
ETA for import is 10-15 days.
That'd be great: I'd love to be able to query a db about matches (like the official api, but not limited to the last x hundred games). If I have to download a massive file first that's not really a problem.
I take it opening up an api of your own would have too high a bandwidth overhead?
Yeah, APIs are expensive to operate.
I'm very interested in using the MMR data for machine learning; is it included in this data dump? I suspect one can estimate a player's MMR to very high accuracy.
@albertcui are you planning to dump player_ratings? Or perhaps export a "snapshot" of current MMR data?
I think it would be nice if dumps were somewhat synchronized with the Majors. That way they would be released at known intervals and roughly aligned with big updates.
Import is done. Been talking with @albertcui about doing a full dump this time (with every match ever played).
We'd dump matches, player_matches, and match_skill as CSV. Users would have to join the data themselves.
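Joining the CSVs yourself can be done with standard Unix tools before ever loading a database. A minimal sketch with the `join` utility, using tiny illustrative fixtures (the real dumps have many more columns; file names and the column layout here are stand-ins, not the actual schema):

```shell
#!/bin/sh
# Illustrative fixtures: tiny stand-ins for the real matches / match_skill dumps.
cat > matches.csv <<'EOF'
match_id,duration
100,1800
101,2400
102,2100
EOF
cat > match_skill.csv <<'EOF'
match_id,skill
100,3
102,1
EOF

# join(1) needs both inputs sorted on the join key; -t, sets the comma
# delimiter. Strip the header rows first with tail -n +2.
tail -n +2 matches.csv | sort -t, -k1,1 > matches.sorted
tail -n +2 match_skill.csv | sort -t, -k1,1 > skill.sorted

# Inner join on column 1 (match_id); matches without skill data drop out.
join -t, -1 1 -2 1 matches.sorted skill.sorted > joined.csv
cat joined.csv
# 100,1800,3
# 102,2100,1
```

For the real ~TB files you'd want to do the sort with `sort -T` pointed at a big scratch disk, or just load the CSVs into Postgres/BigQuery and join there.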
Sounds good
@albertcui I put sample queries in the OP. You may want to try them locally on your devbox first to make sure they work properly.
yasp=# COPY matches TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches.gz' CSV HEADER;
COPY 1191768403
yasp=# COPY match_skill TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/match_skill.gz' CSV HEADER;
COPY 132447335
matches.gz is 146 GB. Currently exporting player_matches.
Update: So we encountered some kind of exception while trying to dump player_matches.
Apparently a fix is to perform a vacuum on the table, so @albertcui did this. It's been running for weeks :(
Started player_matches export again.
COPY 11720437356
Uploading to Amazon Cloud.
From @albertcui :
if someone has the hard drive space, ~1TB, if you could help download our data dump and test it, that would be great. https://www.amazon.com/clouddrive/share/2tSLvE98SNMmuv6wwUaPjYLaYA4Rw16ISzW38yAu8yU?ref_=cd_ph_share_link_copy
the large files are split up into 10 GB pieces because of an amazon limitation. You can cat them back together:
http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts
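The split/rejoin round trip looks like this; the real pieces were cut at 10 GB (`split -b 10G`), but the same commands work on a small demo file, and the file names here are illustrative:

```shell
#!/bin/sh
# Demo "dump" standing in for matches.gz (1 MiB of arbitrary bytes).
head -c 1048576 /dev/urandom > matches.gz

# Split into fixed-size pieces (the real dump used -b 10G because of
# Amazon Cloud Drive's per-file limit). Suffixes come out as .aa, .ab, ...
split -b 262144 matches.gz matches_split.gz.

# Reassemble: the shell glob expands in sorted suffix order, so plain
# cat restores the original byte stream exactly.
cat matches_split.gz.* > matches_rejoined.gz
cmp matches.gz matches_rejoined.gz && echo "round-trip OK"
```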
The plan is to create torrents for the 3 dump files to minimize the cost of distribution. @albertcui cannot do this until he's back home in August, so hopefully someone here can.
(Please avoid downloading these files unless you're making a torrent from them, since we don't know what kinds of egress caps Amazon has on Cloud Drive)
@fhoffa , perhaps you could help us do this (since you'll probably want the data to upload to BigQuery?)
If I may make a suggestion for future dumps. Consider doing them incrementally, and ideally split by date. That would make the data much more accessible while also minimizing costs of the process.
Moving to August, since that's likely when we'll be able to distribute the files (unless someone wants to create a torrent now)
I am slowly but surely downloading the files from the Amazon cloud. I finished matches_split, and am currently downloading player_matches_split. When I finish and verify the files (btw, md5 would help), I will put up a torrent. It moves rather slowly as Amazon closes the session (and the link) after some time, so I have to use a download manager.
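Publishing checksums alongside the pieces would let downloaders verify each file independently. A minimal sketch with `md5sum` (the file names are stand-ins for the real split pieces):

```shell
#!/bin/sh
# Stand-in files for two pieces of the dump.
echo "piece one" > matches_split.gz.aa
echo "piece two" > matches_split.gz.ab

# Publisher side: record one checksum line per file.
md5sum matches_split.gz.* > MD5SUMS

# Downloader side: re-hash and compare; prints "OK" per file and
# exits non-zero on any mismatch.
md5sum -c MD5SUMS
```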
I don't know if it's practical or just creates more problems, but I figured I'd throw this out there:
Would it be more practical to do smaller dumps of data, but do them regularly? Personally I can't even download such large files, let alone do much meaningful work with them. I'm usually working with between 25k and 100k matches I've streamed from the JSON, and find that's usually enough for meaningful data.
If that is more practical, I think these smaller but regular dumps could add a lot of value, since there would always be up-to-date data to work with.
Yeah, I think any future dumps will be "incremental" dumps on top of this one. We don't want to do another one of these full dumps (all together, it ended up costing nearly $800)
Sounds great! Is the plan to do a "500.000 most recent games" dump alongside the full dump, like you did in December? Or will it only be the 1TB file?
Only the big one. We may just delete the older ones to prevent confusion.
As the 1TB file is huge, I think it would be nice to keep smaller samples of the database around, at least the December 500k matches. It's easier to start working with those.
http://xgfs.ru/yasp_06_2016.torrent
It probably won't be there forever; if someone has the ability to rehost (Academic Torrents? dunno), please do. I will be seeding to the best of my ability.
@nicholashh , can you download and verify the torrent?
Academic torrents just provides a tracker for the file. You'll still need to seed until someone else has a complete copy of the files.
To confirm, there should be 3 .gz archives in the torrent: match_skill.gz, matches.gz, and player_matches.gz.
@howardchung I am interested in getting the latest data into BigQuery (I work with Felipe; he will help me with BigQuery, but he doesn't play Dota, so this is more interesting to me). Here are my questions:
Thanks, awesome project, hoping to dive into it!
@waprin The table schemas are in the repo (sql/create_tables.sql). This dump contains a superset of the data in the previous dump. Last time, we combined the data from multiple tables into JSON match objects. This time, we opted to dump CSV since it's more flexible.
This is a wonderful set of data; I've got it downloaded and have been poking at it. In the first segment it seems like a large number of matches do not have skill levels when I do the join. Is this expected?
Yes, we only started collecting skill data in May 2015 and this data contains every match (starting in 2010). Dotabuff/Dotamax likely have more skill data since they have been running longer but I am not sure if they will do data dumps.
Re: smaller dump, perhaps someone who downloads the full data set can grab the tail end of each of the 3 files and make a "mini" data set.
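One way to cut such a "mini" set without ever storing the decompressed data on disk: stream through gunzip and keep the CSV header plus the last N rows. A sketch on a small demo file (for the real dump you'd swap in matches.gz and something like `tail -n 500000`; the file names and row counts here are illustrative):

```shell
#!/bin/sh
# Demo fixture: a small gzipped CSV standing in for matches.gz
# (header row + 1000 data rows).
printf 'match_id,duration\n' > demo.csv
seq 1 1000 | awk '{print $1",1800"}' >> demo.csv
gzip -c demo.csv > demo.csv.gz

# Keep the CSV header, then append the last 100 data rows.
# Real dump equivalent: gzip -dc matches.gz | tail -n 500000
gzip -dc demo.csv.gz | head -n 1   >  mini.csv
gzip -dc demo.csv.gz | tail -n 100 >> mini.csv
gzip mini.csv                      # -> mini.csv.gz

gzip -dc mini.csv.gz | wc -l       # prints 101: header + 100 rows
```

Note that `tail` still has to read the whole decompressed stream, so this is slow on a terabyte, but it never needs more than the mini file's worth of disk.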
@xgfs Could you rehost the torrent file? I can put it on academic torrents then we can actually promote the data dump.
Sorry for the delay; I just recently got home and to a stable internet connection.
@waprin We'd really appreciate it if you could possibly host/seed the torrent. I probably can't do it until I get my housing situation sorted out in the next few weeks. On a side note, I'm going to be starting at Google on the 8th, so see you soon!
There were some server issues, they are resolved. The link is up again: http://xgfs.ru/yasp_06_2016.torrent
@xgfs Thanks, can you set the announce URL to http://academictorrents.com/announce.php ?
Sorry, nevermind I've done it! Thanks a lot!
I've uploaded the torrent to Academic Torrents. Unfortunately, I won't have a computer with enough disk space to seed for another week :(
File sizes seem too small (player_matches). Can we wc it and confirm the count?
Also it should be March not June. I believe that is the end of the data.
Correct: for some reason ln created a weird half-sized symbolic link (I keep the data on another disk). I will re-create the torrent now. Sorry for the inconvenience. I'll edit the wc output into this post when it's ready.
wc -c match_skill.gz 532083676 match_skill.gz
wc -c matches_split.gz 155941971528 matches_split.gz
wc -c player_matches_split.gz 542664373941 player_matches_split.gz
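Byte counts catch truncation like the symlink mishap above, but gzip also stores a CRC, so `gzip -t` can verify integrity without writing the decompressed data anywhere. A quick sketch on a demo file (not the real dump):

```shell
#!/bin/sh
# Demo archive standing in for one of the dump files.
echo "some match data" | gzip > demo.gz

# -t decompresses in memory and checks the stored CRC and length;
# exit status 0 means the archive is intact.
gzip -t demo.gz && echo "demo.gz OK"

# A truncated copy fails the same check.
head -c 10 demo.gz > truncated.gz
gzip -t truncated.gz 2>/dev/null || echo "truncated.gz corrupt"
```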
Any chance for that smaller dataset containing recent matches? I don't think I can store the whole 1TB file on my computer and I'd like to be able to work with a more recent dataset than the December one.
Have most of the data downloaded and uploaded to a bucket; still need to import it to BigQuery. Will look into seeding the torrent and making a smaller dataset.
@albertcui what location are you starting at?
@waprin I'm going to be in MTV.
@albertcui im in MTV, we should get lunch after you're settled
Torrent and lengths updated.
Could we get the line counts of the decompressed data? They should match the COPY output from above.
Also, could you remove _split from the file names?
Something like cat match_skill.gz | gunzip | wc -l, which won't require you to store the decompressed data on disk.
@waprin Hi! Any updates on the torrent?
Sorry for big delay, sidetracked by other things.
Byte count on the gzip is the same for matches.gz but not for player_matches.gz. Also included decompressed sizes:
waprin@unix-instance2:~$ wc -c matches.gz
155941971528 matches.gz
waprin@unix-instance2:/mnt/disks/extra-disk/Downloads$ wc -c player_matches.gz
542664372161 player_matches.gz
waprin@unix-instance2:/mnt/disks/extra-disk/Downloads$ wc -c player_matches.csv
2200436749690 player_matches.csv
waprin@unix-instance2:/mnt/disks/extra-disk/Downloads$ wc -c ~/matches.csv
1253950283105 /home/waprin/matches.csv
I have 4 GB truncated files for matches and player_matches in a GCS bucket if that helps. Currently putting both the small and full dumps into BigQuery.
https://storage.googleapis.com/dota-match-dumps/matches_small.csv https://storage.googleapis.com/dota-match-dumps/player_matches_small.csv https://storage.googleapis.com/dota-match-dumps/match_skill.csv
Will try to get line count next as well.
I am also now trying to seed the torrent using Transmission; not sure if I'm using it wrong or if nobody else is seeding currently.