odota / core

Open source Dota 2 data platform
https://www.opendota.com
MIT License

March data dump #881

Closed: klop closed this issue 7 years ago

klop commented 8 years ago

Is there any way to get a dump of 6.86 matches only? All I could find were the 500k Dec 2015 and 3.5M dumps.

albertcui commented 8 years ago

We'll probably do another dump in March.

howardchung commented 8 years ago

Maybe with skill data this time!

klop commented 8 years ago

With skill data would be awesome.

howardchung commented 8 years ago

do we want to make this a quarterly or semiannual thing?

albertcui commented 8 years ago

Pushing back because we're doing import right now.

onelivesleft commented 8 years ago

Posting to say this would be good quarterly (unless you get the BigQuery thing updating live). Will you post a blog post when the next dump happens?

howardchung commented 8 years ago

If it were up to me I'd probably do semiannual but if @albertcui wants to do it quarterly I won't say no (he's the one having to export/upload the data anyway).

Regarding future dumps: I think at some point after we complete the import we will do a massive pg_dump (this would produce a PostgreSQL-specific dump) with every match ever played (~1.2 billion matches, mostly unparsed). This will also aid us in doing a data migration if we need to move our match data somewhere else (possibly because of Google getting too expensive). Then we can do periodic "addendum" dumps to keep updated records exported. It is up to @albertcui if he wants to continue doing the more generic JSON dumps as well.
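
A rough sketch of what that full dump might look like (flags, paths, and the table list here are illustrative, not a committed plan):

# a sketch only: custom-format pg_dump of the match tables (PostgreSQL-specific output)
pg_dump -Fc -t matches -t player_matches -t match_skill yasp > yasp_full.dump
# a dump like this could later be restored elsewhere with:
pg_restore -d target_db yasp_full.dump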

We could possibly also get away with not keeping snapshots in Google (that would save nearly $100 a month).

ETA for import is 10-15 days.

onelivesleft commented 8 years ago

That'd be great: I'd love to be able to query a db about matches (like the official api, but not limited to the last x hundred games). If I have to download a massive file first that's not really a problem.

I take it opening up an api of your own would have too high a bandwidth overhead?

howardchung commented 8 years ago

Yeah, APIs are expensive to operate.

mikkelam commented 8 years ago

I'm very interested in using the MMR data for machine learning. Is it included in this data dump? I suspect one can estimate a player's MMR to very high accuracy.

howardchung commented 8 years ago

@albertcui are you planning to dump player_ratings? Or perhaps export a "snapshot" of current MMR data?

paulodfreitas commented 8 years ago

I think it would be nice if the dumps were somewhat synchronized with the Majors. That way they would be released at known intervals and roughly aligned with big updates.

howardchung commented 8 years ago

import is done. Been talking with @albertcui about doing a full dump this time (with every match ever played).

We'd dump matches, player_matches, and match_skill as CSV. Users would have to join the data themselves.
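
For anyone loading the CSVs back into Postgres later, the join itself is simple; a minimal sketch (column names follow sql/create_tables.sql in this repo, so double-check against the actual schema):

-- a sketch: stitch the three dumped tables back together on match_id
SELECT m.match_id, pm.account_id, pm.hero_id, ms.skill
FROM matches m
JOIN player_matches pm ON pm.match_id = m.match_id
LEFT JOIN match_skill ms ON ms.match_id = m.match_id  -- skill data only exists for some matches
LIMIT 10;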

onelivesleft commented 8 years ago

Sounds good

howardchung commented 8 years ago

@albertcui I put sample queries in the OP. You may want to try them locally on your devbox first to make sure they work properly.

albertcui commented 8 years ago

yasp=# COPY matches TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches.gz' CSV HEADER;
COPY 1191768403
yasp=# COPY match_skill to PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/match_sill.gz' CSV HEADER;
COPY 132447335

matches.gz is 146 GB. Currently exporting player_matches.

howardchung commented 8 years ago

Update: So we encountered some kind of exception while trying to dump player_matches.

Apparently a fix is to perform a vacuum on the table, so @albertcui did this. It's been running for weeks :(

albertcui commented 8 years ago

Started player_matches export again.

albertcui commented 8 years ago

COPY 11720437356

Uploading to Amazon Cloud.

howardchung commented 8 years ago

From @albertcui:

If someone has the hard drive space (~1 TB) and could help download our data dump and test it, that would be great. https://www.amazon.com/clouddrive/share/2tSLvE98SNMmuv6wwUaPjYLaYA4Rw16ISzW38yAu8yU?ref_=cd_ph_share_link_copy

The large files are split up into 10 GB pieces because of an Amazon limitation. You can cat them back together:

http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts
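
Concretely, reassembly should be something like this (piece names are illustrative; the glob just needs to sort in the original order):

# a sketch: concatenate the 10 GB pieces back into a single gzip archive
cat matches.gz.part* > matches.gz
# sanity-check the reassembled archive without decompressing it to disk
gzip -t matches.gz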

The plan is to create torrents for the 3 dump files to minimize the cost of distribution. @albertcui cannot do this until he's back home in August, so hopefully someone here can.

(Please avoid downloading these files unless you're making a torrent from them, since we don't know what kinds of egress caps Amazon has on Cloud Drive)

@fhoffa, perhaps you could help us do this (since you'll probably want the data to upload to BigQuery?)
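
For whoever picks this up: after reassembling the pieces, building the torrent could be as simple as the following sketch with mktorrent (the announce URL and file layout are just examples):

# a sketch: put the three archives in one directory and build a torrent from it
mkdir yasp_dump && mv matches.gz player_matches.gz match_skill.gz yasp_dump/
mktorrent -a http://academictorrents.com/announce.php -o yasp_dump.torrent yasp_dump/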

rossengeorgiev commented 8 years ago

If I may make a suggestion for future dumps: consider doing them incrementally, and ideally split by date. That would make the data much more accessible while also minimizing the cost of the process.

howardchung commented 8 years ago

Moving to August, since that's likely when we'll be able to distribute the files (unless someone wants to create a torrent now)

xgfs commented 8 years ago

I am slowly but surely downloading the files from the Amazon cloud. I finished matches_split, and am currently downloading player_matches_split. When I finish and verify the files (btw, md5 would help), I will put up a torrent. It moves rather slowly as Amazon closes the session (and the link) after some time, so I have to use a download manager.
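
On the md5 point, a minimal sketch of what would help on the uploader side (file names are illustrative):

# uploader side: checksum every piece so downloads can be verified
md5sum matches_split.gz.part* player_matches_split.gz.part* match_skill.gz > MD5SUMS
# downloader side: verify everything in one go
md5sum -c MD5SUMS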

karigunnarsson commented 8 years ago

I don't know if it's practical or if it just creates more problems, but I figured I'd throw this out there:

Would it be more practical to do smaller dumps of data, but do them regularly? Personally I can't even download such large files, let alone do much meaningful work with them; I'm usually working with 25-100k matches I've streamed from the JSON, and I find that's usually enough for meaningful data.

If that is more practical, I think it could add a lot of value to have these smaller dumps regularly, so there's up-to-date data to work with.

howardchung commented 8 years ago

Yeah, I think any future dumps will be "incremental" dumps on top of this one. We don't want to do another one of these full dumps (altogether it ended up costing nearly $800).
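
An incremental dump could then just export rows past the last match_id covered by the full dump, along these lines (the cutoff id is purely illustrative):

-- a sketch: export only matches added since the previous dump
COPY (SELECT * FROM matches WHERE match_id > 2500000000)
  TO PROGRAM 'gzip > /var/lib/postgresql/data/pgdata/matches_incremental.gz' CSV HEADER;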

karigunnarsson commented 8 years ago

Sounds great! Is the plan to do a "500,000 most recent games" dump alongside the full dump, like you did in December? Or will it only be the 1TB file?

howardchung commented 8 years ago

Only the big one. We may just delete the older ones to prevent confusion.

paulodfreitas commented 8 years ago

Since the 1TB file is huge, I think it would be nice to keep smaller samples of the database, at least the December 500k matches. It's easier to start working with those.

xgfs commented 8 years ago

http://xgfs.ru/yasp_06_2016.torrent

It probably will not be there forever, so if someone has the ability to rehost it (Academic Torrents? dunno), please do. I will be seeding to the best of my ability.

howardchung commented 8 years ago

@nicholashh, can you download and verify the torrent?

Academic Torrents just provides a tracker for the file. You'll still need to seed until someone else has a complete copy of the files.

To confirm, there should be 3 .gz archives in the torrent: match_skill.gz, matches.gz, and player_matches.gz.

waprin commented 8 years ago

@howardchung I am interested in getting the latest data into BigQuery (I work with Felipe; he will help me with BigQuery, but he doesn't play Dota, so this is more interesting to me). Here are my questions:

  1. Could you give a high-level summary of the 3 tables and their relationship to what Felipe put in BigQuery last time? https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.dota2_yasp_v1?pli=1
  2. If I want to peek at some rows, is it impossible without downloading the whole thing? It seems like it is, since the data is gzipped and then split up, which I think means you need to cat the whole thing back together before unzipping; correct me if I'm wrong.
  3. Do you still need help hosting or seeding the torrent? The current link looks broken. I will look into my options for doing so.

Thanks, awesome project, hoping to dive into it!

howardchung commented 8 years ago

@waprin

  1. You might want to look at the schema to see what the tables contain (sql/create_tables.sql). This dump contains a superset of the data in the previous dump. Last time, we combined the data from multiple tables into JSON match objects. This time, we opted to dump CSV since it's more flexible.
  2. You could potentially download the first few pieces (or just one), stream-unzip them, and look at whatever data decompresses before it fails; see the sketch after this list.
  3. Albert will be back in about a month. The data is still hosted in Amazon cloud drive. We need a torrent made, verified, and seeded. Albert can probably do it if no one has a seed up by the time he returns.
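
For (2), a minimal sketch (the piece name is hypothetical, and only the first piece will decompress standalone):

# peek at the first rows of the first split piece without reassembling anything;
# gzip will complain about the truncated stream at the end, which is expected
gzip -dc matches.gz.part-aa 2>/dev/null | head -n 20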

benlee commented 8 years ago

This is a wonderful set of data. I've got it downloaded and have been poking at it; in the first segment it seems like a large number of matches do not have skill levels when I do the join. Is this expected?

howardchung commented 8 years ago

Yes, we only started collecting skill data in May 2015 and this data contains every match (starting in 2010). Dotabuff/Dotamax likely have more skill data since they have been running longer but I am not sure if they will do data dumps.

howardchung commented 8 years ago

Re: smaller dump, perhaps someone who downloads the full data set can grab the tail end of each of the 3 files and make a "mini" data set.
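
Something along these lines would do it (a sketch; it still streams through the whole file, assumes the reassembled matches.gz, and assumes rows are roughly in match_id order):

# a sketch: keep the CSV header plus the last 500k rows of the full dump
zcat matches.gz | head -n 1 > matches_small.csv
zcat matches.gz | tail -n 500000 >> matches_small.csv
gzip matches_small.csv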

albertcui commented 8 years ago

@xgfs Could you rehost the torrent file? I can put it on Academic Torrents, and then we can actually promote the data dump.

Sorry for the delay; I just recently got home and to a stable internet connection.

@waprin We'd really appreciate it if you could possibly host/seed the torrent. I probably can't do it until I get my housing situation sorted out in the next few weeks. On a side note, I'm going to be starting at Google on the 8th, so see you soon!

xgfs commented 8 years ago

There were some server issues, but they are resolved. The link is up again: http://xgfs.ru/yasp_06_2016.torrent

albertcui commented 8 years ago

@xgfs Thanks, can you set the announce URL to http://academictorrents.com/announce.php ?

albertcui commented 8 years ago

Sorry, never mind, I've done it! Thanks a lot!

albertcui commented 8 years ago

I've uploaded the torrent to Academic Torrents. Unfortunately, I won't have a computer with enough disk space to seed for another week :(

howardchung commented 8 years ago

File sizes seem too small (player_matches). Can we wc it and confirm the count?

Also, it should be March, not June; I believe that is the end of the data.

xgfs commented 8 years ago

Correct: for some reason ln created a weird half-sized symbolic link (I keep the data on another disk). I will re-create the torrent now. Sorry for the inconvenience. I'll edit the wc output into this post when it is ready.

$ wc -c match_skill.gz
532083676 match_skill.gz

$ wc -c matches_split.gz
155941971528 matches_split.gz

$ wc -c player_matches_split.gz
542664373941 player_matches_split.gz

chudooder commented 8 years ago

Any chance for that smaller dataset containing recent matches? I don't think I can store the whole 1TB file on my computer and I'd like to be able to work with a more recent dataset than the December one.

waprin commented 8 years ago

I have most of the data downloaded and uploaded to a bucket; I still need to import it into BigQuery, and will look into seeding the torrent and making a smaller dataset.

@albertcui what location are you starting at?

albertcui commented 8 years ago

@waprin I'm going to be in MTV.

benlee commented 8 years ago

@albertcui I'm in MTV; we should get lunch after you're settled.

xgfs commented 8 years ago

Torrent and lengths updated.

howardchung commented 8 years ago

Could we get the line counts of the decompressed data? They should match the COPY output from above.

Also, could you remove _split from the file names?

something like cat match_skill.gz | gunzip | wc -l, which won't require you to store the decompressed data on disk.

SomeSnm commented 8 years ago

@waprin Hi! Any updates on the torrent?

waprin commented 8 years ago

Sorry for the big delay; I got sidetracked by other things.

The byte count on the gzip is the same for matches.gz but not for player_matches.gz. Also included decompressed sizes:

waprin@unix-instance2:~$ wc -c matches.gz
155941971528 matches.gz

waprin@unix-instance2:/mnt/disks/extra-disk/Downloads$ wc -c player_matches.gz 
542664372161 player_matches.gz

waprin@unix-instance2:/mnt/disks/extra-disk/Downloads$ wc -c player_matches.csv 
2200436749690 player_matches.csv

waprin@unix-instance2:/mnt/disks/extra-disk/Downloads$ wc -c ~/matches.csv 
1253950283105 /home/waprin/matches.csv

I have 4 GB truncated files for matches and player_matches in a GCS bucket if that helps. Currently putting both the small and full dumps into BigQuery.

https://storage.googleapis.com/dota-match-dumps/matches_small.csv
https://storage.googleapis.com/dota-match-dumps/player_matches_small.csv
https://storage.googleapis.com/dota-match-dumps/match_skill.csv
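
If anyone wants to poke at those in BigQuery themselves, loading one of the small CSVs would look roughly like the sketch below (the dataset/table name is made up, and the truncated file probably needs a bad-record allowance since it's cut mid-row):

# a sketch: load the truncated matches CSV into a hypothetical BigQuery table
bq load --source_format=CSV --autodetect --skip_leading_rows=1 --max_bad_records=100 \
  my_dataset.matches_small gs://dota-match-dumps/matches_small.csv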

Will try to get line count next as well.

I am also now trying to seed the torrent using Transmission; I'm not sure if I'm using it wrong or if nobody else is seeding currently.