Custom Replay Parsing (uploads)

sozuoka commented 8 years ago

Back in early days, YASP had the option for users to upload their own replay, but that feature was removed later. Is it possible to re-add it? It's a valuable feature for small communities, so they can parse their private lobby matches. I think if it's a exclusive feature for Cheese users, it won't put too much stress on YASP's servers.

TODO:

[x] build frontend form on request page to accept files
[x] accept uploads on server side (web.js), set max upload size to prevent abuse
[x] figure out storage of replay blobs (Redis)
[x] extend parser.js/runParse.js to accept uploaded replays (not downloaded from a URL)
[x] figure out storage of result blobs and modify parser to insert into Redis
[x] figure out reconstruction of API fields
[x] figure out how to route to the match and display using same templates as regular matches (avoid duplicating code), but reading from Redis

howardchung commented 8 years ago

It's not as much the load as it is that the API doesn't have data for these matches. We use the API for the "Overview" tab which we expect to be there on all matches.

Even when we allowed upload it did not work on lobbies. It was mainly useful for parsing replays that had already expired but someone downloaded it first.

Options:

Client-side parse:
- We have the replay parsed in the browser via JavaScript and displayed in JS views. The data would reside completely on the user's computer (no upload), so we don't have to worry about inaccurate/weird match data going into the DB.
- However, you wouldn't be able to share a link to the match. You'd have to do something like save the HTML or take screenshots.
- The annoying part is having to rebuild all the views since the Jade templates will not work for client-parsed data. The parser would also have to be rewritten to use rapier (the JS parsing library I never finished).
Server-side parse:
- Have to regenerate the API data from the replay parse and probably wouldn't be completely accurate
- Bit of a privacy concern since the players lose anonymity (option, make all the account_ids anonymous so we don't screw up aggregations/reveal identities)
- We have to handle bad inputs gracefully (people trying to upload really big files, etc).
- If we did this I would like to sandbox these from the rest of the matches so we don't mess up aggregations, etc.
- Example, someone could play and win a ton of lobby matches by getting their opponents to throw. Then they mass-upload these to inflate their stats.

I'll leave it in the backlog, feel free to share ideas.

howardchung commented 8 years ago

Server side is probably more practical.

howardchung commented 8 years ago

After some more thought, we can probably do the following:

Upload replay as binary blob into redis
Extend parse jobs to accept inputs stored as binary blobs
Save uploaded parse into temporary storage (Redis also?)
Make these parses reachable through their own url, maybe something like (/uploads/random string)
Do best-effort computation of the overview data from the parse

The uploaded parse would only be available for however long we set the retention to, and will be sandboxed from aggregations.

Does that meet your needs?

sozuoka commented 8 years ago

That would be great! At first I didn't think that this task requires that much work on you guys,and I really appreciate your reply. Setting a limited storage time for the parsed custom replays is the right choice too, since YASP has to manage a lot of public replays already.

howardchung commented 8 years ago

We have to reconstruct some fields from the replay data since we won't have API access for lobby matches:

{
result: {
error: "Practice matches are not available via GetMatchDetails"
}
}

Things we have to patch back into the match from the replay hero_id (otherwise you cant figure out what hero goes in each row) duration (purchase/kill timelines rely on this to know how many columns to render)

Things we could do optionally: kills deaths assists last_hits denies gpm xpm items (more work since parser does not handle items atm)

Things we probably won't: account_id (privacy issue) game_mode lobby_type cluster ability_upgrades (hard) tower_status/barracks_status

sozuoka commented 8 years ago

I don't know much about this, but logically account_id, game_mode, lobby_type are out question due to the nature of practice match/private lobby. But is ablility_upgrades really that hard to parse?

howardchung commented 8 years ago

Account ID is actually possible. The replay data contains the information. I believe it has game mode as well.

For ability upgrades presumably the data is there but we would be starting from scratch since that's something we haven't had to implement before. @spheenik may know more.

howardchung commented 8 years ago

Almost done. Need to make the form look nice and integrate with the existing request form. Also needs kubernetes migration in order to give parse nodes access to Redis.

spheenik commented 8 years ago

Do you still need me? I don't know what you mean by ability upgrade...

howardchung commented 8 years ago

Skill build order On Jan 20, 2016 1:23 PM, "Martin Schrodt" notifications@github.com wrote:

Do you still need me? I don't know what you mean by ability upgrade...

— Reply to this email directly or view it on GitHub https://github.com/yasp-dota/yasp/issues/830#issuecomment-173363275.

spheenik commented 8 years ago

You mean the time when an ability was skilled? Should be easy... Listen for OnEntityUpdated, check if the entity is an ability, if so, check if one of the properties updated is the level of the ability, if so, emit a record.

sozuoka commented 8 years ago

That's good to hear, I'm really looking to testing this new feature myself!

howardchung commented 8 years ago

I think for now we will just leave most of the overview blank. I'm guessing you're more interested in the parse data and you can get the overview from the client anyway

howardchung commented 8 years ago

Pull request made. Let's get some eyes on the code? We cannot deploy to production until kube migration is complete.

sozuoka commented 8 years ago

I have two small questions: 1) as I understand, normally hero damage/tower damage is included in data given by game API. Does that mean if we want to have it for custom replay parsing, hero damage/tower damage have to be "manually" calculated? 2) Is it possible to make a movement heatmap for the whole game's duration (instead of only first 10 mins right now)? This is actually my suggestion for the public parser too (unless if it put too much load/stress on your servers, of course).

howardchung commented 8 years ago

Yes
Full game position data is expensive to store because it takes up a lot of space.

howardchung commented 8 years ago

I was able to add some of the data such as level, K, D, A, gpm, xpm. Some is still missing such as HD, TD, HH, items, ability upgrades, and tower status.

howardchung commented 8 years ago

Current design:

We take the uploaded file and MD5 hash it to generate a key.
We use this key for storing the replay blob in Redis, the JSON match blob, and for routing the user to the match.
Replay blob expires after an hour.
JSON match blob expires after 7 days.
parser.js "mocks" the URL by serving the Redis blob though an Express server (localhost)
Example: /matches/99bd628c72e82b5e8524ae661ca417d7

sozuoka commented 8 years ago

From my knowledge, that seems like quite a lot of progress so far :D But imo only 1 hour of replay blob storage is a bit too tight, do you think extending it to 2 or 3 hours is possible?

howardchung commented 8 years ago

We don't need the blob for very long, just long enough to parse it.

howardchung commented 8 years ago

http://colab-sbx-244.oit.duke.edu:5000/request if you want to beta test

sozuoka commented 8 years ago

Oh wow, this is much better than I expected. I would say you can release this feature as it is right now, but regardless, I still have a few questions: 1 - I don't know about Hero Healing and Tower Damage, but isn't it easier to calculate hero damage by combining the stats provided by the combat log - like in Combat tab? 2 - Is it possible to add drafting info (the ban/pick order and specific heroes) of Captain's Mode? This is actually what I had in mind when I first requested the feature. 3 - Are you planning to add the abilities upgrade, or just omit it completely? Personally I think it's great if we can have it, but it's not top priority either.

howardchung commented 8 years ago

This requires Redis access on parse nodes so we need to complete the Kubernetes migration first. I think we're going to try to do that this weekend.

I guess we can attempt to sum the hero damage from the combat log, but in general that differs from the damage reported in the API for some reason. I'd rather be consistent, and people can manually sum the damage if they want a ballpark figure.
I think the replay does have drafting information in the "epilogue" packet but it needs to processing to extract since it's in a weird format. Maybe I'll look into it later.
I don't really want to add the ability upgrades data. It adds quite a bit of overhead to the parser since we need to check ability entities (which we don't do at all right now).

muratsu commented 8 years ago

Did you guys think about malicious demo files? I'm a bit concerned about trusting 3rd parties completely. Im pretty sure demo parsers werent coded with security in mind.

howardchung commented 8 years ago

Might be a question for @spheenik. I don't think there is any risk in running any binary through clarity. It'll just crash if it's malformed--we're not treating the uploaded file as an executable.

muratsu commented 8 years ago

Im more concerned about denial type of attacks (infinite loops, etc)

howardchung commented 8 years ago

As far as I know clarity will keep advancing the file stream pointer as it parses, so this can't happen unless the file is of infinite size.

I set a limit of 100MB on uploads. Some replays are larger, but those are generally 2h+ games, which are pretty rare.

spheenik commented 8 years ago

I am certain clarity can be crashed in a gazillion ways when handing it malicious data. A dota replay is a very fragile thing: a single bit off, and you probably can't recover.

I have a hard time thinking of real exploits. Most times, it'll just run into an error, and stop processing. You could certainly craft data that will make clarity read as much data as you specified, but with the upload limit, each attempt is limited to 100MB. Making it loop forever and eat cpu uselessly might also be possible, but one would have to take a look at how to craft it.

howardchung commented 8 years ago

I think we're going to roll this out next week.

odota / core

Custom Replay Parsing (uploads) #830