meraki-analytics / cassiopeia

An all-inclusive Python framework for the Riot Games League of Legends API. Cass focuses on making the data easy and fun to work with, while providing all the tools necessary to create a website or do data analysis.
MIT License

Unclear if I'm making unnecessary API calls #146

Closed pfmoore closed 6 years ago

pfmoore commented 6 years ago

I'm experimenting with the new Cassiopeia API and I'm unsure if I'm doing things correctly. From my reading of the documentation, the default pipeline includes an in-memory cache, which I assume means that repeated calls to the same API won't be needed. But experimenting, I see the same API being called multiple times:

>>> import cassiopeia as cass
>>> cass.set_default_region("EUW")
>>> gustavenk = cass.get_summoner(name="GustavEnk")
>>> len(gustavenk.match_history)
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=0&endIndex=100
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=100&endIndex=200
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=200&endIndex=300
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=300&endIndex=400
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=400&endIndex=500
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=500&endIndex=600
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=600&endIndex=700
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=700&endIndex=800
763
>>> gustavenk.match_history[1]
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=0&endIndex=100
<cassiopeia.core.match.Match object at 0x00000000050C6DA0>
>>> gustavenk.match_history[1]
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=0&endIndex=100
<cassiopeia.core.match.Match object at 0x00000000050C64A8>
>>> m = gustavenk.match_history[1]
Making call: https://euw1.api.riotgames.com/lol/match/v3/matchlists/by-account/23549260?beginIndex=0&endIndex=100

There are multiple "Making call" messages, all for the same URL.

My reason for being concerned about API calls is that I'm not clear to what extent I can set up a persistent store in the pipeline and end up being able to run a data analysis program twice, the second time needing no access to the actual API and serving everything from the store. The alternative would be to write custom code to dump all of the data collected by the API into a separate database, and ignore the persistence framework. (This may be needed anyway, to an extent, as I've not yet determined how the "recent matches" API would handle being called at a later time with a persistent store that's from a previous time - but even if I can't do this, I'd like to still just store match IDs and rely on the store to avoid API access for match data by ID).

(BTW, is there a way to set my default region as an environment variable or similar? Without needing to load a non-default settings file, which is no easier than setting the region in my code, as far as I can tell...)

jjmaldonis commented 6 years ago

> From my reading of the documentation, the default pipeline includes an in-memory cache, which I assume means that repeated calls to the same API won't be needed. But experimenting, I see the same API being called multiple times:

You read the documentation correctly: the in-memory cache prevents the same call from being executed more than once. There are a few exceptions, and matchlist/match history is one of them. The reason you're seeing the same URL hit multiple times is that Cass uses different parameters to access different parts of a summoner's match history each time. This is necessary because Riot changed matchlist to return at most 100 games per call. If you don't want the entire match history, simply stop iterating over it after 100 matches and you will make only one call to the Riot API.
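To illustrate the paging behaviour described above, here is a minimal sketch in plain Python. It is not Cass's actual implementation: `iter_match_ids` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the Riot matchlist endpoint using the 763-game history from the example session.

```python
from itertools import islice

PAGE_SIZE = 100  # the matchlist endpoint returns at most 100 games per call


def iter_match_ids(fetch_page, page_size=PAGE_SIZE):
    """Yield match IDs lazily, requesting a new page only when it is needed."""
    begin = 0
    while True:
        page = fetch_page(begin, begin + page_size)
        yield from page
        if len(page) < page_size:
            return  # a short page means the history is exhausted
        begin += page_size


# Stand-in for the real API (assumption): 763 fake match IDs and a call log.
ALL_IDS = list(range(763))
calls = []


def fake_fetch_page(begin_index, end_index):
    calls.append((begin_index, end_index))
    return ALL_IDS[begin_index:end_index]


# Stopping after 100 matches triggers a single page request.
first_100 = list(islice(iter_match_ids(fake_fetch_page), 100))
calls_for_first_100 = len(calls)

# Walking the whole history requests every page.
calls.clear()
everything = list(iter_match_ids(fake_fetch_page))
calls_for_everything = len(calls)
```

With a 763-game history, the full walk needs eight page requests, which matches the eight "Making call" lines in the session above, while taking only the first 100 matches needs one.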

> My reason for being concerned about API calls is that I'm not clear to what extent I can set up a persistent store in the pipeline and end up being able to run a data analysis program twice, the second time needing no access to the actual API and serving everything from the store.

You can absolutely set up a persistent store in the pipeline and pull the data from Riot only once. That's exactly what it's meant for, and we have a simple disk-based database that I've been using that does this really well (I can run data analysis programs while completely disconnected from the internet). However, there are again a few exceptions to what data we store permanently. The match history of a summoner is one such example: it's non-trivial to know whether a summoner's match history has been updated, and therefore when new data needs to be pulled. So right now we don't cache or store match history results for a summoner, and the match history has to be pulled again every time a program is run. Just about everything else is stored properly, though. Match history is the only exception I can think of.

There are some ways to properly cache or permanently store a summoner's match history, but they require quite a bit of new logic. This is on our todo list, but not very high up. We would need functionality to store partial match histories, plus logic for updating a stored match history whenever the summoner plays a new game.

> This may be needed anyway, to an extent, as I've not yet determined how the "recent matches" API would handle being called at a later time with a persistent store that's from a previous time - but even if I can't do this, I'd like to still just store match IDs and rely on the store to avoid API access for match data by ID.

I didn't exactly follow this, but it sounds like you might want the functionality to store partial match histories. In that case, storing the match IDs separately would be one solution (assuming you don't need the other data that comes with the match history), and you could use a separate database tightly coupled to the data pipeline to handle all your other data storage.
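As a rough sketch of the "store the match IDs separately" idea, the snippet below keeps IDs in a small SQLite table using only the standard library. The table layout and the helper names (`open_id_store`, `save_match_ids`, `load_match_ids`) are assumptions made for illustration and are not part of Cass; the match IDs are made up.

```python
import sqlite3


def open_id_store(path=":memory:"):
    """Open (or create) a tiny store of match IDs keyed by account ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS match_ids ("
        " account_id INTEGER NOT NULL,"
        " match_id INTEGER NOT NULL,"
        " PRIMARY KEY (account_id, match_id))"
    )
    return conn


def save_match_ids(conn, account_id, match_ids):
    """Add IDs from a crawl; IDs that are already stored are skipped."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO match_ids VALUES (?, ?)",
            [(account_id, match_id) for match_id in match_ids],
        )


def load_match_ids(conn, account_id):
    """Return the stored IDs for one account in ascending order."""
    rows = conn.execute(
        "SELECT match_id FROM match_ids WHERE account_id = ? ORDER BY match_id",
        (account_id,),
    )
    return [row[0] for row in rows]


# Two overlapping crawls for the account from the example session.
conn = open_id_store()
save_match_ids(conn, 23549260, [3, 1, 2])
save_match_ids(conn, 23549260, [2, 4])
stored = load_match_ids(conn, 23549260)
```

Passing a file path instead of `":memory:"` would make the IDs survive between runs, so a later analysis run could load them and request each match by ID through the pipeline's own persistent store.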

> BTW, is there a way to set my default region as an environment variable or similar?

No, there isn't a way to set the default region via an env var. You have to set it in your configuration/settings, set it programmatically like you did in your example, or specify the region for all objects individually.
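Since Cass does not read an environment variable itself, one workaround is to read it in your own code and pass the result to `set_default_region`. This is only a sketch: the variable name `CASS_DEFAULT_REGION` and the helper `region_from_env` are made up for this example.

```python
import os

# Hypothetical variable name; Cass itself does not look at it.
ENV_VAR = "CASS_DEFAULT_REGION"


def region_from_env(fallback="EUW", environ=os.environ):
    """Return the region named in the environment, or the fallback if unset."""
    value = environ.get(ENV_VAR, "").strip().upper()
    return value or fallback


# In a real script (assumes cassiopeia is installed):
#     import cassiopeia as cass
#     cass.set_default_region(region_from_env())
```

The `environ` parameter exists only so the helper can be exercised with a plain dict instead of the real process environment.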

pfmoore commented 6 years ago

Ah, cool. I wondered if it might be because the results change. I'm perfectly OK with using the match history API to get match IDs, and then using the match API to get the actual data. That should mean that it's only my crawler process to extract match IDs that would be non-cacheable, and that's fine with me. It actually matches well with the actual use case I have here, which is to crawl for a big batch of sample matches on a one-off basis, but then analyze the actual match data multiple times - it's the analysis step that would run multiple times.

Thanks for the quick response and the clarification. Now that I know I'm not going down completely the wrong track, I'll build some more complete code and see how I get on :smile:

jjmaldonis commented 6 years ago

OK, great. Your plan should work well with Cass, and you might even be able to do your analysis entirely offline after the first run. Let us know if you run into any problems -- we really want Cass to work as well as possible for what you're doing, so we are happy to improve things if there are issues.

jjmaldonis commented 6 years ago

I added a section at the bottom of one of the doc pages mentioning that Match History behaves a bit differently than most other Cass objects. I'll close this but feel free to reopen if you want to for some reason.