justintv / Twitch-API

A home for details about our API
www.twitch.tv

Retrieve all streams accurately #535

Closed chambo-e closed 8 years ago

chambo-e commented 8 years ago

Hello, I'm unable to get an accurate list of all live streams at a given point in time.

I only use one endpoint: https://api.twitch.tv/kraken/streams My process is kinda simple:

I understand that some duplicates are returned because of the offset, the refresh rate and the sorting, but there are a lot of them when querying the whole set of streams. For example, right now the API reports 16479 live streams in `_total`, and that is how many it returns to me. But once I remove the duplicates I'm down to 12178 streams. That's more than 1/4 of the results being duplicates.

Is there a way to retrieve the currently live streams with better accuracy?
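To make the numbers above concrete, here is a minimal sketch of the deduplication step and the duplicate ratio. It assumes each stream object carries a unique `_id` field; the helper names `dedupe_streams` and `duplicate_ratio` are made up for illustration.

```python
def dedupe_streams(streams):
    """Keep the first occurrence of each stream, keyed by its `_id` (assumed unique)."""
    seen = {}
    for stream in streams:
        seen.setdefault(stream["_id"], stream)
    return list(seen.values())


def duplicate_ratio(returned_count, unique_count):
    """Fraction of returned results that were duplicates."""
    return (returned_count - unique_count) / returned_count


# Figures quoted in this comment: 16479 returned, 12178 unique.
ratio = duplicate_ratio(16479, 12178)
print(f"{ratio:.1%} of the returned streams were duplicates")
```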

tadachi commented 8 years ago

Just a hypothesis:

Streams go offline and come online constantly. With a limit of 100 per request and offset-based paging, you have to make more than 200 requests if there are over 20000 streams online. Within that window of requests, some streams go online or offline, which could cause the duplicates and inaccuracies.

Another thing is that viewer counts can change drastically within a short window of time. A stream may be in the top 100, then quickly lose a couple hundred viewers and fall into the next 100, which makes it show up twice. The window for that sounds small though.
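As a rough sketch of the paging arithmetic, assuming the `limit` and `offset` query parameters mentioned above and a maximum limit of 100 (the helper name `page_offsets` is hypothetical):

```python
import math


def page_offsets(total, limit=100):
    """Offsets needed to walk `total` results in pages of `limit` items."""
    pages = math.ceil(total / limit)
    return [page * limit for page in range(pages)]


offsets = page_offsets(16479)
print(len(offsets), "requests, last offset", offsets[-1])
```

Every one of those requests is a separate snapshot, so any change in ordering between them can shift a stream from one page to another.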

chambo-e commented 8 years ago

Absolutely, you are right, but all the requests are executed concurrently and it takes less than 3 seconds to complete them all. Missing streams are entirely expected given the live nature of the data, but I doubt that a quarter of the currently live channels go offline during such a short timeframe. I agree about the sudden viewer count changes too, but that would affect a very small number of streams and generate very few duplicates.

Also, I don't know what kind of sorting Twitch uses on this endpoint, but when I query channels with 0 viewers they are always returned in a different order, which might explain some of the duplicates. It could be interesting to add a secondary sort field, like the channel name, to improve the consistency of the results.
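To illustrate what a secondary sort field would buy, here is a small sketch. The sort order by descending viewer count and the `name` tie-breaker are assumptions for the example, not the endpoint's documented behaviour.

```python
def stable_order(streams):
    """Sort by viewers (descending), breaking ties on channel name."""
    return sorted(streams, key=lambda s: (-s["viewers"], s["name"]))


# Three channels tied at 0 viewers, received in two different orders.
first = [{"name": "c", "viewers": 0}, {"name": "a", "viewers": 0}, {"name": "b", "viewers": 0}]
second = [{"name": "b", "viewers": 0}, {"name": "c", "viewers": 0}, {"name": "a", "viewers": 0}]

# With the tie-breaker, both inputs produce the same page layout.
print(stable_order(first) == stable_order(second))
```

With a deterministic tie-breaker, channels with the same viewer count would land on the same page from one request to the next, so they could not appear on two pages just because the order was reshuffled.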

asterius1 commented 8 years ago

Most of the weirdness comes from caching of the results. For example, you may get part of the results cached half a minute ago and part from just now.

I was trying to solve the same problem, and the best solution I have right now is to go through all of the streams a few times a minute. With each repetition fewer streams are missed (about 3 to 4 repetitions does the job).

I wonder though if there is a better solution. Would it be possible for Twitch to implement a different kind of pagination for this endpoint?
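A minimal sketch of that merging step, assuming each pass is already fetched into a list of stream dicts keyed by a unique `_id` field (the function name `merge_passes` is made up for the example):

```python
def merge_passes(passes):
    """Union several full passes, keeping the most recent copy of each stream."""
    merged = {}
    for streams in passes:
        for stream in streams:
            # Later passes overwrite earlier ones for the same stream id.
            merged[stream["_id"]] = stream
    return merged


pass_1 = [{"_id": 1, "viewers": 10}, {"_id": 2, "viewers": 7}]
pass_2 = [{"_id": 2, "viewers": 8}, {"_id": 3, "viewers": 1}]
print(sorted(merge_passes([pass_1, pass_2])))
```

The trade-off is that a stream which went offline between passes stays in the merged set, so this fills in missed streams at the cost of a few stale entries.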

chambo-e commented 8 years ago

Wow, that's really heavy: a lot of requests for not so much data. Another solution would be to set an expireAt instead of a timeout on the cache for this endpoint. Every cached page would then expire at the same moment, and the data retrieved would be consistent within that time span.
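Here is a sketch of the difference, with no claim about how Twitch's cache actually works. The 30 second window and both function names are assumptions for illustration.

```python
def ttl_expiry(cached_at, ttl=30):
    """Timeout style: each page expires `ttl` seconds after it was cached."""
    return cached_at + ttl


def aligned_expiry(cached_at, window=30):
    """expireAt style: each page expires at the next shared window boundary."""
    return (cached_at // window + 1) * window


# Three pages cached at different moments (seconds on an arbitrary clock).
cached_times = [100, 110, 119]
print([ttl_expiry(t) for t in cached_times])      # staggered expiries
print([aligned_expiry(t) for t in cached_times])  # one shared expiry
```

With a plain timeout, pages cached at different moments expire at different moments, so a full walk can mix old and fresh pages. With a shared boundary, every page in the window is refreshed together.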

chambo-e commented 8 years ago

I've noticed a decrease in the number of duplicate streams returned over the last couple of days. Now I only get ~1-2k duplicates.

@asterius1 could you confirm?

If someone from Twitch is passing by: have you made any improvements?

asterius1 commented 8 years ago

@chambo-e I honestly have no idea, I am not measuring it in production.

DallasNChains commented 8 years ago

There haven't been any functionality changes to this API but there have been scalability improvements. In the future, I would recommend asking these types of questions in the Twitch Developer forums.