praw-dev / praw

PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows for simple access to Reddit's API.
http://praw.readthedocs.io/
BSD 2-Clause "Simplified" License

submission_stream and comment_stream only get new data every second request #501

Closed mallin closed 7 years ago

mallin commented 9 years ago

When using submission_stream or comment_stream to consume items continuously (specifically on /r/all, but any sufficiently active subreddit should work), every second request includes the params limit, count, and before and returns no new items. Every other request has the additional param after, and it returns items as it should.

Having every second request be redundant seems less than ideal when the API has such strict rate limiting.

To reproduce:

import praw
reddit = praw.Reddit(user_agent="PRAW Requests Demo",
                     log_requests=2)  # log each HTTP request and its params

for item in praw.helpers.submission_stream(reddit, "all", limit=100):
    print(item)

Sample output (after skipping the first 100 "old" items and waiting for "live" ones):

GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i3dca', 'limit': 100, 'count': 1}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i3dca', 'after': 't3_3i3dcb', 'limit': 100, 'count': 1}
status: 200
1 :: What's something that you wanted to try, knew you'd just love it, but en...
1 :: [PS4] Looking for Gorgon Maze CP
1 :: (Spoilers All) Why do we not believe any of the deaths in the Season 5 f...
1 :: Win Soocoo S60 Action Cam
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i3dcf', 'limit': 100, 'count': 2}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i3dcf', 'after': 't3_3i3dcg', 'limit': 100, 'count': 2}
status: 200
1 :: Hong Kong heritage experts conclude that doomed Wan Chai pawn shop is no...
1 :: A moment to appreciate F1's YouTube channel
1 :: Ready for tonight
1 :: Offering: Spanish (native) Seeking: English (native)
0 :: Mini sumo wrestlers face a giant
1 :: How is owning a website with fake profiles like Ashley Madison not fraud?
1 :: [H] 3 Flachion Keys [W] Hella Cologne Stickers
...etc

PRAW 3.2.1, Python 3.4.0

(I'm not sure whether this is somehow intended behaviour or not. I came across it while trying to find the source of another strange delay when streaming live data using submission_stream or comment_stream. Is there something I should read to understand why helpers._stream_generator was written the way it was, or its known issues, or something?)

bboe commented 9 years ago

Thanks for looking into this issue. There are two phases when using the stream functions:

1) Try to collect up to limit submissions back in time.

2) Resume from the most recent submission, which was the first item returned in phase 1. If at any one of these steps there are more than 100 items, we need to use the after parameter to page through until we reach the former most recent submission.

Phase two then repeats. Often a subsequent request will return the same results as a previous request; that indicates no new data was returned during that period, in which case we need to increment the count in order to bypass the cache (a rough sketch follows).
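
In rough Python, the two phases look something like the following. This is a minimal sketch against reddit's JSON API, not PRAW's actual _stream_generator; it omits the after-based paging, and the session passed in should carry a descriptive User-Agent header, as reddit requires.

import time
import requests

API = 'https://api.reddit.com/r/all/new'

def fetch(session, **params):
    """Return the listing's children as dicts, newest first."""
    params = {k: v for k, v in params.items() if v}  # drop unset params
    children = session.get(API, params=params).json()['data']['children']
    return [child['data'] for child in children]

def stream(session, limit=100):
    # Phase 1: collect up to `limit` submissions back in time.
    items = fetch(session, limit=limit)
    newest = items[0]['name'] if items else None
    for item in reversed(items):  # yield oldest to newest
        yield item
    # Phase 2: repeatedly fetch everything newer than the newest item
    # seen so far, bumping `count` to try to bypass the cache.
    count = 0
    while True:
        time.sleep(2)
        items = fetch(session, limit=limit, before=newest, count=count)
        if items:
            newest = items[0]['name']
        for item in reversed(items):
            yield item
        count += 1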


I think your output may be causing confusion. In order to return the results to you in oldest-to-newest order, PRAW may first need to make a number of requests to obtain all the data since the last time it yielded items. That is why you only see item output after two requests. In many cases, however, it should only require one request before it returns data to you. If you never see that, then maybe there is a bug.

13steinj commented 9 years ago

@mallin I believe this is down to how the method works. First it calls get_new with certain params; that's one request. Then it calls it again with the after parameter and subtracts the difference from the first so that you won't end up with duplicates. In my mind there is no way to avoid this.

However, if you use OAuth you can get 60 requests a minute (you'd have to set your rate limiting as such in your praw.ini).
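
For reference, the praw.ini tweak might look like the excerpt below; the api_request_delay key is an assumption on my part rather than a confirmed setting, so check the configuration docs for your PRAW version:

[DEFAULT]
# Assumed setting name: with OAuth, reddit allows ~60 requests/minute,
# i.e. a one-second delay between requests.
api_request_delay: 1.0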

mallin commented 9 years ago

Thanks for the quick response!

To clarify what I didn't make clear before: this is not a problem at "startup", but in the "steady state". Run that code snippet for an hour and this is still happening. In fact, startup seems to be the only time it doesn't happen! The first request just has count=0 and limit=100, with no before or after, and it returns 100 items.

As for this:

Often a subsequent request will return the same results as a previous request. That indicates there were no new data returned during that period. In which case we need to increment the count in order to bypass the cache.

It seems to me that the opposite is happening. There's a request with the limit, before, and count params set, which yields no items, and then a second request, an almost exact copy of the first except that the after param is also set. The count param is the same in the second request as in the first (as is the before param). Then, once the second request has returned items, count is incremented for the next request, before is changed, and after is removed. This request then returns no new items, and the cycle continues. You can see this in the sample output above.

In many cases, however, it should only require one request before it returns data to you. If you never see that, then maybe there is a bug.

When I run my sample code, past the first 100 items, this never happens. The cycle as I have described runs continuously, only returning new items every second request.

(A longer sample output, showing nothing different from the original sample output except that this pattern continues unchanged, is here: http://pastebin.com/6kzDKjB7).

bboe commented 9 years ago

It was in fact a bug. I think reddit must have changed how they respond to the count parameter. If you fetch the latest master everything should be good to go. Thanks for filing!
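
For anyone who wants the fix before the next release, installing straight from the repository should do it (one possible invocation, assuming pip is available; adjust to your environment):

pip install --upgrade https://github.com/praw-dev/praw/archive/master.zip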

mallin commented 9 years ago

Confirmed: this fixes the issue described (as well as a lot of other issues with delayed items that I was having). Much appreciated. Isn't it great when it's a one-line fix?

I think, though, that this might have introduced a regression:

In the steady state (and streaming submissions from /r/all), requests are made with the uniq, before, and limit params set. This returns, around the time of posting (11:30 UTC), between one and seven or so submissions every two seconds, and all is as it should be. (Interestingly, after seems not to be used at all.)

Then, one of these requests (uniq, before, and limit set) returns no new items.

Whether this is because there just were no submissions in the last two seconds, or because something else went wrong and this is the first sign, I don't know, though I suspect it's that there just were no submissions. This doesn't happen at all when streaming comments from /r/all, presumably because there is no two-second period in which no comments are made on Reddit.

This in itself wouldn't be a problem, but the stream then enters a state in which it makes requests with only uniq and limit set (uniq incrementing as it should), which don't return any new items. This goes on for a number of requests (the most I've seen is 35!). All are HTTP 200, but all return no new items. Then, one of these requests (not looking any different from its ineffective predecessors) works, returning all the items from the "down" period.

This might all be explained more clearly by some sample output (same code as above): normal operation until uniq=4, then the problem until uniq=22, at which point all the missing items are returned.

(...snip...)
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wdg', 'uniq': 97, 'limit': 100}
status: 200
1 :: Općina Đelekovec, po grbu ne baš najbolje mjesto za izbjeglički kamp.
1 :: Tournament Arena as a feature ?
1 :: Chart showing which auto makers Tesla is stealing customers from
1 :: Ukrainians haven't lost their sense of humour
1 :: 720p 엑스 마키나 토렌.트 torrent 다시보기 재방송 DVD
1 :: overview for fdfdfdsfgthdfws (/r/Fake)
1 :: MIUI 7 Global Beta ROM 5.8.22 For a Range of Xiaomi Smartphones
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wdq', 'uniq': 98, 'limit': 100}
status: 200
1 :: [H] 50€ PayPal/keys [W] Guardian Pin
1 :: Why I'm quitting CS
1 :: Steenkamp parents criticise verdict
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wdu', 'uniq': 99, 'limit': 100}
status: 200
1 :: Panic selling grips global markets as stocks crash on China turmoil
1 :: Drax Project - City Lights [Alt. Pop]
1 :: [STORE] Bet & Play Skinzzzz
1 :: 山本裕典、悲痛な胸の内を明かす (モデルプレス):【etce】
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6we1', 'uniq': 0, 'limit': 100}
status: 200
1 :: 1日で桃肌ボディを作るスペシャルケア5つ (モデルプレス):【etce】
1 :: Girls keep rejecting my friend requests....
1 :: 「体が臭う…」はダイエット臭かも!特徴と改善策 (モデルプレス):【etce】
1 :: 7TarD6b
1 :: guess who just watched me
1 :: [H] Up to 165 Keys [W] M9 Slaughter Center diamond
1 :: Keeping runners while going into The Crazy Place...
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wea', 'uniq': 1, 'limit': 100}
status: 200
1 :: KETOVERSARY UPDATE - so many victories, and some kind of honest downsides?
1 :: Chuck E. Cheese in Kentucky refuses service to uniformed police officer ...
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wec', 'uniq': 2, 'limit': 100}
status: 200
1 :: A controversial theory: are women and blacks limited in their abilities ...
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6weg', 'uniq': 3, 'limit': 100}
status: 200
1 :: Ishiguro archive heads to Texas
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wei', 'uniq': 4, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 5, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 6, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 7, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 8, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 9, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 10, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 11, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 12, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 13, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 14, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 15, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 16, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 17, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 18, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 19, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 20, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 21, 'limit': 100}
status: 200
GET: https://api.reddit.com/r/all/new.json
params: {'uniq': 22, 'limit': 100}
status: 200
1 :: [build help] System for both streaming and videocapture
1 :: インフィニティ Q30 と QX30、2016年春に米国発売 (レスポンス):【etce】
1 :: こうのとりISSに到着、油井さんが初キャッチ (産経新聞):【etce】
1 :: Fin24.com | Ashley Madison hack 'a lesson' for SA firms
1 :: [#669|+320|66] Raekwon The Chef says he wants to do a Wu-Tang Clan Biopi...
1 :: Lavrov Sees 'No Major Difference' in Who Becomes Next US President
1 :: Lavrov Criticizes Kiev Over Failure to Implement Constitutional Reform
1 :: Whispers
0 :: Redditors who have moved to another country, what do you miss about your...
1 :: 西野カナ、ツアーファイナルでサプライズ続々 9000人のファン驚く (モデルプレス):【etce】
1 :: Janna by Raichiyo
1 :: As requested yesterday, I'm a Law graduate in the United Kingdom who wor...
1 :: Slow down
1 :: So {f}ucking Horny Tonight 😶💋😏
1 :: Suggestion for the upgraded stick / torch
1 :: The Persistence of the War Party: Why they’ll never go away – and why we...
1 :: Apnea Diver - Training app for free divers, apnea swimmers and recreatio...
1 :: Greg, why did you steal that rock from me?
1 :: UCF Business Degree without all of the career classes?
1 :: Indian Teen Turns Naughty [x-post /r/IndianTeens]
1 :: Russia’s space industry may see further budget cuts — space agency
1 :: A driver's chat
1 :: test post please ignore
0 :: Police violently disperse trash protest in Beirut with water cannons, te...
1 :: Deadlift off the ground or in a cage?
1 :: European partners of Russian Helicopters fully meet their obligations — ...
1 :: No, Munich is not considering ditching Linux and going back to Windows
1 :: overview for marketingmicon
1 :: Help! Eed to replace the fan on a Vulcan Compact 60 central heater
1 :: 【育鵬社教科書】沖縄県石垣市が継続採択 (産経新聞):【etce】
1 :: 皇太子ご一家、那須でご静養 (産経新聞):【etce】
1 :: Inter-Korean talks aimed at preventing future provocations — South Korea...
1 :: UAE Rescues British Hostage amid Yemen Civil War
1 :: Best bank Michigan
1 :: German foreign minister to travel to Iran in October
1 :: Why the Fuck Do Air Shows Still Exist?
1 :: Unpicking The Mail on Sunday's Horrifying Vision of Britain Under Jeremy...
1 :: Senate Minority Leader Harry Reid announces Iran deal support
1 :: More rigs added in North Dakota
1 :: [VHDL question] Loading hex and binary data at the same time from text f...
1 :: Two women die after falling off Wyoming's Teewinot Mountain
1 :: Israel Buys Three-Quarters Of Its Oil From Kurds: Report
1 :: Global Slowdown Fears Spark Flight To Safety As Stock Outflows Hit 15-we...
1 :: Now a link to this page.
1 :: How The Stock Market Plunge Affects Silicon Valley's Startup Market
1 :: [Electronic] Jason Efstathiou - Spotlights (Free DL)
1 :: Closer Than Ever To The Water’s Edge
1 :: Civilizations for Mesopotamia Match Checklist
1 :: Metallica - Nothing Else Matters (Bojan Stipic)
1 :: 日本国内のラッコ、絶滅の危機 水族館に計15匹 (産経新聞):【etce】
1 :: No, Munich is not considering ditching GNU/Linux and going back to Windows
1 :: LA Kings, Christian Ehrhoff Agree To One-Year, $1.5M Deal
1 :: What happens in terms of conductivity and resistance if we get a sheet o...
1 :: -:)[×BOX||OFFICE×]--Sinister:2-^2015:-Full--Movie--^Download.
1 :: Димитрий Смирнов: Можно ли убивать еретиков?
1 :: Huawei Nexus leaked with camera bump
1 :: After a month of playing, results of my scouts aka my current team!
1 :: miwa「めざましライブ」サプライズ演出に会場熱狂 初披露楽曲も<セットリスト> (モデルプレス):【etce】
2 :: My mom adopted a blind Bengal. This is what he does when he gets lost, o...
1 :: 【VW ゴルフ R ヴァリアント 試乗】洗練と高性能が生むクルマの味わい深さ…島崎七生人 (レスポンス):【etce】
1 :: Feedback on my story The God Machine?
1 :: So I was playing a comp game and this happened
1 :: Win some Clif Bar schwag, including a month's worth of samples, from Fit...
1 :: Gym equiptment should keep record of the exercises on them and if you ha...
1 :: Eating Powerbars on the bike
0 :: Pedro: "Chelsea were the club that resolved the situation the quickest a...
1 :: [24.Aug.2015] Femjoy.com x5
1 :: Can I get an opinion on this build I'm doing for my friend ?
1 :: Any word about the MGS:PP PC port?
1 :: Unbox Therapy - Does it Suck? - $15 Amazon Headphones
1 :: AMA - As a EUW player, i lost 2 bets, got TSM flair then now CLG flair.
1 :: Need a ride to school!
1 :: Anime were MC isn't shy and embarrassed about echhi.
1 :: My Logo
1 :: Why Introverts Make Great Entrepreneurs - WSJ
1 :: Is it possible to change Anki's expected solution time for a card?
1 :: Sushi and Sashimi from Katsuya
1 :: /r/SS13 ETHNICITY SURVEY
1 :: [H] Blue Bfk CH FT [W] Keys/Knives
1 :: How President Bernie Sanders would handle foreign policy
1 :: Any advice for cold approach at work? I just want to be in good speaking...
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6whe', 'uniq': 23, 'limit': 100}
status: 200
1 :: 林業省力化へロボット開発、九州工... (西日本新聞):【loca】
1 :: [feedthecollapse] Hop Along -- Tibetan Pop Stars [Alternative Folk/ Indi...
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6whh', 'uniq': 24, 'limit': 100}
status: 200
1 :: Hop Along -- Tibetan Pop Stars [Alternative Folk/ Indie Rock] (2012)
1 :: [Request] Einstein's Relativity Watch
1 :: overview for sfhbdg346fvb34fcxc (/r/shaw)
1 :: LF: Primeape
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6whl', 'uniq': 25, 'limit': 100}
status: 200
1 :: Win some Clif Bar schwag, including a month's worth of samples, from Fit...
1 :: (News Focus) U.S. nuclear bomber deployment crushing pressure on recalci...
1 :: 山形道の上り線、規制解除 (山形新聞):【loca】
1 :: Japanese Cheesecake [OC] [3264x2448]
1 :: Great shot in 'Up' (x-post r/CineShots)
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6whq', 'uniq': 26, 'limit': 100}
status: 200
1 :: Алексей Навальный — Кто оплатил отдых на яхте для Пескова и Навки за 26 ...
1 :: [Submission] Rumours of an attack on the Home system of Kumo crew Harma ...
GET: https://api.reddit.com/r/all/new.json
params: {'before': 't3_3i6wht', 'uniq': 27, 'limit': 100}
status: 200
1 :: Gameplay Video 2002   
1 :: Fruit + Menthol flavors?
1 :: Question about CS: GO and 200+ fps
1 :: Salmon butties with tartare sauce recipe
(...snip...)

This isn't some statistical anomaly where Reddit was silent for 40 seconds, followed by a burst of posting: the posts that come in the burst (e.g. at uniq=22 here) were made while the requests with uniq=4..21 were returning nothing.

Of course this may be a problem on Reddit's side, but I didn't see this kind of "bunching" at all when looking at the initial problem.

You might have to run the sample code for a while to come across a two-second period with no submissions to Reddit, but this "bunching" is happening all the time at the moment, with about 30 in every 100 requests coming back "empty" as described above.

mallin commented 9 years ago

@bboe My post from three hours ago (understandably) got my account flagged as a spammer! Just drawing your attention here again in case it was hidden immediately after I posted it, or anything like that.

bboe commented 9 years ago

@mallin it's probably due to caching. Maybe uniq isn't busting reddit's cache. Since this is relevant to you, would you mind exploring?

It's also possible that reddit is only periodically updating its all listing (seems less likely).

mallin commented 9 years ago

Sure, I'll explore and continue to report.

It's harder to reproduce at the moment, probably because Reddit is more active now (20:00 UTC) than it was earlier. So it seems like a request returning no items (for whatever reason) triggers this.

Also note that while the original issue caused inefficient use of requests (and other strange delays), this one causes items to be missed entirely: when the "dead time" is long enough, the 100 items the next working request gets aren't enough to cover the gap.
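
A hypothetical guard for this (the names are mine, not PRAW's) could at least make the loss visible: after a run of empty responses, a completely full batch hints that the gap exceeded what one request can cover.

def detect_gaps(batch_sizes, limit=100):
    """Yield a warning whenever a run of empty batches is followed by a
    completely full one, hinting that items may have been skipped."""
    empty_streak = 0
    for size in batch_sizes:
        if size == 0:
            empty_streak += 1
            continue
        if empty_streak and size == limit:
            yield ('possible gap: {0} empty requests, then a full batch '
                   'of {1} items'.format(empty_streak, limit))
        empty_streak = 0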

bboe commented 9 years ago

Reopened the issue.

So I found out what the original issue was. When using count in conjunction with before, reddit will return both before and after ids to help reach the adjacent listings. Due to the way get_content works, any non-zero count (which was occurring) results in fetching a before page and then immediately fetching the page containing earlier submissions, which in all but the rarest of cases contains only submissions that have already been viewed; hence the second useless request each time.

When count is zero (or not provided) in conjunction with before, both the after and before fields are always empty, even if more than limit items newer than the before item are available. By itself, this shouldn't lead to data loss unless more than 1000 items are added prior to the next update, as it should be possible to iterate backwards by updating the before field to the most recent item viewed (this is what submission_stream does).

I think the solution is not to use get_content in the stream functions, but to write a similar reverse_content helper for them. While get_content isn't really ideal, it should still work as is, so it doesn't explain the caching issues.

bboe commented 9 years ago

Confirmed that the issue does in fact have to do with caching on reddit's end. When running the following script, a cycle of multiple requests occurs because a fetch of the listing without any before param returns the same data as a previous similar request. This cycle continues for approximately 30 seconds, until the cache expires (maybe it's the CDN cache?).

I write similar because I have attempted to modify parameters to break the cache, without success. I tried uniq, count, and limit, none of which appear to have made any difference. This bears more exploration.

In case one is wondering why we periodically drop the before param: it is due to the possibility that the before item has been removed. In such cases the listing will always return empty, so it's important to retry fetching the base listing when an empty result is returned.

#!/usr/bin/env python
from __future__ import print_function
import requests
import sys
import time

def reverse_stream(session, **params):
    """Fetch /r/all/new and yield its items oldest to newest."""
    url = 'https://api.reddit.com/r/all/new'
    params = {k: v for k, v in params.items() if v}  # drop unset params
    query = '&'.join('{0}={1}'.format(k, v) for k, v in params.items())
    print('\nREQUEST: {0}{1}\n'.format(url, '?' + query if query else ''))

    # Pass params only here; appending them to the url as well would
    # duplicate them in the actual request.
    response = session.get(url, params=params).json()
    for item in response['data']['children'][::-1]:
        yield item['data']

def submissions(session):
    before = None
    uniq = 1
    while True:
        time.sleep(2)
        newest = None
        # Vary the limit (100-199) from request to request in an
        # attempt to bust the cache.
        for item in reverse_stream(session, limit=100 + uniq % 100,
                                   before=before):
            newest = item['name']
            yield item
        # If the batch was empty, newest is None and the before param
        # is deliberately dropped on the next request (see above).
        before = newest
        uniq += 1

def main():
    with requests.session() as session:
        session.headers['User-Agent'] = 'BBOE STREAM TEST v0.1'
        for item in submissions(session):
            print(item['name'])

if __name__ == '__main__':
    sys.exit(main())

bboe commented 9 years ago

Aha! Final update for tonight. The cache can be broken by adjusting the limit to values less than 100 when hitting the root page. I'll look into rewriting the streams to use the PRAW version of the above code with the following modifications:

def submissions(session):
    before = None
    root_listing = 0
    while True:
        time.sleep(2)
        newest = None
        if before:
            limit = 100
        else:
            # No before param: cycle the root listing's limit through
            # 100, 99, ..., 86 so consecutive root requests differ.
            limit = 100 - root_listing
            root_listing = (root_listing + 1) % 15
        for item in reverse_stream(session, limit=limit, before=before):
            newest = item['name']
            yield item
        before = newest
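
(Presumably this works because the cache is keyed on the full query string: with a two-second sleep between requests, cycling through 15 distinct limit values means no root-listing URL repeats within the roughly 30-second cache window observed above.)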

mallin commented 9 years ago

Aha indeed! The version with root_listing seems to work flawlessly.

The previous version still has some of the strange delay issues I've been having all along, now revealed to be due to caching.

Updated main to demonstrate delays in case you're interested:

import datetime

def main():
    history = set()
    with requests.session() as session:
        session.headers['User-Agent'] = 'BBOE STREAM TEST v0.1'
        for item in submissions(session):

            item_time = datetime.datetime.utcfromtimestamp(item["created_utc"])
            now_time = datetime.datetime.utcnow()
            recv_delay = (now_time - item_time).total_seconds()

            output = "{0} [{1:+.2f}s]".format(item["name"], recv_delay)
            if item["name"] in history:
                output += " SEEN"
            print(output)

            history.add(item["name"])

svenstucki commented 8 years ago

What's the status of this issue?

I have a probably related issue with PRAW 4. When using subreddit('all').stream.submissions() I get a lot of duplicate entries (e.g. I just got 16k submissions, of which 14k were duplicates, in about 9 minutes).

bboe commented 8 years ago

PRAW4's stream is implemented completely differently, and as a result this issue will likely be closed when PRAW4 is out of beta.

If you are having issues with PRAW4's streams, please open a new issue and include the code you are using. The stream should never return duplicates unless you are calling the stream function more than once, in which case I would recommend not doing so; you'll get some duplicates that way, but it shouldn't be 16k.
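
To illustrate the single-stream pattern, a minimal PRAW 4 sketch (the credentials are placeholders):

import praw

reddit = praw.Reddit(client_id='...', client_secret='...',
                     user_agent='stream demo by /u/yourname')
# Create the stream once and iterate it forever. Constructing a fresh
# stream on every pass re-yields up to ~100 recent items each time,
# which shows up as duplicates.
for submission in reddit.subreddit('all').stream.submissions():
    print(submission.id)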

bboe commented 7 years ago

Closing this issue as this should be fixed in PRAW4.