praw-dev / praw

PRAW, an acronym for "Python Reddit API Wrapper", is a python package that allows for simple access to Reddit's API.
http://praw.readthedocs.io/
BSD 2-Clause "Simplified" License

Unreleased Feature inquiry #2025

Closed xDido closed 3 months ago

xDido commented 3 months ago

Describe the Documentation Issue

Hello, PRAW community,

I would like to thank you for the effort you have put into this project.

I'm trying to scrape as much as I can from [r/Egypt] to collect Arabic text for a custom Arabic dataset for a university project. When I try to scrape the subreddit using

for submission in subreddit.new(limit=None):

it gives me the same 673 posts with their respective comments, then the listing generator ends.

I make a new call after 1 minute to try to fetch more posts, but I end up with the same ones.

Is there a way to start scraping from a certain point in the subreddit instead of scraping the same posts over and over?

I have seen in the documentation for the unreleased version that the stream_generator() function accepts a parameter called continue_after_id. I'm wondering whether this might help in my case and, if so, how I can access this version, because the feature is not available in 7.7.1.

Thanks in advance,

Attributes

Location of the issue

Unreleased, Inquiry

What did you expect to see?

Helpful advice, and explanation regarding the unreleased changelog.

What did you actually see?

unreleased changelog.

Proposed Fix

Helpful advice, and explanation regarding the unreleased changelog.

Operating System/Web Browser

Windows, Chrome

Anything else?

Thanks

LilSpazJoekp commented 3 months ago

The majority of listings on Reddit are limited to 1000 items. This is a hard limit and there isn't a way around it using Reddit's API. You're likely getting fewer than 1000 items because some of those posts were removed by Reddit's spam systems or AutoModerator, or were posted by shadowbanned users. continue_after_id is really only for telling the generators behind the stream functions where in the listing to start yielding items.

Re unreleased features: We're planning on making a release soon. However, those features are unlikely to resolve your issue.
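For illustration, here is a minimal sketch of how repeated calls could at least avoid storing the same posts again. It assumes `subreddit` is a PRAW Subreddit object (for example `reddit.subreddit("Egypt")`). The helper name `collect_unseen` and the `seen_ids` set are hypothetical, not part of PRAW. It does not get around the listing limit; it only filters out posts already collected.

```python
def collect_unseen(subreddit, seen_ids):
    """Return submissions from the "new" listing whose IDs are not yet in seen_ids.

    `subreddit` is assumed to be a PRAW Subreddit, e.g. reddit.subreddit("Egypt").
    `seen_ids` is a set of submission IDs; it is updated in place.
    This helper is a hypothetical illustration, not a PRAW function.
    """
    fresh = []
    # limit=None asks for as many items as Reddit will serve for this listing;
    # the listing itself is still capped on Reddit's side.
    for submission in subreddit.new(limit=None):
        if submission.id not in seen_ids:
            seen_ids.add(submission.id)
            fresh.append(submission)
    return fresh
```

Calling this periodically with the same `seen_ids` set returns only posts that appeared since the previous call. Posts that have already fallen out of the listing are still out of reach.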

xDido commented 3 months ago

Thanks for the speedy reply, truly appreciated!

I will try to search if I can get posts starting from a "certain date" or "before a certain date". If I can't, I will try to use Selenium.

I would also appreciate suggestions if you have any. Thanks LilSpazJoekp,

LilSpazJoekp commented 3 months ago

I will try to search if I can get posts starting from a "certain date" or "before a certain date".

This isn't possible either. It used to be, but Reddit ultimately removed it.

If I can't, I will try to use Selenium.

Browsers have the same limitation.

I would also appreciate suggestions if you have

If you're a moderator or researcher, you can request access to Pushshift. Otherwise, there isn't much else you can do unless you capture the posts yourself as they are posted.
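Capturing posts as they are posted could look roughly like the sketch below. It assumes `subreddit` is a PRAW Subreddit object and uses `subreddit.stream.submissions(skip_existing=True)` to yield only posts made after the script starts. The function name `archive_stream`, the output path, the chosen fields, and the `max_items` stop condition are all hypothetical choices for illustration.

```python
import json


def archive_stream(subreddit, path, max_items=None):
    """Append each newly posted submission to a JSON Lines file.

    `subreddit` is assumed to be a PRAW Subreddit, e.g. reddit.subreddit("Egypt").
    `max_items` is a hypothetical stop condition; with None the loop runs
    for as long as the stream keeps yielding items.
    Returns the number of records written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        # skip_existing=True ignores posts already in the listing at startup,
        # so only posts made while the script is running are stored.
        for submission in subreddit.stream.submissions(skip_existing=True):
            record = {
                "id": submission.id,
                "title": submission.title,
                "selftext": submission.selftext,
                "created_utc": submission.created_utc,
            }
            # ensure_ascii=False keeps Arabic text readable in the file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            fh.flush()
            written += 1
            if max_items is not None and written >= max_items:
                break
    return written
```

Left running over a long period, a script like this builds up its own archive, one JSON object per line, independent of how many items the listings expose at any moment.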