mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

Queries that specify `before` and `after` can return a different number of results than reported as available by Pushshift #13

Open mattpodolak opened 3 years ago

mattpodolak commented 3 years ago

Test Query:

from pmaw import PushshiftAPI

api = PushshiftAPI()

# limit=None: no cap on the number of results requested
comments = api.search_comments(
    after=1606262347,
    before=1618581599,
    subreddit="CovidVaccinated",
    fields=["id", "subreddit", "link_id", "parent_id", "is_submitter", "author",
            "author_fullname", "body", "score", "created_utc", "permalink"],
    limit=None,
)

Results:

40730 result(s) available in Pushshift
Checkpoint:: Success Rate: 71.00% - Requests: 100 - Batches: 10 - Items Remaining: 33898
Checkpoint:: Success Rate: 79.00% - Requests: 200 - Batches: 20 - Items Remaining: 25661
Checkpoint:: Success Rate: 81.67% - Requests: 300 - Batches: 30 - Items Remaining: 18163
Checkpoint:: Success Rate: 81.75% - Requests: 400 - Batches: 40 - Items Remaining: 11467
Checkpoint:: Success Rate: 82.80% - Requests: 500 - Batches: 50 - Items Remaining: 4262
Checkpoint:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
1 result(s) not found in Pushshift

Discovered in #12

mattpodolak commented 3 years ago

A potential cause is how the database is queried during time slicing. The UTC timestamp of the oldest item in a batch is used as the `before` value when generating subsequent timeslices, and Pushshift queries the database using gt and lt (strict comparisons) for the `after` and `before` timestamps.

If multiple items have the exact same UTC timestamp but are not all returned in a single query (due to the 100-item limit), the leftover items have a timestamp equal to the new `before` value. The lt comparison excludes them, so we would expect them not to be returned in any subsequent timeslice.
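
The suspected mechanism can be reproduced with a small in-memory simulation. This is only a sketch under stated assumptions, not pmaw's or Pushshift's actual code: `query` and `fetch_all` are hypothetical helpers, and the sketch assumes results come back newest first, that `after`/`before` are applied as strict gt/lt, and that each request returns at most 100 items.

```python
from typing import List, Tuple

Item = Tuple[str, int]  # (id, created_utc)


def query(db: List[Item], after: int, before: int, limit: int = 100) -> List[Item]:
    """Hypothetical stand-in for one search request.

    Assumes strict gt/lt filtering and newest-first ordering.
    """
    matches = [item for item in db if after < item[1] < before]
    matches.sort(key=lambda item: item[1], reverse=True)
    return matches[:limit]


def fetch_all(db: List[Item], after: int, before: int, limit: int = 100) -> List[Item]:
    """Hypothetical time-slicing loop: the oldest timestamp seen so far
    becomes the `before` value of the next request."""
    collected: List[Item] = []
    while True:
        batch = query(db, after, before, limit)
        if not batch:
            break
        collected.extend(batch)
        before = batch[-1][1]  # oldest item in this batch
    return collected


# 150 items share the timestamp 1000; 20 older items have distinct timestamps.
db = [(f"dup{i}", 1000) for i in range(150)]
db += [(f"old{i}", 900 + i) for i in range(20)]

available = len(query(db, after=0, before=2000, limit=len(db)))
found = fetch_all(db, after=0, before=2000)
missing = {item[0] for item in db} - {item[0] for item in found}

print(f"{available} available, {len(found)} found, {len(missing)} missing")
```

In this simulation the first request returns 100 of the 150 items at timestamp 1000, the next request uses `before=1000`, and the remaining 50 items at that timestamp are never matched again. Whether Pushshift behaves the same way depends on the assumptions above holding for the real API.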