mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

Random number of Enhanced Items #30

Closed TheConfax closed 2 years ago

TheConfax commented 2 years ago

With 1873 results available in Pushshift, I get a variable enrichment count of between 30 and 200 items. Does that mean the other items were identical to the Pushshift data, or that they were not checked?

Code below

#Imports
import praw
from pmaw import PushshiftAPI
from datetime import datetime
from datetime import timezone
import pandas as pd
reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    password="PASSWORD",
    user_agent="USER AGENT",
    username="USERNAME")
api = PushshiftAPI(num_workers=20,praw=reddit)

#Set Query Parameters
datewindow_start=datetime(2021,1,1,0,0,0,tzinfo=timezone.utc)
datewindow_end=datetime(2021,1,1,0,0,0,tzinfo=timezone.utc)
query="bitcoin"

#Run Query (running only on 01/01/21 as a test)
all_dates=pd.date_range(datewindow_start,datewindow_end,freq="d")
posts=[]
for date in all_dates:
    start_date=int(date.timestamp()-1)
    end_date=int(date.timestamp()+86400)
    #Collect each day's results instead of overwriting the previous day's
    posts.extend(api.search_submissions(q=query,after=start_date,before=end_date))

#Dataframing
submissions_df = pd.DataFrame(posts)
df_2 = submissions_df[["author","subreddit","num_comments","upvote_ratio","score","title","likes","id","url"]]
df_2.head(30)
mattpodolak commented 2 years ago

By random enrichment do you mean the number of items being reported as enriched ("Finishing enrichment for x items")?

When you store the values in a dataframe are all the values enriched?

PRAW enrichment occurs on multiple threads (during idle time) while items are being retrieved from Pushshift. Once all items have been retrieved from Pushshift, enrichment may or may not be complete. If some items still remain to be enriched, PMAW prints something like "Finishing enrichment for x items". The number remaining varies from run to run because of the variable performance of the APIs.
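To illustrate why the reported number varies, here is a minimal, deterministic sketch. It is not PMAW's actual scheduler: the function name and the per-tick fetch and enrich rates are hypothetical, chosen only to show that the backlog left when retrieval finishes depends on the relative speed of the two APIs.

```python
def backlog_when_fetch_finishes(total_items, fetch_per_tick, enrich_per_tick):
    """Simulate fetching and enriching side by side; return how many items
    are still unenriched at the moment the last item has been fetched.

    Hypothetical model for illustration only, not PMAW's real implementation.
    """
    fetched = 0
    enriched = 0
    while fetched < total_items:
        # Retrieve one batch from the (simulated) Pushshift API.
        fetched = min(total_items, fetched + fetch_per_tick)
        # Enrich as many already-fetched items as this tick allows.
        enriched = min(fetched, enriched + enrich_per_tick)
    return fetched - enriched


# Same 1873 results, different relative API speeds, different backlog.
for enrich_rate in (90, 98, 100):
    remaining = backlog_when_fetch_finishes(1873, 100, enrich_rate)
    print(f"Finishing enrichment for {remaining} items")
```

In every case all 1873 items end up enriched; only the count still pending when retrieval ends changes.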

TheConfax commented 2 years ago

Ok, totally my fault. I thought that "finishing enrichment for x" meant that only x items were enriched. It is all good then?

While I was waiting for an answer I ran tests to compare PMAW enrichment against classic PSAW/PRAW data retrieval, and the maximum difference in comments and score seems to be ±7. The data is from January 1st 2021, so I don't think it is still being voted on often. What causes this difference?

mattpodolak commented 2 years ago

No problem, I'll close this issue then. Check that the number of items in your dataframe is what you were expecting: when using PRAW enrichment, PMAW will not return unenriched items.
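The count check suggested above could look something like the following sketch. The helper name is hypothetical, and it assumes each returned item is a dict with an "id" key; the sample data is made up and stands in for the real search results.

```python
def check_result_count(items, expected_total):
    """Compare the number of unique items returned with the expected total.

    `items` is any iterable of dicts with an "id" key (an assumption about
    the result shape); `expected_total` is the count you expected to get.
    """
    unique_ids = {item["id"] for item in items}
    return {
        "returned": len(unique_ids),
        "expected": expected_total,
        "missing": max(0, expected_total - len(unique_ids)),
    }


# Made-up sample data standing in for the enriched search results.
sample = [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]
print(check_result_count(sample, expected_total=5))
```

A non-zero "missing" value would suggest that some items were dropped rather than returned unenriched.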

The difference in score comes from how the metadata is retrieved. From my understanding, Pushshift stores the metadata when the item is created and updates it 24h after creation. It may check a couple more times after that, maybe at around a week and at 3 months, but I can't remember the exact timing. PRAW requests the latest metadata directly from Reddit via the API.
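One way to quantify that gap is to compare the same field across both sources, keyed by item id. This is only a sketch: the function name is hypothetical and the two dictionaries hold made-up values, not real Pushshift or Reddit data.

```python
def max_field_difference(source_a, source_b, field):
    """Return (largest absolute difference, item id) for `field`,
    looking only at ids present in both sources."""
    shared_ids = sorted(source_a.keys() & source_b.keys())
    if not shared_ids:
        return 0, None
    worst_id = max(
        shared_ids,
        key=lambda item_id: abs(source_a[item_id][field] - source_b[item_id][field]),
    )
    return abs(source_a[worst_id][field] - source_b[worst_id][field]), worst_id


# Made-up snapshots: an older archived copy and the latest values.
archived = {
    "a1": {"score": 10, "num_comments": 3},
    "b2": {"score": 42, "num_comments": 8},
}
latest = {
    "a1": {"score": 17, "num_comments": 3},
    "b2": {"score": 40, "num_comments": 9},
    "c3": {"score": 5, "num_comments": 0},
}

print(max_field_difference(archived, latest, "score"))
print(max_field_difference(archived, latest, "num_comments"))
```

Ids that appear in only one source (such as "c3" here) are ignored, since there is nothing to compare them against.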