mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License
211 stars 28 forks

signal only works in main thread #48

Open rosendyakov opened 1 year ago

rosendyakov commented 1 year ago

Hello, I'm currently developing a very simple Flask app that runs only locally, and I wanted to scrape some Reddit posts using your API. I followed the example as specified in the documentation; however, whenever I run my script, I get the following error:

ValueError: signal only works in main thread

I read that the Flask-SocketIO package causes this, but I saw that this project uses Websocket-client, which is a different package.

Would really appreciate your input.

mattpodolak commented 1 year ago

hey @rosendyakov can you provide the following info, and the minimum amount of code needed to re-create the issue? This will help me as I look into this further:

python version:
flask version:
pmaw version:

SeifReda30 commented 1 year ago

I have the same issue when deploying a web application that uses pmaw:

  File "/app/adam-radar/Python-Scripts/User Specified Scripts/Discussion Platforms/Reddit/reddit_submissions_by_keywords.py", line 111, in reddit_submissions
    api_request_generator = api.search_submissions(q=keyword,after=start_time,before=end_time)
  File "/home/appuser/venv/lib/python3.9/site-packages/pmaw/PushshiftAPI.py", line 77, in search_submissions
    return self._search(kind="submission", **kwargs)
  File "/home/appuser/venv/lib/python3.9/site-packages/pmaw/PushshiftAPIBase.py", line 304, in _search
    self.req.check_sigs()
  File "/home/appuser/venv/lib/python3.9/site-packages/pmaw/Request.py", line 110, in check_sigs
    signal.signal(getattr(signal, "SIG" + sig), self._exit)
  File "/usr/local/lib/python3.9/signal.py", line 56, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))

mike-mo commented 1 year ago

I am hitting the same symptom, though my setup is a little bit more involved (Azure Durable Functions), so in order to make up for the added complexity I published my repro to https://github.com/mike-mo/azure-durable-pmaw

Python version: 3.10.10
Azure Functions Core Tools Version: 4.0.5030
Azure Functions Runtime Version: 4.15.2.20177
pmaw version: 3.0.0

The same Python and pmaw versions work fine when running a basic script that fetches the information. It must be something to do with how threading is handled by these frameworks.
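For context, the error can be reproduced without pmaw or any web framework. The sketch below is only an illustration, not code from this project: it calls `signal.signal` from a worker thread, which CPython rejects with a `ValueError`. The helper name `try_register` and the choice of `SIGTERM` are assumptions made for the example. Frameworks that run request or task handlers in worker threads would presumably hit the same restriction.

```python
import signal
import threading


def try_register(results):
    """Try to register a SIGTERM handler and record what happened."""
    try:
        # CPython only allows signal.signal() to be called from the main thread.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        results.append("registered")
    except ValueError as exc:
        results.append(f"ValueError: {exc}")


results = []
# Run the registration attempt in a worker thread instead of the main thread.
worker = threading.Thread(target=try_register, args=(results,))
worker.start()
worker.join()
print(results[0])
```

The exact wording of the message differs between Python versions, which would explain why this thread shows both "signal only works in main thread" and the longer "of the main interpreter" variant.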

CryptoRahino commented 1 year ago

Having the same issue here. I'm using `multiprocessing.pool.ThreadPool` to call the API function:

```python
from multiprocessing.pool import ThreadPool
from pathlib import Path

from pandas import DataFrame, concat


def run_download(subreddits: list, start_date: int, end_date: int, additional_args: dict,
                 working_dir: Path = None) -> DataFrame:
    logger.info(f"Starting Download from Reddit using subreddits {subreddits}")
    all_df = DataFrame()

    with ThreadPool() as pool:
        query = {'start_date': start_date,
                 'end_date': end_date,
                 "working_dir": working_dir,
                 **additional_args}
        # Each call below runs in a pool worker thread, not in the main thread.
        lst_df = pool.starmap(_get_subreddit, [(start_date, end_date, subreddit) for subreddit in subreddits])

        for df in lst_df:
            if df.empty:
                continue
            all_df = concat([df, all_df], axis=0)


def _get_subreddit(self, start_time: int, end_time: int, subreddit_name=None, **kwargs) -> DataFrame:
    params = self.params.copy()
    params.update(kwargs)
    subreddit_name = subreddit_name or self.subreddit_name
    df = DataFrame(
        self.api.search_submissions(subreddit=subreddit_name, since=start_time, until=end_time, **params))
    return df.drop_duplicates('id')
```

I have tried this with an async function, but that didn't help either:

ValueError: signal only works in main thread of the main interpreter
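One general pattern for avoiding this error, sketched here as an assumption and not as pmaw's actual behavior or a confirmed fix, is to register signal handlers only when the code is running on the main thread and to skip registration otherwise. The function name `register_if_main_thread`, the no-op handler, and the signal names `INT` and `TERM` are all hypothetical; the `getattr(signal, "SIG" + name)` lookup just mirrors the line shown in the traceback earlier in this thread.

```python
import signal
import threading


def register_if_main_thread(handler, names=("INT", "TERM")):
    """Register `handler` for the given signals only when on the main thread.

    Returns True if the handlers were installed, False if registration was skipped.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() would raise ValueError here, so skip registration.
        return False
    for name in names:
        signal.signal(getattr(signal, "SIG" + name), handler)
    return True


def _noop_handler(signum, frame):
    """Placeholder handler used only for this illustration."""


outcome = []
# Simulate a framework or pool worker calling the function off the main thread.
worker = threading.Thread(target=lambda: outcome.append(register_if_main_thread(_noop_handler)))
worker.start()
worker.join()
print(outcome)  # [False]: registration was skipped in the worker thread
```

The trade-off is that, when registration is skipped, whatever cleanup the handler would have performed on interrupt no longer runs for calls made from worker threads.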

simoninnyc commented 2 months ago

Did you manage to solve this? I'm running into a similar issue using an Azure Durable Function with Scrapy and signal.