Slow running time for scraping all subs for an author

ReichYang commented 2 years ago

Hi. I'm trying to use pmaw to iterate a list of usernames and scrape all their submissions and comments.

It turns out to be very slow.

Using psaw:

gen = api2.search_submissions(author='seeellayewhy')
list(gen)

It took 7s to scrape 300+ submissions.

Using pmaw, it took several minutes to finish:

from pmaw import PushshiftAPI

api = PushshiftAPI(num_workers=40)

submissions = api.search_submissions(author='seeellayewhy', limit=None)
submission_list = [sub for sub in submissions]

WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:325 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 40 - Batches: 1 - Items Remaining: 325
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 80 - Batches: 2 - Items Remaining: 323
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 120 - Batches: 3 - Items Remaining: 318
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 161 - Batches: 5 - Items Remaining: 300
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:1 result(s) not found in Pushshift
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 201 - Batches: 6 - Items Remaining: 271
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 241 - Batches: 7 - Items Remaining: 267
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 281 - Batches: 8 - Items Remaining: 252
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 321 - Batches: 9 - Items Remaining: 238
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 361 - Batches: 10 - Items Remaining: 221
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 361 - Batches: 10 - Items Remaining: 221
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 400 - Batches: 11 - Items Remaining: 0

Am I doing something wrong, or is pmaw not recommended for this kind of task? From the logging, it looks like each request retrieves only a few records. Does this have to do with the search window? If so, how should I configure it for this kind of task?

mattpodolak commented 2 years ago

Hi @ReichYang, can you try running again with the default number of workers (10)? For a small number of results, using more workers can slow the completion time for several reasons. Aside from the overhead of multithreading, fewer submissions are retrieved in each request, and there is more competition between workers, causing requests to fail.

If you are expecting <1000 results for a query, using fewer workers (maybe 2-5) may yield faster completion times. You can also refer to the benchmarks to see how pmaw performance improves as the number of results increases.
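As a rough sketch of the advice above, the worker count could be chosen from an estimate of the result count. The function name pick_num_workers, the exact cutoff handling, and the choice of 3 workers are illustrative assumptions, not part of pmaw; only the default of 10 and the "2-5 workers for <1000 results" range come from this thread.

```python
from typing import Optional

DEFAULT_WORKERS = 10       # pmaw default mentioned in this thread
SMALL_QUERY_WORKERS = 3    # assumed value inside the suggested 2-5 range
SMALL_QUERY_CUTOFF = 1000  # "<1000 results" threshold from this thread


def pick_num_workers(expected_results: Optional[int]) -> int:
    """Return a worker count for a query (hypothetical helper).

    Falls back to the default when the result count is unknown.
    """
    if expected_results is None:
        return DEFAULT_WORKERS
    if expected_results < SMALL_QUERY_CUTOFF:
        return SMALL_QUERY_WORKERS
    return DEFAULT_WORKERS
```

The returned value would then be passed as num_workers when constructing PushshiftAPI.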

ReichYang commented 2 years ago

@mattpodolak Hi. Please see my screenshots. It is still considerably slower than querying the pushshift.io API directly or using psaw.

Using default number of workers.

Using 2 workers.

Using only one worker.

Since I don't know the number of results for each author in advance, is there a way to adjust the number of workers dynamically? I expect most authors to have only a few results, but some active users may post a lot on Reddit.

mattpodolak commented 2 years ago

You might want to have some code to check the number of results available and then use pmaw or psaw accordingly.
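One possible shape for that check, as a minimal sketch: look up the count first, then dispatch to whichever client suits the size. Every name here (scrape_author, count_fn, small_fetch, large_fetch) is hypothetical, and the threshold of 1000 is borrowed from the earlier comment. How count_fn obtains the number is left open, since pmaw does not expose it directly.

```python
from typing import Callable, Iterable, List


def scrape_author(
    author: str,
    count_fn: Callable[[str], int],
    small_fetch: Callable[[str], Iterable[dict]],
    large_fetch: Callable[[str], Iterable[dict]],
    threshold: int = 1000,
) -> List[dict]:
    """Fetch all submissions for one author (hypothetical helper).

    count_fn    -- returns the number of results available for the author
    small_fetch -- used below the threshold (e.g. a psaw-based fetch)
    large_fetch -- used at or above the threshold (e.g. a pmaw-based fetch)
    """
    total = count_fn(author)
    fetch = small_fetch if total < threshold else large_fetch
    return list(fetch(author))
```

In practice small_fetch and large_fetch would wrap the psaw and pmaw search_submissions calls shown earlier in this thread.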

If you want to continue tuning the parameters to improve the performance of pmaw, increasing the search_window will help if the user submissions are distributed over many years.
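A minimal sketch of passing a larger search_window, assuming api is a pmaw PushshiftAPI instance and that search_window is given in days. The wrapper name fetch_author_submissions and the 3650-day value are illustrative guesses, not tuned recommendations.

```python
from typing import Any, List


def fetch_author_submissions(
    api: Any, author: str, search_window_days: int = 3650
) -> List[dict]:
    """Collect every submission by an author using a wide search window.

    api is expected to behave like pmaw's PushshiftAPI; the default
    window of roughly ten years is an assumption for illustration.
    """
    results = api.search_submissions(
        author=author,
        limit=None,
        search_window=search_window_days,
    )
    return list(results)
```

A wider window means each request covers a longer time span, which should help when an author's posts are sparse and spread over many years.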

ReichYang commented 2 years ago

@mattpodolak Do we have any functions in PMAW to view the number of results before starting the download?

mattpodolak commented 2 years ago

@ReichYang We do not, unfortunately. You may be able to extract the total number of available results from the PMAW log.
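A sketch of that log-based approach, assuming the logger name pmaw.PushshiftAPIBase and the message format "N result(s) available in Pushshift" seen in the log output earlier in this thread. ResultCountHandler is a hypothetical helper, not part of pmaw.

```python
import logging
import re

# Message format taken from the log output posted in this thread.
RESULT_COUNT = re.compile(r"(\d+) result\(s\) available in Pushshift")


class ResultCountHandler(logging.Handler):
    """Remembers the most recent 'N result(s) available' count logged."""

    def __init__(self) -> None:
        super().__init__(level=logging.INFO)
        self.total = None  # stays None until a matching message is seen

    def emit(self, record: logging.LogRecord) -> None:
        match = RESULT_COUNT.search(record.getMessage())
        if match:
            self.total = int(match.group(1))


# Attach the handler to the logger name that appears in the pmaw output.
pmaw_logger = logging.getLogger("pmaw.PushshiftAPIBase")
pmaw_logger.setLevel(logging.INFO)
counter = ResultCountHandler()
pmaw_logger.addHandler(counter)
```

Note that the count only shows up once a search has started, so counter.total is available early in a run rather than before it.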

mattpodolak / pmaw

Slow running time for scraping all subs for an author #29