Closed ReichYang closed 2 years ago
Hi @ReichYang, can you try running again with the default number of workers (10)? For a small number of results, using more workers can actually slow completion for multiple reasons: aside from the overhead of multithreading, fewer submissions are retrieved in each request, and there is more competition between workers, causing requests to fail.
If you are expecting <1000 results for a query, using fewer workers (maybe 2-5) may yield faster completion times. You can also refer to the benchmarks to see how pmaw performance improves as the number of results increases.
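A rough sketch of that guidance as code. The thresholds and the `pick_num_workers` helper are illustrative (not part of pmaw), loosely following the 2-5 workers for <1000 results suggestion above; `num_workers` is the pmaw `PushshiftAPI` constructor parameter.

```python
# Hypothetical helper: pick a worker count from the number of results
# you expect a query to return. The thresholds are illustrative only,
# following the rough guidance in this thread (few workers for small
# queries, the default 10 for large ones).
def pick_num_workers(expected_results: int, default: int = 10) -> int:
    if expected_results < 100:
        return 1
    if expected_results < 1000:
        return 3
    return default

# It could then be passed to pmaw, e.g.:
#   from pmaw import PushshiftAPI
#   api = PushshiftAPI(num_workers=pick_num_workers(500))
print(pick_num_workers(500))  # 3
```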
@mattpodolak Hi. Please see my screenshots. It is still considerably slower than just requesting pushshift.io API or using psaw.
Using default number of workers.
Using 2 workers.
Using only one worker.
Since I don't know the number of results for each author, is there a way to dynamically adjust the number of workers? I guess most will have only a few results, but there might be some active users who post a lot on Reddit.
You might want to have some code to check the number of results available and then use pmaw or psaw accordingly.
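A sketch of that "check first, then choose a client" idea. Pushshift's search endpoints historically accepted a `metadata=true` parameter and reported a `total_results` count; the response shape assumed below is based on that historical behavior, and the `choose_client` routing logic is a hypothetical example, not part of pmaw or psaw.

```python
# Parse a Pushshift metadata response (shape assumed from the
# historical metadata=true behavior) and route small jobs to a
# lightweight client, large jobs to pmaw's multithreaded client.
def total_results(metadata_response: dict) -> int:
    """Extract the total result count from a Pushshift metadata response."""
    return metadata_response.get("metadata", {}).get("total_results", 0)

def choose_client(metadata_response: dict, threshold: int = 1000) -> str:
    """Return which client to use for this author: 'psaw' or 'pmaw'."""
    return "pmaw" if total_results(metadata_response) >= threshold else "psaw"

# Example with a fabricated response for a low-volume author:
resp = {"metadata": {"total_results": 342}}
print(choose_client(resp))  # "psaw"
```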
If you want to continue tuning pmaw's parameters to improve performance, increasing the search_window will help if the user's submissions are distributed over many years.
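A back-of-the-envelope illustration of why search_window matters: pmaw splits the query's time range into windows, and each window costs at least one request, so a small window over a long, sparse posting history means many near-empty requests. The helper below is illustrative arithmetic, not pmaw internals, and assumes the window is expressed in days as in pmaw's documentation.

```python
import math

# Minimum number of time slices (and hence requests) needed to cover
# a span of history, given a search window size in days. Illustrative
# only; pmaw's actual request scheduling is more involved.
def min_requests(span_days: int, search_window_days: int) -> int:
    return math.ceil(span_days / search_window_days)

# Ten years of history: a ~1-year window vs one window covering it all.
print(min_requests(3650, 365))   # 10
print(min_requests(3650, 3650))  # 1
```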
@mattpodolak Do we have any functions in PMAW to view the number of results before starting the download?
@ReichYang we do not, unfortunately. You may be able to extract the total results available from what is printed in the PMAW log.
Hi. I'm trying to use pmaw to iterate a list of usernames and scrape all their submissions and comments.
It turns out to be very slow.
Using psaw:
It took 7s to scrape 300+ submissions.
Using pmaw, it took several minutes to finish it:
Is there something I'm not doing right, or is this kind of task not recommended with pmaw? From the logging, it seems like each request is returning only a few records. Does this have to do with search windows? If so, how should I configure them for this kind of task?