johndpjr / AgTern


Scale scraper to multiple threads #133

Open · johndpjr opened 1 year ago

johndpjr commented 1 year ago

Context

Right now, our scraper's speed is heavily limited since we are only using one thread! Allowing the scraper to scale to N threads would dramatically increase throughput.

TODO

Notes

Be careful of race conditions and other concurrency pitfalls here! Multithreading adds a lot of complexity and bugs that are easy to overlook at first. Another idea for improving speed: let a scraper work on other sites while it waits for the crawl delay to expire on a given company's site (i.e. asynchronous requests).
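A minimal sketch of that asynchronous idea, assuming asyncio plus the third-party aiohttp library (not currently a project dependency). The site URLs, page paths, and crawl delays below are hypothetical placeholders:

```python
import asyncio

import aiohttp

# Hypothetical example data: each company site has its own crawl delay.
SITES = {
    "https://example-company-a.com": {
        "delay": 10.0,
        "pages": ["/careers?page=1", "/careers?page=2"],
    },
    "https://example-company-b.com": {
        "delay": 5.0,
        "pages": ["/jobs/1", "/jobs/2"],
    },
}


async def scrape_site(session: aiohttp.ClientSession, base_url: str, site: dict) -> None:
    """Fetch one site's pages sequentially, honoring its crawl delay."""
    for page in site["pages"]:
        async with session.get(base_url + page) as response:
            html = await response.text()
            print(f"{base_url}{page}: {len(html)} bytes")
        # While this coroutine sleeps out the crawl delay, the event loop
        # keeps scraping the other sites instead of sitting idle.
        await asyncio.sleep(site["delay"])


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # One task per company site; the per-site crawl delays overlap
        # instead of stacking up sequentially.
        await asyncio.gather(
            *(scrape_site(session, url, site) for url, site in SITES.items())
        )


asyncio.run(main())
```

The nice part of this approach is that it needs no threads at all: one event loop interleaves the waits, so race conditions on shared Python state mostly disappear.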

JeremyEastham commented 1 year ago

Another note: I think the current logging system still isn't thread-safe. When I tried to run the GUI and the scraper at the same time, the logs from one process were either lost or combined with the logs from the other process into a garbled mess. Child processes should probably communicate with the parent process through an inter-thread/process logging queue, and logs should include which process sent them (scraper, web server, db, etc.). The database should work well with multithreading/multiprocessing by default; if any files are modified, though, that could be an issue.
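The standard library's QueueHandler/QueueListener pattern could cover the logging-queue idea. A rough sketch, where the process names and log message are placeholders:

```python
import logging
import logging.handlers
import multiprocessing


def worker(queue):
    """Runs in a child process: forward every log record to the parent."""
    root = logging.getLogger()
    root.handlers = [logging.handlers.QueueHandler(queue)]
    root.setLevel(logging.INFO)
    logging.info("started")


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    # The parent owns the single real handler, so records from different
    # processes are written one at a time instead of interleaving.
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(processName)s | %(levelname)s | %(message)s")
    )
    listener = logging.handlers.QueueListener(queue, handler)
    listener.start()

    procs = [
        multiprocessing.Process(target=worker, args=(queue,), name=name)
        for name in ("scraper", "web-server", "db")
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()
```

The `%(processName)s` field in the formatter is what tags each line with its sender, which covers the "logs should include what process sent them" requirement.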

Also, the Python APIs for multiprocessing and multithreading are nearly identical, but multiprocessing is more performant for CPU-bound work since each process avoids contending for the GIL. However, multiprocessing uses more memory because each process has a separate Python instance. Each thread/process will also need its own Chrome, a notorious memory hog. There may be a way for each process to drive a different tab or window in the same Chrome instance, but coordinating those may still use a similar amount of memory while requiring much more process coordination. Our current server will probably need more RAM to run multiple scrapers. Again, processes can be swapped for threads easily later if we decide to.
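On the "easy swap" point: the standard library's concurrent.futures already gives threads and processes an identical interface, so something like the sketch below (where scrape_company and the company list are hypothetical stand-ins) can switch between them by changing one class name:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def scrape_company(company: str) -> str:
    """Stand-in for the real per-company scrape (hypothetical)."""
    return f"scraped {company}"


if __name__ == "__main__":
    companies = ["CompanyA", "CompanyB", "CompanyC"]
    # ProcessPoolExecutor and ThreadPoolExecutor share the same API, so
    # trading processes for threads later is a one-word change here.
    with ProcessPoolExecutor(max_workers=3) as pool:
        for result in pool.map(scrape_company, companies):
            print(result)
```

Starting with the executor API would keep that decision reversible while we measure the actual memory cost of multiple Chrome instances.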