johndpjr opened 1 year ago
Another note: I think the current logging system still isn't thread-safe. When I tried to run the GUI and the scraper at the same time, the logs from one process were either lost or interleaved with the logs from the other process into a garbled mess. Child processes should probably communicate with the parent process through an inter-thread/process logging queue, and each log should record which process sent it (scraper, web server, db, etc.). The database should handle multithreading/multiprocessing well by default, but if any files are modified, that could be an issue.
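A minimal sketch of what that logging queue could look like, using the stdlib's `QueueHandler`/`QueueListener`. The function names here are hypothetical, not from our codebase; the idea is that children only enqueue records, and the parent owns the one real handler:

```python
import logging
import logging.handlers
import multiprocessing


def configure_worker_logging(queue, name):
    """Call inside each child process: route all log records to the parent's queue.

    `name` identifies the sender (e.g. "scraper", "web server", "db"), so the
    parent can tell which process each record came from.
    """
    logger = logging.getLogger(name)
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.setLevel(logging.INFO)
    return logger


def start_listener(queue):
    """Call once in the parent: drain the queue into a single real handler.

    Because only this one process writes to the output, records from different
    children can no longer garble each other.
    """
    console = logging.StreamHandler()
    # %(name)s is the sender name passed to configure_worker_logging above.
    console.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    listener = logging.handlers.QueueListener(queue, console)
    listener.start()
    return listener  # caller should eventually run listener.stop()
```

Each child would call `configure_worker_logging(queue, "scraper")` (etc.) right after it starts, with the `multiprocessing.Queue` passed in from the parent.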
Also, the Python APIs for multiprocessing and multithreading are nearly identical, but multiprocessing is more performant. However, multiprocessing uses more memory because each process runs a separate Python instance. Each thread/process will also need its own Chrome, a notorious memory hog. There may be a way to give each process a different tab or window within a single Chrome instance, but that may still use a similar amount of memory while requiring much more coordination between processes. Our current server will probably need more RAM to run multiple scrapers. Again, processes can easily be swapped for threads later if we decide to.
Context
Right now, our scraper's speed is heavily limited since we are only using one! Allowing the scraper to scale to N threads will dramatically increase performance.
TODO
- Add a `-n` CLI option that is an integer representing the number of threads that the scraper scales to (scraping `c` companies will spawn `n` threads and assign them `c / n` companies each to divide the work evenly).

Notes
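A rough sketch of that even split, assuming a per-company scraping function (`scrape_company` below is a placeholder for whatever our real entry point ends up being):

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(companies, n):
    """Split the company list into n roughly equal chunks (about c / n each)."""
    return [companies[i::n] for i in range(n)]


def scrape_all(companies, n, scrape_company):
    """Spawn n threads, each scraping its own chunk of companies."""
    def scrape_chunk(ch):
        return [scrape_company(c) for c in ch]

    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(scrape_chunk, ch) for ch in chunk(companies, n)]
        results = []
        for f in futures:
            results.extend(f.result())  # re-raises any exception from a worker
    return results
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` is a one-line change, which is what makes the threads-vs-processes decision easy to revisit later.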
Be careful of race conditions and whatnot here! Multithreading adds a lot of complexity and bugs that are easy to overlook at first. Another idea for improving speed: have a scraper scrape other sites while it waits for the crawl delay to expire on a different company's site (i.e. asynchronous requests).
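The crawl-delay idea could be sketched with asyncio like this. This is an illustrative toy, not a proposed implementation: `fetch` stands in for the real request code, and the class name is made up. The point is that while one site's delay is ticking down, tasks for other sites keep running:

```python
import asyncio
import time


class PoliteScheduler:
    """Serialize requests per site and enforce a crawl delay between them,
    without blocking requests to other sites."""

    def __init__(self, crawl_delay):
        self.crawl_delay = crawl_delay
        self.next_allowed = {}  # site -> earliest monotonic time we may hit it again
        self.locks = {}         # site -> lock serializing requests to that site

    async def fetch(self, site, page):
        lock = self.locks.setdefault(site, asyncio.Lock())
        async with lock:
            wait = self.next_allowed.get(site, 0) - time.monotonic()
            if wait > 0:
                # Only this site's tasks wait; the event loop runs other sites.
                await asyncio.sleep(wait)
            self.next_allowed[site] = time.monotonic() + self.crawl_delay
            return (site, page)  # real code would perform the HTTP request here
```

Usage would be something like `await asyncio.gather(*(scheduler.fetch(site, p) for site, p in work))`, letting one event loop interleave many sites' crawl delays.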