vringar closed this PR 1 year ago.
Attention: 57 lines in your changes are missing coverage. Please review.
Comparison is base (761e46d) 46.20% compared to head (5909090) 45.08%.
Hey @gridl0ck, When checking out the code locally I noticed a couple of things that made me reconsider some of the choices you had made.
- The observer from the watchdog was not being used. It was only created and stopped.
- Starting a thread and immediately rejoining it doesn't allow for any concurrency or parallelism, so this work can just as well be done in the main process, in the execute_command_sequence thread.
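To illustrate the second point, here is a minimal, self-contained sketch (the `do_check` function is a hypothetical stand-in for the watchdog's work, not OpenWPM code): starting a thread and immediately joining it blocks the caller for the whole duration, so it buys no concurrency over a plain function call.

```python
import threading
import time


def do_check() -> None:
    """Hypothetical stand-in for the watchdog's size check."""
    time.sleep(0.1)


# Starting a thread and immediately joining it...
start = time.monotonic()
t = threading.Thread(target=do_check)
t.start()
t.join()  # the caller blocks here until do_check() returns
threaded = time.monotonic() - start

# ...takes just as long as calling the function directly,
# because nothing else runs while we wait on the join.
start = time.monotonic()
do_check()
direct = time.monotonic() - start
```

Both paths block for the full duration of `do_check()`, which is why the check can simply live in the execute_command_sequence thread.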
I also accidentally pushed these changes directly to master and then had to force push over them, since I wasn't sure that all tests were passing. This is why your PR got closed.
Oh @vringar I completely missed that. That is most definitely left over from an early design I had intended to use but I have since moved away from using it. Do you need me to remove it and push those changes?
@gridl0ck I hit send too early and still need to update the previous message with the rest of my feedback. I was hoping I'd be able to fix my mistake before you saw it :sweat_smile:
> Do you need me to remove it and push those changes?
As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.
Dang ok. Let me know what, if anything, I need to do to get this added because I do think it is a helpful addition.
My primary question right now is: What made you decide to force the checks after every command sequence?
The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.
Is this scenario unacceptable to you?

- Profile < Max_Size
- CS1 runs to completion
- Profile > Max_Size, but the Watchdog hasn't noticed that yet
- CS2 starts running
- Watchdog notices and sets reset=True
- CS2 completes
- Browser gets restarted
Please note that I'm not disagreeing with doing the checks synchronously after the CS. I might even pull out the memory_watchdog check to the same location, because it makes it easier to reason about what can cause a browser to reset. I'm just genuinely curious.
> My primary question right now is: What made you decide to force the checks after every command sequence?
When I created this for my capstone, the amount of data generated by each crawl varied per website, so I needed to check the size of the folder. As to why it's at the end of the CS: I ran into problems with the StorageController not saving the data to the database before the watchdog got to it (or that's how I interpreted the problem at the time).
> The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.
>
> Is this scenario unacceptable to you?
>
> - Profile < Max_Size
> - CS1 runs to completion
> - Profile > Max_Size, but the Watchdog hasn't noticed that yet
> - CS2 starts running
> - Watchdog notices and sets reset=True
> - CS2 completes
> - Browser gets restarted
This goes back to what I was saying earlier about the StorageController not saving the data. I did have this running asynchronously at first (trying to queue up restarts for when it's most convenient), but I didn't know how to communicate that internally. Running the check after each CS ensured that the data from each CS was stored, and then, if the resulting crawl pushed the profile directory over the threshold, the browser would be restarted.
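The flag-then-check flow being discussed here can be sketched roughly as follows. Every name in this snippet (`BrowserHandle`, `run_command_sequence`, `restart_required`) is a hypothetical stand-in for illustration, not OpenWPM's actual API: a watchdog may flip the flag at any time, but the restart only happens at the command-sequence boundary.

```python
class BrowserHandle:
    """Hypothetical stand-in for OpenWPM's BrowserManagerHandle."""

    def __init__(self) -> None:
        self.restart_required = False  # flag a watchdog may set at any time
        self.restarts = 0

    def restart(self) -> None:
        self.restarts += 1
        self.restart_required = False


def run_command_sequence(browser: BrowserHandle, commands) -> None:
    for command in commands:
        command()
    # The check happens only after the whole sequence has finished,
    # so a flag set mid-sequence takes effect at this boundary,
    # after the sequence's data has been stored.
    if browser.restart_required:
        browser.restart()


browser = BrowserHandle()


def visit() -> None:
    # Stand-in command that happens to trip the flag mid-sequence.
    browser.restart_required = True


run_command_sequence(browser, [visit, visit])
```

Even though the flag is set during the first command, the browser is restarted exactly once, after the sequence completes, matching the scenario described above.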
I originally had a function that would go in and simply wipe all non-essential files (it wouldn't touch configuration files or anything, but it was a very hacky way of cleaning that ended up slowing everything down as time went on). Then I realized that having the browser restart once a threshold was reached cleared those files and did the necessary setup for each browser, because you built that functionality in.
The StorageWatchdog essentially just monitors the size of the browser_profile directories in each of the BrowserManager threads and uses your built-in reset functionality in moderation. Before, when you set the reset flag, you would get a browser restart after every CS, which slowed down our crawls; part of our project was crawling a certain number of websites in a timely manner, so this was inconvenient. With the StorageWatchdog, you can let the crawls run with little impact on speed, because the browsers aren't being reset after each CS, but you can also work with limited space.
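The core of such a size-based check can be sketched like this. The threshold, function names, and file names below are all made up for the demo; they are not the StorageWatchdog's actual configuration or layout.

```python
import tempfile
from pathlib import Path

# Hypothetical threshold for the demo; the real watchdog's limit
# and how it is configured may differ.
MAX_PROFILE_SIZE_BYTES = 1 * 1024 * 1024  # 1 MiB


def profile_size(profile_dir: Path) -> int:
    """Total size in bytes of all files under the profile directory."""
    return sum(p.stat().st_size for p in profile_dir.rglob("*") if p.is_file())


def needs_restart(profile_dir: Path) -> bool:
    """True when the profile has outgrown the threshold."""
    return profile_size(profile_dir) > MAX_PROFILE_SIZE_BYTES


# Demo: a small profile stays under the limit...
with tempfile.TemporaryDirectory() as d:
    profile = Path(d)
    (profile / "places.sqlite").write_bytes(b"x" * 1024)
    small = needs_restart(profile)
    # ...until accumulated crawl data pushes it over,
    # at which point the browser reset would be triggered.
    (profile / "cache.bin").write_bytes(b"x" * (2 * 1024 * 1024))
    large = needs_restart(profile)
```

Because the check only reads directory sizes, it is cheap enough to run after every CS without the per-sequence restart cost described above.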
Okay, so I wanted to write tests to ensure this functionality keeps working, but seeing as our other two watchdogs also don't have any tests and I can't think of a good way to test it (as restarts are supposed to be transparent/invisible to the user anyway), I'll just set this to automerge and wait for the tests to pass.
I'll create a new release with this feature in the next couple of days. Thank you for your contribution @gridl0ck !
Original Implementation done by @gridl0ck. With modifications by @vringar.