openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io

Storage watchdog #1056

Closed vringar closed 1 year ago

vringar commented 1 year ago

Original Implementation done by @gridl0ck. With modifications by @vringar.

codecov[bot] commented 1 year ago

Codecov Report

Attention: 57 lines in your changes are missing coverage. Please review.

Comparison is base (761e46d) 46.20% compared to head (5909090) 45.08%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1056      +/-   ##
==========================================
- Coverage   46.20%   45.08%   -1.13%
==========================================
  Files          34       35       +1
  Lines        3398     3476      +78
==========================================
- Hits         1570     1567       -3
- Misses       1828     1909      +81
```

| [Files](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm) | Coverage Δ | |
|---|---|---|
| [openwpm/config.py](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm#diff-b3BlbndwbS9jb25maWcucHk=) | `94.69% <100.00%> (+0.16%)` | :arrow_up: |
| [openwpm/deploy\_browsers/deploy\_firefox.py](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm#diff-b3BlbndwbS9kZXBsb3lfYnJvd3NlcnMvZGVwbG95X2ZpcmVmb3gucHk=) | `24.09% <0.00%> (-0.61%)` | :arrow_down: |
| [openwpm/browser\_manager.py](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm#diff-b3BlbndwbS9icm93c2VyX21hbmFnZXIucHk=) | `48.50% <40.00%> (-1.25%)` | :arrow_down: |
| [openwpm/task\_manager.py](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm#diff-b3BlbndwbS90YXNrX21hbmFnZXIucHk=) | `71.24% <60.00%> (-1.20%)` | :arrow_down: |
| [openwpm/utilities/storage\_watchdog.py](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm#diff-b3BlbndwbS91dGlsaXRpZXMvc3RvcmFnZV93YXRjaGRvZy5weQ==) | `23.43% <23.43%> (ø)` | |

... and [4 files with indirect coverage changes](https://app.codecov.io/gh/openwpm/OpenWPM/pull/1056/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openwpm)


vringar commented 1 year ago

Hey @gridl0ck, when checking out the code locally I noticed a couple of things that made me reconsider some of the choices you had made.

  1. The observer from the watchdog was not being used. It was only created and stopped.

  2. Starting a thread and immediately joining it doesn't allow for more concurrency or parallelism, so this work can just as well be done in the main process in the execute_command_sequence thread.
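To illustrate the second point, here is a minimal sketch (not code from this PR) showing that starting a thread and joining it right away behaves like a plain function call. `check_storage`, `run_in_thread` and `run_inline` are hypothetical names used only for this example.

```python
import threading
import time


def check_storage(results: list) -> None:
    # Hypothetical stand-in for the watchdog's size check.
    time.sleep(0.05)
    results.append("checked")


def run_in_thread() -> list:
    results: list = []
    worker = threading.Thread(target=check_storage, args=(results,))
    worker.start()
    # join() blocks the caller until the worker is done, so nothing
    # else happens in the calling thread while the check runs.
    worker.join()
    return results


def run_inline() -> list:
    results: list = []
    # Same work, same blocking behaviour, no extra thread.
    check_storage(results)
    return results
```

Both functions block the caller for the duration of the check and return the same result, which is why the thread adds nothing here.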

I also accidentally pushed these changes directly to master and then had to force push over them, since I wasn't sure that all tests were passing. This is why your PR got closed.

gridl0ck commented 1 year ago

Oh @vringar I completely missed that. That is most definitely left over from an early design I had intended to use but I have since moved away from using it. Do you need me to remove it and push those changes?

vringar commented 1 year ago

@gridl0ck I hit send too early and still need to update the previous message with the rest of my feedback. I was hoping to be able to fix my mistake before you saw it :sweat_smile:

> Do you need me to remove it and push those changes?

As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.

gridl0ck commented 1 year ago

> As I have rewritten a large part of your original implementation and have a couple of open questions, I'd rather have you as a reviewer than a contributor.

Dang ok. Let me know what, if anything, I need to do to get this added because I do think it is a helpful addition.

vringar commented 1 year ago

My primary question right now is: What made you decide to force the checks after every command sequence?

The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.

Is this scenario unacceptable to you?

  1. Profile < Max_Size
  2. CS1 runs to completion
  3. Profile > Max_Size but the Watchdog hasn't noticed that yet
  4. CS2 starts running
  5. Watchdog notices and sets reset=True
  6. CS2 completes
  7. Browser gets restarted

Please note that I'm not disagreeing with doing the checks synchronously after the CS. I might even pull out the memory_watchdog check to the same location, because it makes it easier to reason about what can cause a browser to reset. I'm just genuinely curious.
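The difference between the two approaches can be sketched with a small deterministic simulation. This is only an illustration under assumed names (`Browser`, `watchdog_check`, `finish_command_sequence`, `MAX_SIZE`, and the growth numbers are all hypothetical), not the OpenWPM implementation.

```python
from dataclasses import dataclass

MAX_SIZE = 100  # hypothetical profile size threshold


@dataclass
class Browser:
    profile_size: int = 0
    restart_required: bool = False


def watchdog_check(browser: Browser) -> None:
    # Sets the flag if the profile has grown past the threshold.
    if browser.profile_size > MAX_SIZE:
        browser.restart_required = True


def finish_command_sequence(browser: Browser, growth: int, synchronous: bool) -> bool:
    """Finish one CS; return True if the browser was restarted afterwards."""
    browser.profile_size += growth
    if synchronous:
        # Synchronous variant: check right after the CS completes.
        watchdog_check(browser)
    if browser.restart_required:
        browser.profile_size = 0
        browser.restart_required = False
        return True
    return False


def simulate(synchronous: bool) -> list:
    """Return the numbers of the command sequences that were followed by a restart."""
    browser = Browser()
    restarted_after = []
    if finish_command_sequence(browser, 150, synchronous):  # CS1 exceeds MAX_SIZE
        restarted_after.append(1)
    if not synchronous:
        # Asynchronous variant: the watchdog only notices while CS2 is running.
        watchdog_check(browser)
    if finish_command_sequence(browser, 10, synchronous):  # CS2
        restarted_after.append(2)
    return restarted_after
```

In this sketch the synchronous check restarts the browser right after CS1, while the asynchronous flag delays the restart until CS2 has completed, matching steps 3 to 7 above.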

gridl0ck commented 1 year ago

> My primary question right now is: What made you decide to force the checks after every command sequence?

When I created this for my capstone, the amount of data generated by each crawl varied per website, so I needed to check the size of the folder. As to why it's at the end of the CS: I ran into problems with the StorageController not saving the data to the database before the watchdog got to it (or that's how I interpreted the problem at the time).

> The memory_watchdog just checks at a random time, sets the flag and then the BrowserManagerHandle checks the flag after a CommandSequence has completed.
>
> Is this scenario unacceptable to you?
>
>   1. Profile < Max_Size
>   2. CS1 runs to completion
>   3. Profile > Max_Size but the Watchdog hasn't noticed that yet
>   4. CS2 starts running
>   5. Watchdog notices and sets reset=True
>   6. CS2 completes
>   7. Browser gets restarted

This goes back to what I was saying earlier about the StorageController not saving the data. I did have this running asynchronously at first (trying to queue up restarts for when it's most convenient), but I didn't know how to communicate that internally. Running the check after each CS ensured that the data from each CS was stored; if the resulting crawl then pushed the profile directory over the threshold, the browser would be restarted.

I originally had a function that would go in and simply wipe all non-essential files. (It wouldn't touch configuration files or anything, but it was a very hacky way of cleaning that ended up slowing everything down as time went on.) I then realized that having the browser restart once a threshold is reached clears those files and does the necessary setup for each browser, because you built that functionality in.

The StorageWatchdog essentially just monitors the size of the browser_profile directories in each of the BrowserManager threads and uses your built-in reset functionality in moderation. Before, when you set the reset flag, you would get a browser restart after every CS, which slowed down our crawls. Part of our project was crawling a certain number of websites in a timely manner, so this was inconvenient. With the StorageWatchdog, you can let the crawls run with little impact on speed because the browsers aren't being reset after each CS, but you can also work with limited space.
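For readers following along, the kind of size check described here could look roughly like the sketch below. It is an assumption-laden illustration, not the code merged in this PR: `profile_size_bytes`, `needs_restart` and `max_bytes` are hypothetical names.

```python
import os
from pathlib import Path


def profile_size_bytes(profile_dir: Path) -> int:
    """Sum the sizes of all files below profile_dir."""
    total = 0
    for root, _dirs, files in os.walk(profile_dir):
        for name in files:
            try:
                total += (Path(root) / name).stat().st_size
            except OSError:
                # The browser may delete a file between listing and stat.
                pass
    return total


def needs_restart(profile_dir: Path, max_bytes: int) -> bool:
    # Hypothetical policy: request a restart only once the profile
    # directory has grown past the configured threshold.
    return profile_size_bytes(profile_dir) > max_bytes
```

Calling `needs_restart` once after each command sequence would keep restarts rare while still bounding disk usage, which is the trade-off described above.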

vringar commented 1 year ago

Okay, so I wanted to write tests to ensure this functionality keeps working, but since our other two watchdogs also don't have any tests and I can't think of a good way to test it (restarts are supposed to be transparent/invisible to the user anyway), I'll just set this to automerge and wait for the tests to pass.

I'll create a new release with this feature in the next couple of days. Thank you for your contribution @gridl0ck !