openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io

Managing Temporary Files #1083

Open MohammadMahdiJavid opened 9 months ago

MohammadMahdiJavid commented 9 months ago

Hi,

I'm running large crawls, but I've noticed that temp files are not removed as time passes or as the crawl progresses.

openwpm_profilearchive{some random number} --> each one is roughly 2 GB or more

I was wondering whether I made a mistake in my experiments, or whether this feature is not implemented.

Thanks

vringar commented 9 months ago

From a quick search, I can see the profile.tar getting generated here: https://github.com/openwpm/OpenWPM/blob/f72e7ca1fc3edcc60b26c780c264176e1e384779/openwpm/browser_manager.py#L114-L134 which then gets used here: https://github.com/openwpm/OpenWPM/blob/f72e7ca1fc3edcc60b26c780c264176e1e384779/openwpm/deploy_browsers/deploy_firefox.py#L64-L73

It is never cleaned up. Since the recovery_tar is by definition generated by OpenWPM, OpenWPM should clean it up once the browser has been restored after a crash. Doing an os.remove and unsetting browser_params.recovery_tar after the restore seems reasonable.
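As a rough illustration of that suggestion, here is a minimal sketch of the cleanup step. It is not the actual OpenWPM code: `BrowserParamsSketch` and `discard_recovery_tar` are hypothetical names, and the only thing taken from the discussion above is the idea of calling os.remove and then unsetting `recovery_tar`. Where exactly this would be called after a successful restore is left open.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class BrowserParamsSketch:
    """Hypothetical stand-in for the browser params object; only the field used here."""

    recovery_tar: Optional[Path] = None


def discard_recovery_tar(browser_params: BrowserParamsSketch) -> bool:
    """Delete the recovery tar (if any) and unset the reference to it.

    Returns True if a file was actually removed.
    """
    tar_path = browser_params.recovery_tar
    if tar_path is None:
        return False
    try:
        os.remove(tar_path)
        removed = True
    except FileNotFoundError:
        # Already gone; nothing to delete, but still clear the stale reference.
        removed = False
    browser_params.recovery_tar = None
    return removed
```

Clearing the attribute even when the file is already missing keeps a later restart from trying to load an archive that no longer exists.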

Do you have time to implement this?

MohammadMahdiJavid commented 9 months ago

Hi, thanks for your time and the great insight.

https://github.com/openwpm/OpenWPM/blob/f72e7ca1fc3edcc60b26c780c264176e1e384779/openwpm/browser_manager.py#L221-L230

I see here that tempdir gets removed (although the variable name is hard to read :)), and tempdir is the one used to create the directory:

https://github.com/openwpm/OpenWPM/blob/f72e7ca1fc3edcc60b26c780c264176e1e384779/openwpm/browser_manager.py#L116-L121

I think the issue comes from the profile dump, since the tempdir gets removed when the spawn is successful. Looking more closely at the logs, I found various errors like:


  File "openwpm/commands/profile_commands.py", line 58, in dump_profile
    tar.add(browser_profile_path, arcname="")

  File "python3.9/tarfile.py", line 2172, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),

  File "python3.9/tarfile.py", line 2150, in add
    tarinfo = self.gettarinfo(name, arcname)

  File "python3.9/tarfile.py", line 2023, in gettarinfo
    statres = os.lstat(name)

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/firefox_profile_mp57p7k5/prefs-41.js'

or similar errors for other files, such as:

prefs-41.js

storage.sqlite-journal

WebDriverBiDiServer.json

When the profile is being dumped, the previous browser has already crashed and closed, right? Does it perhaps need a few seconds to remove its temp files, or something like that?

I think this is what causes the archived profiles not to be removed.
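If the dump really does fail because files such as prefs-41.js disappear while the archive is being written, one possible workaround is to add files one at a time and skip any that vanish in between. The sketch below only illustrates that idea with the standard tarfile module. `dump_profile_tolerant` is a hypothetical helper, not the existing dump_profile, and whether this race is the actual cause of the leftover archives is not confirmed.

```python
import os
import tarfile
from pathlib import Path
from typing import List


def dump_profile_tolerant(profile_dir: Path, tar_path: Path) -> List[str]:
    """Archive profile_dir into tar_path, skipping files that vanish mid-dump.

    Returns the archive-relative names of the files that were skipped.
    Note: empty directories are not archived in this sketch.
    """
    skipped: List[str] = []
    with tarfile.open(tar_path, "w") as tar:
        for root, _dirs, files in os.walk(profile_dir):
            for name in files:
                full = Path(root) / name
                arcname = full.relative_to(profile_dir).as_posix()
                try:
                    # recursive=False: each file is added on its own, so one
                    # missing file does not abort the rest of the dump.
                    tar.add(str(full), arcname=arcname, recursive=False)
                except FileNotFoundError:
                    # The browser removed the file between listing and adding.
                    skipped.append(arcname)
    return skipped
```

Logging the returned list would show which temporary files are being removed during the dump, which might also help answer the timing question above.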