Open ndanner-wesleyancs opened 4 weeks ago
Hey, this sounds like smt that should be done in the shutdown code of the browser manager. Please file a pull request Please be aware however, that the Firefox version used by OpenWPM is severely out of date and before running any large scale crawls you should probably take the steps described under docs/Release-Checklist to look closer to normal web traffic
I haven't been able to find anything in the Selenium docs that would let me get the name of the temporary file that is created by install_addon
in order to delete it, though I won't say that I've done a truly exhaustive search.
This issue seems likely related to https://github.com/SeleniumHQ/selenium/issues/10841, and in the comments therein, one of the posters relates it to https://github.com/mozilla/geckodriver/issues/299.
Thanks for pointing out the issue with the FF version; I'll be sure to take that into account.
Just to be sure that this isn't caused by something I'm doing in my crawl script, here is what I think is a pretty minimal crawl script that demonstrates the undesired behavior:
"""
MWE to demonstrate that a copy of the OpenWPM extension XPI file is left behind
in /tmp each time a fresh browser is invoked.
N. Danner
"""
import pathlib
from openwpm.command_sequence import CommandSequence
from openwpm.commands.browser_commands import GetCommand,ScreenshotFullPageCommand
from openwpm.config import BrowserParams, ManagerParams
from openwpm.storage.sql_provider import SQLiteStorageProvider
from openwpm.storage.leveldb import LevelDbProvider
from openwpm.task_manager import TaskManager
NUM_BROWSERS = 1
manager_params = ManagerParams(num_browsers=NUM_BROWSERS)
manager_params.data_directory = pathlib.PosixPath('/tmp/mwe-crawl')
manager_params.log_path = manager_params.data_directory.joinpath("openwpm.log")
browser_params = [
BrowserParams() for _ in range(NUM_BROWSERS)
]
sqlite_db_provider = \
SQLiteStorageProvider(
manager_params.data_directory.joinpath("crawl-data.sqlite"))
with TaskManager(
manager_params,
browser_params,
sqlite_db_provider,
None
) as manager:
pass
After running it, there is an XPI file leftover in /tmp/. Here is the result of running the script, with some of the OpenWPM logging elided:
OpenWPM$ ls -l /tmp/addon*
ls: cannot access '/tmp/addon*': No such file or directory
OpenWPM$ python3 mwe.py
browser_manager - INFO - BROWSER 4022973883: Launching browser...
storage_controller - INFO - Initializing new handler for TaskManager
storage_controller - INFO - Initializing new handler for Browser-4022973883
storage_controller - INFO - Initializing new handler for StorageControllerHandle
storage_controller - INFO - Awaiting all tasks for visit_id -1
task_manager - INFO -
OpenWPM Version: b'v0.28.0'
Firefox Version: b'126.0.1'
...
========== Input profile tar files ==========
No profile tar files specified
========== Output (archive) profile dirs ==========
No profile archive directories specified
storage_controller - INFO - Terminating handler for StorageControllerHandle, because the underlying socket closed
storage_controller - INFO - Terminating handler for Browser-4022973883, because the underlying socket closed
storage_controller - INFO - Terminating handler for TaskManager, because the underlying socket closed
storage_controller - INFO - Received shutdown signal!
storage_controller - INFO - Closing Server
storage_controller - INFO - Closed Server
storage_controller - INFO - Cancelling status_queue_update
storage_controller - INFO - Cancelled status_queue_update
storage_controller - INFO - Cancelling timeout_check
storage_controller - INFO - Cancelled timeout_check
storage_controller - INFO - Starting wait_closed
storage_controller - INFO - Completed wait_closed
storage_controller - INFO - Entering self.shutdown
storage_controller - INFO - structured_storage is shut down
Shutdown took 11.190859317779541 seconds
OpenWPM$ ls -l /tmp/addon*
-rw-rw---- 1 ndanger ndanger 130892 Jun 10 12:05 /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi
OpenWPM$ diff -s /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi Extension/openwpm.xpi
Files /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi and Extension/openwpm.xpi are identical
So right now my workaround, which seems to do the job, is to:
browser_params
;TMPDIR
in the environment before creating the geckodriver
service, because it looks like geckodriver
uses TMPDIR
for its temporary directory.Right now I create the temporary directory inBrowserManagerHandle.launch_browser_manager
and delete it in BrowserManagerHandle.shutdown_browser
. If that doesn't sound right, let me know.
Hey,
sorry for the delayed response.
I think you should handle this entirely within the BrowserManager
.
Create and save the temp dir before Geckodriver gets started: https://github.com/openwpm/OpenWPM/blob/566d03b6317ff29a6fd5f0e427b06b80c29d695e/openwpm/browser_manager.py#L732-L742
and then do the deletion in the finally
block
https://github.com/openwpm/OpenWPM/blob/566d03b6317ff29a6fd5f0e427b06b80c29d695e/openwpm/browser_manager.py#L824-L827
OK, I will do as you suggest. One thing that I also have to do is to modify TMPDIR
in the environment before invoking geckodriver
. My inclination is to do that as close to the invocation as possible, which means in deploy_firefox.deploy_firefox
. Is there an obvious problem with that?
There is no problem with that except that you will need to pass the name of the temp dir back up to the BrowserManager store that information until it is time to clean up
Yes, I am already doing that.
Also, why not do the configuration in BrowserManagerHandle
? I'm really just doing some "pattern matching," and it looks like BrowserManagerHandle.launch_browser_manager
is responsible for setting up the profile directory, which seems like a comparable task.
Similarly, BrowserManagerHandle.shutdown_browser
is responsible for deleting the current profile directory, and that also seems analogous.
Preface: I don't think this is an OpenWPM bug, but I'm wondering if anybody has suggestions on how to work around it.
Platform: OpenWPM v0.28, Selenium 4.21.0, Geckodriver 0.34.0.
When the extension is added to a deployed Firefox instance in
deploy_firefox.py
, line 161, a copy of the XPI file is made in/tmp
, presumably by either Selenium orgeckodriver
. The problem is that it is not deleted.That's a problem, since when I was crawling a large number of sites, I ended up with enough copies of the XPI file in
/tmp
that it filled up.As I said, I don't think this is the fault of OpenWPM. My understanding is that Selenium's
WebDriver.quit()
is supposed to delete temporary files. It does appear to delete profile directories, just not XPI files.Has anyone else seen this? And if so, any suggested workarounds?