openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io
Other
1.32k stars 312 forks source link

XPI files not deleted from /tmp #1090

Open ndanner-wesleyancs opened 4 weeks ago

ndanner-wesleyancs commented 4 weeks ago

Preface: I don't think this is an OpenWPM bug, but I'm wondering if anybody has suggestions on how to work around it.

Platform: OpenWPM v0.28, Selenium 4.21.0, Geckodriver 0.34.0.

When the extension is added to a deployed Firefox instance in deploy_firefox.py, line 161, a copy of the XPI file is made in /tmp, presumably by either Selenium or geckodriver. The problem is that it is not deleted.

That's a problem, since when I was crawling a large number of sites, I ended up with enough copies of the XPI file in /tmp that it filled up.

As I said, I don't think this is the fault of OpenWPM. My understanding is that Selenium's WebDriver.quit() is supposed to delete temporary files. It does appear to delete profile directories, just not XPI files.

Has anyone else seen this? And if so, any suggested workarounds?

vringar commented 4 weeks ago

Hey, this sounds like smt that should be done in the shutdown code of the browser manager. Please file a pull request Please be aware however, that the Firefox version used by OpenWPM is severely out of date and before running any large scale crawls you should probably take the steps described under docs/Release-Checklist to look closer to normal web traffic

ndanner-wesleyancs commented 3 weeks ago

I haven't been able to find anything in the Selenium docs that would let me get the name of the temporary file that is created by install_addon in order to delete it, though I won't say that I've done a truly exhaustive search.

This issue seems likely related to https://github.com/SeleniumHQ/selenium/issues/10841, and in the comments therein, one of the posters relates it to https://github.com/mozilla/geckodriver/issues/299.

ndanner-wesleyancs commented 3 weeks ago

Thanks for pointing out the issue with the FF version; I'll be sure to take that into account.

ndanner-wesleyancs commented 3 weeks ago

Just to be sure that this isn't caused by something I'm doing in my crawl script, here is what I think is a pretty minimal crawl script that demonstrates the undesired behavior:

"""
MWE to demonstrate that a copy of the OpenWPM extension XPI file is left behind
in /tmp each time a fresh browser is invoked.

N. Danner
"""

import pathlib

from openwpm.command_sequence import CommandSequence
from openwpm.commands.browser_commands import GetCommand,ScreenshotFullPageCommand
from openwpm.config import BrowserParams, ManagerParams
from openwpm.storage.sql_provider import SQLiteStorageProvider
from openwpm.storage.leveldb import LevelDbProvider
from openwpm.task_manager import TaskManager

NUM_BROWSERS = 1

manager_params = ManagerParams(num_browsers=NUM_BROWSERS)
manager_params.data_directory = pathlib.PosixPath('/tmp/mwe-crawl')
manager_params.log_path = manager_params.data_directory.joinpath("openwpm.log")

browser_params = [
    BrowserParams() for _ in range(NUM_BROWSERS)
]

sqlite_db_provider = \
    SQLiteStorageProvider(
        manager_params.data_directory.joinpath("crawl-data.sqlite"))

with TaskManager(
    manager_params,
    browser_params,
    sqlite_db_provider,
    None
) as manager:
    pass

After running it, there is an XPI file leftover in /tmp/. Here is the result of running the script, with some of the OpenWPM logging elided:

OpenWPM$ ls -l /tmp/addon*
ls: cannot access '/tmp/addon*': No such file or directory
OpenWPM$ python3 mwe.py 
browser_manager      - INFO     - BROWSER 4022973883: Launching browser...
storage_controller   - INFO     - Initializing new handler for TaskManager
storage_controller   - INFO     - Initializing new handler for Browser-4022973883
storage_controller   - INFO     - Initializing new handler for StorageControllerHandle
storage_controller   - INFO     - Awaiting all tasks for visit_id -1
task_manager         - INFO     - 

OpenWPM Version: b'v0.28.0'
Firefox Version: b'126.0.1'

...

========== Input profile tar files ==========
  No profile tar files specified

========== Output (archive) profile dirs ==========
  No profile archive directories specified

storage_controller   - INFO     - Terminating handler for StorageControllerHandle, because the underlying socket closed
storage_controller   - INFO     - Terminating handler for Browser-4022973883, because the underlying socket closed
storage_controller   - INFO     - Terminating handler for TaskManager, because the underlying socket closed
storage_controller   - INFO     - Received shutdown signal!
storage_controller   - INFO     - Closing Server
storage_controller   - INFO     - Closed Server
storage_controller   - INFO     - Cancelling status_queue_update
storage_controller   - INFO     - Cancelled status_queue_update
storage_controller   - INFO     - Cancelling timeout_check
storage_controller   - INFO     - Cancelled timeout_check
storage_controller   - INFO     - Starting wait_closed
storage_controller   - INFO     - Completed wait_closed
storage_controller   - INFO     - Entering self.shutdown
storage_controller   - INFO     - structured_storage is shut down
Shutdown took 11.190859317779541 seconds
OpenWPM$ ls -l /tmp/addon*
-rw-rw---- 1 ndanger ndanger 130892 Jun 10 12:05 /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi
OpenWPM$ diff -s /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi Extension/openwpm.xpi 
Files /tmp/addon-7d648897-3ebe-447d-9697-d662d428b7d2.xpi and Extension/openwpm.xpi are identical
ndanner-wesleyancs commented 3 weeks ago

So right now my workaround, which seems to do the job, is to:

ndanner-wesleyancs commented 3 weeks ago

Right now I create the temporary directory inBrowserManagerHandle.launch_browser_manager and delete it in BrowserManagerHandle.shutdown_browser. If that doesn't sound right, let me know.

vringar commented 1 week ago

Hey, sorry for the delayed response. I think you should handle this entirely within the BrowserManager. Create and save the temp dir before Geckodriver gets started: https://github.com/openwpm/OpenWPM/blob/566d03b6317ff29a6fd5f0e427b06b80c29d695e/openwpm/browser_manager.py#L732-L742 and then do the deletion in the finally block https://github.com/openwpm/OpenWPM/blob/566d03b6317ff29a6fd5f0e427b06b80c29d695e/openwpm/browser_manager.py#L824-L827

ndanner-wesleyancs commented 1 week ago

OK, I will do as you suggest. One thing that I also have to do is to modify TMPDIR in the environment before invoking geckodriver. My inclination is to do that as close to the invocation as possible, which means in deploy_firefox.deploy_firefox. Is there an obvious problem with that?

vringar commented 1 week ago

There is no problem with that except that you will need to pass the name of the temp dir back up to the BrowserManager store that information until it is time to clean up

ndanner-wesleyancs commented 1 week ago

Yes, I am already doing that.

ndanner-wesleyancs commented 1 week ago

Also, why not do the configuration in BrowserManagerHandle? I'm really just doing some "pattern matching," and it looks like BrowserManagerHandle.launch_browser_manager is responsible for setting up the profile directory, which seems like a comparable task.

Similarly, BrowserManagerHandle.shutdown_browser is responsible for deleting the current profile directory, and that also seems analogous.