openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io

dump_profile leads to crash #253

Closed turban1988 closed 3 years ago

turban1988 commented 5 years ago

Hi, from time to time the crawl crashes with the following error when the dump_profile command is executed (see below). It happens after several site visits.

profile_commands     - WARNING  - BROWSER 8: /tmp/rust_mozprofile.ouql8cAxDcO7/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 7: /tmp/rust_mozprofile.6O33YUjOoyr4/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 2: /tmp/rust_mozprofile.bLtCtJWCQGbL/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 13: /tmp/rust_mozprofile.71sn24aq3pPd/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 10: /tmp/rust_mozprofile.HEmV10b4ycHd/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 4: /tmp/rust_mozprofile.DnYF1I9lxEvi/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 15: /tmp/rust_mozprofile.cLpMmF3SjTX6/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 14: /tmp/rust_mozprofile.sodQ60ETatbd/webapps NOT FOUND IN profile folder, skipping.
profile_commands     - WARNING  - BROWSER 11: /tmp/rust_mozprofile.Le0cbWAICSPX/webapps NOT FOUND IN profile folder, skipping.
BrowserManager       - INFO     - BROWSER 7: Crash in driver, restarting browser manager 
 Traceback (most recent call last):
  File "/home/openwpm/new_try/OpenWPM/automation/BrowserManager.py", line 362, in BrowserManager
    with open(ep_filename, 'rt') as f:
IOError: [Errno 2] No such file or directory: '/tmp/rust_mozprofile.dXVx1eQzDRJf/extension_port.txt'

BrowserManager       - ERROR    - BROWSER 7: Spawn unsuccessful  | Proxy Ready: False  | Profile Created: True  | Profile Tar: True  | Display: True  | Launch Attempted: True  | Browser Launched: True  | Browser Ready: False 
TaskManager          - INFO     - BROWSER 9: Timeout while executing command, DEEP_BROWSE, killing browser manager
56830 TaskManager          - INFO     - BROWSER 1: Timeout while executing command, DEEP_BROWSE, killing browser manager
TaskManager          - INFO     - BROWSER 5: Timeout while executing command, DEEP_BROWSE, killing browser manager
TaskManager          - INFO     - BROWSER 3: Timeout while executing command, DEEP_BROWSE, killing browser manager
profile_commands     - CRITICAL - BROWSER 1: /tmp/rust_mozprofile.iB7EBmyFCwRK/cookies.sqlite NOT FOUND IN profile folder, skipping.
Traceback (most recent call last):
  File "data_collection.py", line 75, in <module>
    manager.close()
  File "/home/openwpm/new_try/OpenWPM/automation/TaskManager.py", line 574, in close
    self._shutdown_manager()
  File "/home/openwpm/new_try/OpenWPM/automation/TaskManager.py", line 266, in _shutdown_manager
    browser.shutdown_browser(during_init)
  File "/home/openwpm/new_try/OpenWPM/automation/BrowserManager.py", line 325, in shutdown_browser
    save_flash=self.browser_params['disable_flash'] is False
  File "/home/openwpm/new_try/OpenWPM/automation/Commands/profile_commands.py", line 177, in dump_profile
    tar.add(full_path, arcname=item)
  File "/usr/lib/python2.7/tarfile.py", line 2009, in add
    tarinfo = self.gettarinfo(name, arcname)
  File "/usr/lib/python2.7/tarfile.py", line 1881, in gettarinfo
    statres = os.lstat(name)
OSError: [Errno 2] No such file or directory: '/tmp/rust_mozprofile.iB7EBmyFCwRK/cookies.sqlite'

openwpm.log error.txt

englehardt commented 5 years ago

Thanks for the report! This seems to be a known issue related to #161.

turban1988 commented 5 years ago

Is there anything I can do to avoid this (e.g., fewer browsers or more resources)?

englehardt commented 5 years ago

Unfortunately we haven't investigated beyond what you see in the bugs, so I don't have advice for avoiding the issue.

I am happy to provide feedback and pointers to the relevant code if you'd like to investigate further. I suspect this is caused by a race condition in how geckodriver handles Firefox profiles. This was introduced after we moved over to geckodriver.
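If the root cause is indeed a race in which profile files disappear while they are being archived, one defensive option is to let the archiving step tolerate missing files instead of crashing. The following is only a minimal sketch of that idea, not OpenWPM's actual `dump_profile` code; the function name `dump_profile_items` and its return value are assumptions made for illustration.

```python
import os
import tarfile


def dump_profile_items(profile_dir, tar_path, items):
    """Archive the listed profile items, skipping any that are missing.

    Hypothetical helper: returns (archived, skipped) lists of item names.
    """
    archived, skipped = [], []
    with tarfile.open(tar_path, "w:gz") as tar:
        for item in items:
            full_path = os.path.join(profile_dir, item)
            try:
                # No separate existence check: a file can vanish between the
                # check and tar.add(), so the add itself is guarded instead.
                tar.add(full_path, arcname=item)
                archived.append(item)
            except OSError:
                # Includes FileNotFoundError for a missing file or folder.
                skipped.append(item)
    return archived, skipped
```

Guarding the `tar.add` call itself, rather than checking for the file first, avoids the window in which the file can be deleted between the check and the archive step.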

turban1988 commented 5 years ago

Hi, I would be happy if you could give me some pointers on where to look for the cause of the problem.

turban1988 commented 5 years ago

I ran a measurement with just one browser and it still appeared.

felix4webscience commented 5 years ago

@turban1988, hey there. As far as I can see, you have several issues. May I ask how many URLs you are crawling? Are you using a modified version of demo.py or files you configured yourself? Are you using classic domains or more specific ones?

However, I ran into the same problem after a scan of more than 1000 URLs. Reducing my sample set to 500 URLs helped to reduce those issues, and I thought it might have something to do with cache errors or something like that. I reduced the sample set again and used around 20 random URLs like twitter, facebook, yahoo, and some others, while just using the "demo.py" setup.

As you are saying, the issue shows up only for a few domains. In my last run it occurred for the following URLs: http://qq.com http://sina.com.cn http://hao123.com http://yandex.ru

Because the error appears after a "Time Out" message, I changed the following settings: sleep=0 -> sleep=15 and timeout=60 -> timeout=360, and I used command_sequence.dump_profile_cookies(360).

I tested it again and again and no longer receive any errors.

I am new to this topic and I am not sure if this solution is suitable for you as well.
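The settings change described above can be sketched roughly as follows. This assumes that `sleep` and `timeout` are keyword arguments of the command sequence's `get` call, which the quoted settings suggest but the thread does not show. `RecordingSequence` is a hypothetical stand-in so the snippet runs without OpenWPM installed, and `queue_visit` is an illustrative helper name, not part of OpenWPM.

```python
class RecordingSequence:
    """Hypothetical stand-in for a command sequence; it only records calls."""

    def __init__(self):
        self.calls = []

    def get(self, sleep=0, timeout=60):
        self.calls.append(("get", sleep, timeout))

    def dump_profile_cookies(self, timeout=60):
        self.calls.append(("dump_profile_cookies", timeout))


def queue_visit(sequence, sleep=15, timeout=360, dump_timeout=360):
    """Queue a page visit and a cookie dump with the longer timeouts."""
    # Previously sleep=0 and timeout=60, according to the comment above.
    sequence.get(sleep=sleep, timeout=timeout)
    sequence.dump_profile_cookies(dump_timeout)
    return sequence


sequence = queue_visit(RecordingSequence())
print(sequence.calls)
```

In a real crawl the object passed in would be the actual command sequence rather than the recorder.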

turban1988 commented 5 years ago

Hi @felix4webscience, I am currently using a modified OpenWPM version, but the error also occurs when I use demo.py.

I am not quite sure what you mean by "classic" domains, but I am using subsites rather than TLD+1 (e.g., I use https://www.google.com/search?q=openWPM and not https://www.google.com).

I am using a timeout of 120. I will try larger timeouts if I get the time to do so.

felix4webscience commented 5 years ago

Hi @turban1988,

Sorry, by "classic" domains I meant homepages (TLDs). I have a few ideas why this error occurs:

1.) Does "geckodriver" or "selenium" maybe check /robots.txt by default? Some webmasters set a timeout and/or sleep parameter for crawlers, depending on the given User-Agent.
2.) If you scan subsites, robots.txt could also be the issue if 1.) is the case, as some websites Disallow access to specific directories only.

However, I would be interested to know whether increasing the timeout parameter helped, at least for the following error type: /tmp/rust_mozprofile.ouql8cAxDcO7/webapps
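One way to test the robots.txt idea offline is to check whether a crawled URL would be disallowed at all. As far as I know, a browser driven through Selenium and geckodriver does not consult robots.txt on its own, so this only shows what a well-behaved crawler would be told. The sketch below uses Python's standard `urllib.robotparser`; the sample rules and the helper name `is_allowed` are made up for illustration.

```python
from urllib import robotparser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/search?q=openWPM",
            "https://example.com/private/page"):
    print(url, is_allowed(SAMPLE_ROBOTS, "Mozilla/5.0", url))
```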

turban1988 commented 5 years ago

Hi,

2) is possible but I guess unlikely in my case since I only visit subsites that are linked (i.e., there is a hyperlink to that page) on the frontpage (TLD+1) or linked on subpages.

turban1988 commented 5 years ago

I set timeout=500 and the crawl no longer crashed with the reported error.

felix4webscience commented 5 years ago

Hey,

In your report above I can see several different issues:

profile commands: (Warning) 1.) /tmp/rust_mozprofile.ouql8cAxDcO7/webapps NOT FOUND IN profile folder, skipping.

OSError: 1.) /tmp/rust_mozprofile.xxx/cookies.sqlite NOT FOUND IN profile folder, skipping.

IOError: 1.) /tmp/rust_mozprofile.dXVx1eQzDRJf/extension_port.txt

Browser Manager: (Error) 1.) BrowserManager - ERROR - BROWSER 7: Spawn unsuccessful | Proxy Ready: False | Profile Created: True | Profile Tar: True | Display: True | Launch Attempted: True | Browser Launched: True | Browser Ready: False
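For the IOError on extension_port.txt, a common pattern is to wait for the file to appear, up to a deadline, before opening it. The sketch below is a generic polling helper under that assumption; it is not how OpenWPM's BrowserManager is implemented, and `wait_for_file` with its default values is a hypothetical name chosen here.

```python
import os
import time


def wait_for_file(path, timeout=10.0, poll_interval=0.1):
    """Return True once `path` exists, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A caller would open the file only when the helper returns True and otherwise report a clear startup failure instead of an unhandled exception.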

Following up on your answer, do you know whether the number of profile_commands warnings went down? I have no immediate solution, but for warnings involving "rust_mozprofile" I would search for solutions within the Rust community, or for known issues with geckodriver and maybe selenium.

You will probably have to make some test runs, such as reducing the number of URLs, switching your IP address, or deleting your cache (just in case). Then you could use the Python debugger, pdb (https://docs.python.org/2/library/pdb.html), to test whether this issue happens for certain URLs only and whether you can reproduce the error. If you are using PyCharm, you can run its built-in debugger, and PyCharm will flag incompatible versions or library dependencies.

It would be great if you could share your solution if you find one in the end. Greetings
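To compare test runs, it helps to count log messages per module and level rather than eyeballing the log. The sketch below assumes log lines shaped like the ones in the report at the top of this thread (module, level, then "BROWSER n:"); the regular expression and the function name `count_by_level` are my own and may need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches e.g. "profile_commands     - WARNING  - BROWSER 8: ..."
LOG_LINE = re.compile(
    r"(?P<module>\w+)\s+-\s+(?P<level>[A-Z]+)\s+-\s+BROWSER (?P<browser>\d+):"
)


def count_by_level(lines):
    """Count log lines per (module, level); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[(match.group("module"), match.group("level"))] += 1
    return counts
```

Running it over the log of each test run, for example with `count_by_level(open("openwpm.log"))`, gives numbers that can be compared directly between a run with timeout=120 and one with timeout=500.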

vp01020 commented 2 years ago

@felix4webscience, are you still working on this issue? I am a newbie trying to investigate and write a paper using OpenWPM, but I am not really a programmer. I have very limited knowledge of what is going on in the demo.py. I am struggling to run the file and I keep getting the selenium module not found, even though I have tried to set the path for it. The browser keeps failing to launch it says. Help will be appreciated thanks!

felix4webscience commented 2 years ago

Hi, well, no, I just keep following the progress of OpenWPM but am not actively working with it anymore. Maybe I can try to help you out. If you are new to programming, I would suggest that whenever you have technical questions, you add the relevant system information. Which OS are you running? If Ubuntu, which version? Are you using a Mac with Docker? Which IDE are you using? Which Python version? Are you using a virtual machine (e.g. VirtualBox, VMware)? This information is important for any programmer trying to replicate your issue.

However, if Selenium was not found, it seems the library has not been installed correctly. Note that the import can differ slightly depending on the OS you are using. Maybe it is worth watching a tutorial about Selenium, getting used to the basics, and running some easy projects first before starting with OpenWPM, which is quite complex? In any case, read the Selenium manual; it is very enlightening when starting out. I was a newbie as well when I started, and it took me a bunch of Python tutorials to catch up 🙃. But finally its a great project and worth it.
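A quick way to gather part of that system information, and to see whether the interpreter that runs demo.py can find Selenium at all, is a short check like the one below. It uses only the standard library; `describe_module` is a hypothetical helper name. A frequent cause of "module not found" is that the package was installed for a different Python interpreter than the one executing the script, which the printed interpreter path helps to spot.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether `name` is importable."""
    spec = importlib.util.find_spec(name)
    return {
        "python": sys.executable,
        "version": sys.version.split()[0],
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(describe_module("selenium"))
```

If "found" is False, installing the package with the same interpreter that is listed under "python" is usually the next thing to try.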

vringar commented 2 years ago

@vp01020 Do you want to come into the Matrix chat and we'll talk about your setup problems?

But finally its a great project and worth it.

@felix4webscience thanks for the kind words. 😊