scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Error on windows when trying to deploy - [WinError 32] #494

Closed amartins2imp closed 11 months ago

amartins2imp commented 11 months ago

I have successfully installed and run Scrapyd on Windows. However, when I try to deploy to Scrapyd, I get the following error:

Traceback (most recent call last):
  File "C:\Users\myuser\AppData\Local\pypoetry\Cache\virtualenvs\myproject-J0q5INJf-py3.11\Lib\site-packages\scrapyd\runner.py", line 35, in project_environment
    yield
  File "C:\Users\myuser\AppData\Local\pypoetry\Cache\virtualenvs\myproject-J0q5INJf-py3.11\Lib\site-packages\scrapyd\runner.py", line 45, in main
    execute()
  File "C:\Users\myuser\AppData\Local\pypoetry\Cache\virtualenvs\myproject-J0q5INJf-py3.11\Lib\site-packages\scrapy\cmdline.py", line 162, in execute
    sys.exit(cmd.exitcode)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\myuser\AppData\Local\pypoetry\Cache\virtualenvs\myproject-J0q5INJf-py3.11\Lib\site-packages\scrapyd\runner.py", line 49, in <module>
    main()
  File "C:\Users\myuser\AppData\Local\pypoetry\Cache\virtualenvs\myproject-J0q5INJf-py3.11\Lib\site-packages\scrapyd\runner.py", line 43, in main
    with project_environment(project):
  File "C:\Users\myuser\.pyenv\pyenv-win\versions\3.11.5\Lib\contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "C:\Users\myuser\AppData\Local\pypoetry\Cache\virtualenvs\myproject-J0q5INJf-py3.11\Lib\site-packages\scrapyd\runner.py", line 38, in project_environment
    os.remove(eggpath)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\myuser\\AppData\\Local\\Temp\\myproject-r23-8_8_ghhi.egg'

I have tried both the plain Scrapyd API (curl http://localhost:6800/addversion.json -F project=myproject -F version=r23 -F egg=@myproject.egg) and scrapyd-deploy from scrapyd-client.

I am using Windows 11 with Python 3.11.4.

Any help will be appreciated!

jpmckinney commented 11 months ago

Hmm

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\myuser\AppData\Local\Temp\myproject-r23-8_8_ghhi.egg'

Can you check that you aren't running multiple Scrapyd processes?

amartins2imp commented 11 months ago

Yes, I wasn't running multiple Scrapyd processes. I even rebooted my PC and retried everything in a clean environment.

jpmckinney commented 11 months ago

Does Windows have a utility to determine which processes are using a given file?

On Linux, lsof can be used for this purpose.

I don't think Scrapyd is causing the issue, as only one process would be trying to access the egg.

sanzenwin commented 11 months ago

Same issue on Windows 11. I inserted import time; time.sleep(5000) before os.remove(eggpath) and then used Microsoft PowerToys File Locksmith to check the egg file. This shows that Scrapyd does cause the issue.

jpmckinney commented 11 months ago

I've committed a fix to HEAD. Can you test with the version of Scrapyd from GitHub?

sanzenwin commented 11 months ago

It seems to be unrelated to tempfile; the problem is related to try/finally:

import os
import shutil
import sys
import tempfile
# from contextlib import contextmanager

from scrapy.utils.misc import load_object

from scrapyd import Config
from scrapyd.eggutils import activate_egg

def project_environment(project):
    eggversion = os.environ.get('SCRAPYD_EGG_VERSION', None)
    config = Config()
    eggstorage_path = config.get(
        'eggstorage', 'scrapyd.eggstorage.FilesystemEggStorage'
    )
    eggstorage_cls = load_object(eggstorage_path)
    eggstorage = eggstorage_cls(config)

    version, eggfile = eggstorage.get(project, eggversion)
    if eggfile:
        prefix = '%s-%s-' % (project, version)
        f = tempfile.NamedTemporaryFile(suffix='.egg', prefix=prefix, delete=False)
        shutil.copyfileobj(eggfile, f)
        f.close()
        activate_egg(f.name)
    else:
        f = None
    return f
    # try:
    #     assert 'scrapy.conf' not in sys.modules, "Scrapy settings already loaded"
    #     yield
    # finally:
    #     if f:
    #         os.remove(f.name)

def main_finally():
    project = os.environ['SCRAPY_PROJECT']
    f = None
    try:
        f = project_environment(project)
        from scrapy.cmdline import execute
        execute()
    finally:
        if f:
            os.remove(f.name)

def main():
    project = os.environ['SCRAPY_PROJECT']
    f = None
    f = project_environment(project)
    from scrapy.cmdline import execute
    execute()
    if f:
        os.remove(f.name)

if __name__ == '__main__':
    main()  # works fine
    # main_finally()  # raises the error

jpmckinney commented 11 months ago

They are connected - if the tempfile is never created, then it can never be removed.

Anyway, can you add eggfile.close() after f.close() to see what happens?
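The suggestion above can be sketched as follows. This is an illustration under assumptions, not Scrapyd's code: io.BytesIO stands in for the file object returned by the egg storage, and copy_to_temp is a hypothetical helper. Both the source stream and the temporary file are closed when the with block exits, so this process holds no handle on the copy when os.remove runs. The sketch only shows the handle discipline and makes no claim about resolving the reported error.

```python
import io
import os
import shutil
import tempfile


def copy_to_temp(eggfile, prefix="demo-"):
    """Copy an open egg stream to a temporary .egg file and return its path.

    Hypothetical helper for illustration only.
    """
    # Leaving the with block closes both the source stream and the temp file.
    with eggfile, tempfile.NamedTemporaryFile(
        suffix=".egg", prefix=prefix, delete=False
    ) as tmp:
        shutil.copyfileobj(eggfile, tmp)
    return tmp.name


def demo():
    source = io.BytesIO(b"not a real egg")  # stand-in for the stored egg
    path = copy_to_temp(source)
    try:
        with open(path, "rb") as f:
            data = f.read()
    finally:
        os.remove(path)  # no handle from this process is open at this point
    return source.closed, data, os.path.exists(path)
```

Calling demo() copies the bytes, reads them back, and removes the temporary file once every handle has been closed.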

jpmckinney commented 11 months ago

The error doesn't occur when you remove the exception handling because Scrapy raises SystemExit, which ends the process before os.remove is reached. The exception handling is there precisely so that the cleanup still runs in that case.
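A minimal sketch of this behavior, assuming nothing about Scrapyd's internals: fake_execute is a hypothetical stand-in for scrapy.cmdline.execute that always calls sys.exit(0). Cleanup written after the call is skipped because SystemExit propagates first, while cleanup in a finally block still runs.

```python
import sys


def fake_execute():
    # Hypothetical stand-in for scrapy.cmdline.execute, which ends in sys.exit().
    sys.exit(0)


def run_without_finally(log):
    fake_execute()
    log.append("cleanup")  # never reached: SystemExit propagates first


def run_with_finally(log):
    try:
        fake_execute()
    finally:
        log.append("cleanup")  # runs while SystemExit is propagating


def demo():
    results = {}
    runners = (
        ("without_finally", run_without_finally),
        ("with_finally", run_with_finally),
    )
    for name, runner in runners:
        log = []
        try:
            runner(log)
        except SystemExit:
            pass  # swallow the exit so the demo can continue
        results[name] = log
    return results
```

This is why the variant without try/finally appears to work: the removal step is simply never executed.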

sanzenwin commented 11 months ago

@jpmckinney

import os
import sys
import tempfile
import shutil
import operator
import functools
import pkg_resources
import itertools
from importlib.metadata._itertools import unique_everseen
from importlib.metadata import distributions

def activate_egg(eggpath):
    """Activate a Scrapy egg file. This is meant to be used from egg runners
    to activate a Scrapy egg file. Don't use it from other code as it may
    leave unwanted side effects.
    """
    try:
        d = next(pkg_resources.find_distributions(eggpath))
    except StopIteration:
        raise ValueError("Unknown or corrupt egg")
    d.activate()
    settings_module = d.get_entry_info('scrapy', 'settings').module_name
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_module)

def main():
    eggfile = open('./0_1_0.egg', 'rb')

    f = None
    try:
        f = tempfile.NamedTemporaryFile(suffix='.egg', delete=False)
        shutil.copyfileobj(eggfile, f)
        activate_egg(f.name)
        f.close()

        # from scrapy.cmdline import execute
        # execute(['C:\\Users\\Sanze\\AppData\\Local\\pdm\\pdm\\Cache\\packages\\scrapyd-1.4.2-py2.py3-none-any\\lib\\scrapyd\\runner.py', 'list', '-s', 'LOG_STDOUT=0'])

        # traceback
        #
        # scrapy.cmdline.execute
        # |
        # scrapy.cmdline._get_commands_dict
        # |
        # scrapy.cmdline._get_commands_from_entry_points
        # |
        # importlib.metadata.entry_points

        norm_name = operator.attrgetter('_normalized_name')
        unique = functools.partial(unique_everseen, key=norm_name)

        list(dist.entry_points for dist in unique(distributions()))
        # The line above causes PermissionError: [WinError 32] The process cannot access the file because it is being used by another process

    finally:
        if f:
            os.remove(f.name)

if __name__ == '__main__':
    main()

jpmckinney commented 11 months ago

@sanzenwin

As I requested, please test by adding eggfile.close() after f.close(), to see what happens.

jpmckinney commented 11 months ago

Please test HEAD again – I added that line myself, and also got rid of the temporary file.

sanzenwin commented 11 months ago

I had already tried adding eggfile.close(), and it still threw that error. The key is list(dist.entry_points for dist in unique(distributions())), which is used by scrapy.cmdline.execute. Please check the code I posted.

jpmckinney commented 11 months ago

That line must be opening the file a second time. But for the error to occur, the file must already have been opened somewhere else. I doubt the Python standard library (importlib) has a Windows bug that opens files twice on its own.

Can you try the new HEAD from GitHub?
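The point about a file being held open elsewhere can be illustrated with a small, platform-dependent sketch. It is not taken from Scrapyd or importlib, and try_remove_while_open is a hypothetical helper. On Windows, removing a file while a handle opened with Python's open() is still alive is expected to fail with PermissionError (WinError 32); on Linux and macOS the removal normally succeeds.

```python
import os
import tempfile


def try_remove_while_open():
    """Try to remove a file that still has an open handle.

    Returns (removed_while_open, error_name).
    """
    fd, path = tempfile.mkstemp(suffix=".egg")
    os.close(fd)
    handle = open(path, "rb")  # a second user of the file, left open on purpose
    try:
        try:
            os.remove(path)
            return True, None
        except PermissionError as exc:
            return False, type(exc).__name__
    finally:
        # Release the handle, then clean up if the first removal failed.
        handle.close()
        if os.path.exists(path):
            os.remove(path)
```

Whichever branch is taken, the file is gone once the handle has been closed, which matches the idea that the removal only fails while something still holds the file open.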

sanzenwin commented 11 months ago

I checked HEAD and there are no changes. Have you pushed the commit?

jpmckinney commented 11 months ago

Ah, sorry, I forgot to push: you can try now.

sanzenwin commented 11 months ago

I have tried it, and it works.

jpmckinney commented 11 months ago

Thank you for confirming!