scrapinghub / shub

Scrapinghub Command Line Client
https://shub.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Executing commands on scrapinghub cloud #430

Open hamzza-K opened 1 year ago

hamzza-K commented 1 year ago

Hi, I'm trying to deploy my spider that uses playwright (scrapy-playwright for integration). I have the following configuration: scrapinghub.yml

requirements:
  file: requirements.txt
cmd:
- export PATH=/app/python/bin:$PATH
- playwright install
- playwright install-deps

I can see in the deploy logs that the modules are installed successfully, but how can I execute the commands above? Playwright needs them to be run after a fresh installation. I couldn't find anything related to this in the Zyte documentation.

elacuesta commented 1 year ago

You need to deploy a custom Docker image in order to have arbitrary commands executed.

Regarding scrapy-playwright, I have a sample project that demonstrates how to use it on Scrapy Cloud. Disclaimer: this is a personal project, it is NOT an officially supported Scrapy stack.

hamzza-K commented 1 year ago

I tried to hack a workaround by using the scripts argument in setup.py:

script.py


```python
import subprocess

def run_bash_command(command):
    try:
        result = subprocess.run(command, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        print(f"Error running command: {e}")
        return None

commands = ["export PATH=/app/python/bin:$PATH", "playwright install", "playwright install-deps"]

for command in commands:
    output = run_bash_command(command)
    if output is not None:
        print("Command output:")
        print(output)
```

I'm not sure why this doesn't set the PATH correctly and fix the following warning in the logs:
  `WARNING: The scripts pip, pip3 and pip3.8 are installed in '/app/python/bin' which is not on PATH.`
0: 2023-07-26 12:39:49 INFO Log opened.
1: 2023-07-26 12:39:50 INFO [stdout] Command output:
2: 2023-07-26 12:39:50 INFO [stdout]
3: 2023-07-26 12:39:50 INFO [stdout] Error running command: Command 'playwright install' returned non-zero exit status 127.
4: 2023-07-26 12:39:50 INFO [stdout] Error running command: Command 'playwright install-deps' returned non-zero exit status 127.


This works fine on my local environment. Can you explain why this won't work here?

Thanks for sharing the Dockerfile setup. I'm afraid that's the only way for me to get what I want. I really wish there were a way to open a bash shell, just like in Heroku.
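
On the question of why `script.py` reports exit status 127: a likely explanation is that every `subprocess.run(..., shell=True)` call starts its own shell, so the `export PATH=...` in the first command is discarded before `playwright install` runs, and the shell then returns 127, its "command not found" status. The sketch below illustrates this on a POSIX system. `DEMO_EXPORTED_VAR`, `shell_output`, and the deliberately missing command name are hypothetical names chosen for the illustration; `/app/python/bin` is the directory from the warning above.

```python
import os
import subprocess

def shell_output(command, env=None):
    """Run a command in a new shell and return its stripped stdout."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, env=env
    )
    return result.stdout.strip()

# 1. The export runs in its own shell, which exits right away,
#    so the next shell no longer sees the variable.
subprocess.run("export DEMO_EXPORTED_VAR=1", shell=True, check=True)
lost = shell_output('echo "${DEMO_EXPORTED_VAR:-unset}"')

# 2. Chaining with && keeps both steps inside the same shell.
chained = shell_output('export DEMO_EXPORTED_VAR=1 && echo "$DEMO_EXPORTED_VAR"')

# 3. Alternatively, pass a modified environment to the child explicitly.
env = dict(os.environ)
env["PATH"] = "/app/python/bin" + os.pathsep + env.get("PATH", "")
path_seen = shell_output('echo "$PATH"', env=env)

# 4. A command the shell cannot find yields exit status 127,
#    the same status shown in the job log.
missing_code = subprocess.run(
    "hypothetical-command-that-does-not-exist",
    shell=True,
    capture_output=True,
).returncode

print(lost, chained, path_seen.split(os.pathsep)[0], missing_code)
```

This only addresses how the PATH is passed to the child processes. `playwright install-deps` installs operating-system packages, so it may still fail for other reasons in a hosted environment, and the custom Docker image suggested above remains the route for running arbitrary commands.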