scrapfly / scrapfly-scrapers

Web scrapers for popular targets powered by Scrapfly.io
https://scrapfly.io

Trouble following instructions #3

Closed esaumell closed 9 months ago

esaumell commented 10 months ago

Hi there,

I'm having trouble with step 2 of the instructions at https://github.com/scrapfly/scrapfly-scrapers/tree/main/bookingcom-scraper when trying git clone git@github.com:scrapfly/scrapfly-scrapers.git

I got:

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

As a workaround I did git clone https://github.com/scrapfly/scrapfly-scrapers.git
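The git@github.com:... form goes over SSH and needs an SSH key registered with GitHub, while the https:// form of a public repository can be cloned anonymously. A minimal sketch of the rewrite used in the workaround above; the helper name ssh_to_https is hypothetical and not part of the repository:

```python
import re


def ssh_to_https(url: str) -> str:
    """Convert an scp-style Git URL (git@host:owner/repo.git) to its HTTPS form.

    Hypothetical helper for illustration; URLs in any other form are returned unchanged.
    """
    match = re.fullmatch(r"git@([^:]+):(.+)", url)
    if match is None:
        return url  # not an scp-style SSH URL, leave it alone
    host, path = match.groups()
    return f"https://{host}/{path}"


print(ssh_to_https("git@github.com:scrapfly/scrapfly-scrapers.git"))
# https://github.com/scrapfly/scrapfly-scrapers.git
```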

The next step is poetry install . and I got: -bash: poetry: command not found

I went to https://scrapfly.io/blog/how-to-scrape-bookingcom/ and, as stated there, tried pip install "httpx[http2,brotli]" parsel and got: -bash: pip: command not found

As a workaround I did sudo apt install python3-pip and then pip install "httpx[http2,brotli]" parsel

I got:

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.

    If you wish to install a non-Debian-packaged Python package,
    create a virtual environment using python3 -m venv path/to/venv.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
    sure you have python3-full installed.

    If you wish to install a non-Debian packaged Python application,
    it may be easiest to use pipx install xyz, which will manage a
    virtual environment for you. Make sure you have pipx installed.

    See /usr/share/doc/python3.11/README.venv for more information.

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.
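PEP 668 has the distribution place a marker file named EXTERNALLY-MANAGED in the interpreter's stdlib directory, and pip refuses to install into that environment unless it is running inside a virtual environment. A rough sketch of that check, for diagnosis only; the function names are made up for this example and this is not pip's actual code:

```python
import sys
import sysconfig
from pathlib import Path


def is_externally_managed() -> bool:
    """True if the PEP 668 marker file exists next to the standard library."""
    stdlib_dir = Path(sysconfig.get_path("stdlib"))
    return (stdlib_dir / "EXTERNALLY-MANAGED").is_file()


def in_virtualenv() -> bool:
    """True when running inside a venv (sys.prefix differs from the base prefix)."""
    return sys.prefix != sys.base_prefix


if is_externally_managed() and not in_virtualenv():
    print("System Python is externally managed: create a venv with python3 -m venv path/to/venv")
else:
    print("No PEP 668 guard applies to this interpreter")
```

Running it with the system python3 and again with path/to/venv/bin/python should show the difference between the two environments.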

After searching for how to install Poetry, I ran curl -sSL https://install.python-poetry.org | python3 -, added Poetry's bin directory to my PATH, and tried poetry install . again. I got:

Creating virtualenv scrapfly-booking-tyZw0pBk-py3.11 in /home/xxxx/.cache/pypoetry/virtualenvs

No arguments expected for "install" command, got "."

I decided to go for step 3: poetry run python run.py

I got:

Traceback (most recent call last):
  File "/home/esaumell/scrapfly-scrapers/bookingcom-scraper/run.py", line 12, in <module>
    import bookingcom
  File "/home/esaumell/scrapfly-scrapers/bookingcom-scraper/bookingcom.py", line 19, in <module>
    from loguru import logger as log
ModuleNotFoundError: No module named 'loguru'
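The traceback means loguru is not importable from the environment that poetry run used, which is expected if the dependency install has not completed yet. A small sketch for checking this ahead of time; missing_modules is a hypothetical helper, and it is meant to be run with the same interpreter (for example via poetry run python):

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# loguru is the module named in the traceback; json is a stdlib module for comparison.
print(missing_modules(["loguru", "json"]))
```

An empty list means every listed module can be imported from that environment.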

Feeling brave, I went on to step 4: poetry install --with dev and got:

Installing dependencies from lock file

Package operations: 67 installs, 1 update, 0 removals

  • Installing certifi (2023.5.7)
  • Installing charset-normalizer (3.1.0)
  • Installing idna (3.4)
  • Installing pycparser (2.21)
  • Installing six (1.16.0)
  • Installing urllib3 (2.0.2)
  • Installing attrs (23.1.0)
  • Installing cffi (1.15.1)
  • Installing cssselect (1.2.0)
  • Installing jmespath (1.0.1)
  • Installing isodate (0.6.1)
  • Installing lxml (4.9.2)
  • Installing packaging (23.1)
  • Installing pyasn1 (0.5.0)
  • Installing pyparsing (3.0.9)
  • Installing requests (2.31.0)
  • Downgrading setuptools (68.1.2 -> 67.8.0)
  • Installing soupsieve (2.4.1)
  • Installing w3lib (2.1.1)
  • Installing webencodings (0.5.1)
  • Installing automat (22.10.0)
  • Installing beautifulsoup4 (4.12.2)
  • Installing constantly (15.1.0)
  • Installing cryptography (40.0.2)
  • Installing filelock (3.12.0)
  • Installing html5lib (1.1)
  • Installing hyperlink (21.0.0)
  • Installing incremental (22.10.0)
  • Installing itemadapter (0.8.0)
  • Installing parsel (1.8.1)
  • Installing pyasn1-modules (0.3.0)
  • Installing rdflib (6.3.2)
  • Installing requests-file (1.5.1)
  • Installing typing-extensions (4.6.1)
  • Installing zope-interface (6.0)
  • Installing html-text (0.5.2)
  • Installing iniconfig (2.0.0)
  • Installing itemloaders (1.1.0)
  • Installing jstyleson (0.0.2)
  • Installing mf2py (1.1.2)
  • Installing pluggy (1.0.0)
  • Installing protego (0.2.1)
  • Installing pydispatcher (2.0.7)
  • Installing pyopenssl (23.1.1)
  • Installing pyrdfa3 (3.5.3)
  • Installing queuelib (1.6.2)
  • Installing service-identity (21.1.0)
  • Installing tldextract (3.4.4)
  • Installing twisted (22.10.0)
  • Installing backoff (2.2.1)
  • Installing brotlipy (0.7.0)
  • Installing cchardet (2.1.7): Failed

  ChefBuildError

  Backend subprocess exited when trying to invoke build_wheel

  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-aarch64-cpython-311
  creating build/lib.linux-aarch64-cpython-311/cchardet
  copying src/cchardet/version.py -> build/lib.linux-aarch64-cpython-311/cchardet
  copying src/cchardet/__init__.py -> build/lib.linux-aarch64-cpython-311/cchardet
  running build_ext
  building 'cchardet._cchardet' extension
  creating build/temp.linux-aarch64-cpython-311
  creating build/temp.linux-aarch64-cpython-311/src
  creating build/temp.linux-aarch64-cpython-311/src/cchardet
  creating build/temp.linux-aarch64-cpython-311/src/ext
  creating build/temp.linux-aarch64-cpython-311/src/ext/uchardet
  creating build/temp.linux-aarch64-cpython-311/src/ext/uchardet/src
  creating build/temp.linux-aarch64-cpython-311/src/ext/uchardet/src/LangModels
  aarch64-linux-gnu-gcc -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -Isrc/ext/uchardet/src -I/tmp/tmp2edcn7ai/.venv/include -I/usr/include/python3.11 -c src/cchardet/_cchardet.cpp -o build/temp.linux-aarch64-cpython-311/src/cchardet/_cchardet.o
  src/cchardet/_cchardet.cpp:4:10: fatal error: Python.h: No such file or directory
      4 | #include "Python.h"
        |          ^~~~~~~~~~
  compilation terminated.
  error: command '/usr/bin/aarch64-linux-gnu-gcc' failed with exit code 1

  at ~/.local/share/pypoetry/venv/lib/python3.11/site-packages/poetry/installation/chef.py:147 in _prepare
      143│ 
      144│                 error = ChefBuildError("\n\n".join(message_parts))
      145│ 
      146│             if error is not None:
    → 147│                 raise error from None
      148│ 
      149│             return path
      150│ 
      151│     def _prepare_sdist(self, archive: Path, destination: Path | None = None) -> Path:

Note: This error originates from the build backend, and is likely not a problem with poetry but with cchardet (2.1.7) not supporting PEP 517 builds. You can verify this by running 'pip wheel --use-pep517 "cchardet (==2.1.7)"'.

  • Installing click (8.1.3)
  • Installing decorator (5.1.1)
  • Installing extruct (0.14.0)
  • Installing loguru (0.7.0)
  • Installing msgpack (1.0.5)
  • Installing mypy-extensions (1.0.0)
  • Installing pathspec (0.11.1)
  • Installing platformdirs (3.5.1)
  • Installing pytest (7.3.1)
  • Installing python-dateutil (2.8.2)
  • Installing scrapy (2.9.0)

As suggested in the output, I tried pip wheel --use-pep517 "cchardet (==2.1.7)" and got:

Collecting cchardet==2.1.7
  Downloading cchardet-2.1.7.tar.gz (653 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 653.6/653.6 kB 13.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: cchardet
  Building wheel for cchardet (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for cchardet (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [23 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-aarch64-cpython-311
      creating build/lib.linux-aarch64-cpython-311/cchardet
      copying src/cchardet/version.py -> build/lib.linux-aarch64-cpython-311/cchardet
      copying src/cchardet/__init__.py -> build/lib.linux-aarch64-cpython-311/cchardet
      running build_ext
      building 'cchardet._cchardet' extension
      creating build/temp.linux-aarch64-cpython-311
      creating build/temp.linux-aarch64-cpython-311/src
      creating build/temp.linux-aarch64-cpython-311/src/cchardet
      creating build/temp.linux-aarch64-cpython-311/src/ext
      creating build/temp.linux-aarch64-cpython-311/src/ext/uchardet
      creating build/temp.linux-aarch64-cpython-311/src/ext/uchardet/src
      creating build/temp.linux-aarch64-cpython-311/src/ext/uchardet/src/LangModels
      aarch64-linux-gnu-gcc -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -Isrc/ext/uchardet/src -I/usr/include/python3.11 -c src/cchardet/_cchardet.cpp -o build/temp.linux-aarch64-cpython-311/src/cchardet/_cchardet.o
      src/cchardet/_cchardet.cpp:4:10: fatal error: Python.h: No such file or directory
          4 | #include "Python.h"
            |          ^~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/aarch64-linux-gnu-gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for cchardet
Failed to build cchardet
ERROR: Failed to build one or more wheels

This is running on a fresh Debian 12 setup with Poetry 1.6.1 installed. While searching for the problem I found a suggestion to downgrade Poetry, so I tried poetry self update 1.4, but poetry install --with dev fails again on cchardet.

I also tried installing python3-dev without success. I don't know where to go from here. Any help would be appreciated.
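Both build logs stop at the same line: the compiler cannot find Python.h, the CPython development header, in the include directory it was given (-I/usr/include/python3.11). On Debian and Ubuntu that header normally comes from the python3-dev packages, so it may be worth confirming that the file really exists for the interpreter Poetry is using. A minimal check, assuming it is run with that same interpreter; python_header_path is a name invented for this sketch:

```python
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Location where this interpreter expects its C header Python.h."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


header = python_header_path()
if header.is_file():
    print(f"Found {header}")
else:
    print(f"Missing {header}: C extensions such as cchardet cannot be compiled")
```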

esaumell commented 10 months ago

We also tried the instructions on a fresh Debian 11 but when running poetry install . we got:

The currently activated Python version 3.9.2 is not supported by the project (^3.10).
Trying to find and use a compatible version. 

Poetry was unable to find a compatible version. If you have one, you can explicitly use it via the "env use" command.
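The message refers to the caret constraint ^3.10, which allows any version from 3.10 up to, but not including, 4.0. A simplified sketch of that comparison follows; it is not Poetry's implementation, the function names are hypothetical, and it only handles constraints whose major version is 1 or higher:

```python
def parse(version: str) -> tuple:
    """Turn '3.9.2' into (3, 9, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(version: str, constraint: str) -> bool:
    """Check version against a caret constraint such as '^3.10' (major >= 1 only)."""
    lower = parse(constraint.lstrip("^"))
    if lower[0] == 0:
        raise ValueError("this sketch does not handle 0.x caret constraints")
    upper = (lower[0] + 1,)  # the next major version is excluded
    return lower <= parse(version) < upper


print(satisfies_caret("3.9.2", "^3.10"))   # False
print(satisfies_caret("3.11.2", "^3.10"))  # True
```

This matches what the thread reports: Debian 11's Python 3.9.2 is rejected, while the 3.11 interpreter on Debian 12 is accepted.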

esaumell commented 10 months ago

Searching for a solution, I found that Ubuntu 22.04 comes with Python 3.10 preinstalled, so I gave it a try. After installing python3-dev, the command poetry install --with dev works without errors.

Granitosaurus commented 10 months ago

Hey @esaumell, thanks for opening an issue! I'm working on improving the setup notes to be more beginner-friendly, so thanks for the feedback; this should be addressed in the next few commits :)

Granitosaurus commented 9 months ago

Documentation and project files have been improved to address these issues :+1:

esaumell commented 9 months ago

~$ rm -fr scrapfly-scrapers

~$ git clone https://github.com/scrapfly/scrapfly-scrapers.git
Cloning into 'scrapfly-scrapers'...
remote: Enumerating objects: 618, done.
remote: Counting objects: 100% (236/236), done.
remote: Compressing objects: 100% (167/167), done.
remote: Total 618 (delta 136), reused 143 (delta 65), pack-reused 382
Receiving objects: 100% (618/618), 951.42 KiB | 8.06 MiB/s, done.
Resolving deltas: 100% (324/324), done.

~$ cd scrapfly-scrapers/bookingcom-scraper
~/scrapfly-scrapers/bookingcom-scraper$ poetry install .

No arguments expected for "install" command, got "."

esaumell commented 9 months ago

On step 3, ~/scrapfly-scrapers/bookingcom-scraper$ poetry run python run.py and ~/scrapfly-scrapers/bookingcom-scraper$ poetry install --with dev work fine. On step 4, ~/scrapfly-scrapers/bookingcom-scraper$ poetry run pytest test.py gives 4 deprecation warnings about invalid escape sequences, but I guess that's OK.
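Warnings of this kind usually come from a backslash sequence such as \d inside an ordinary string literal, and prefixing the literal with r (a raw string) avoids them. The thread does not show which strings in the project triggered the warnings, so the snippets below are hypothetical and only illustrate the mechanism:

```python
import warnings

# Hypothetical one-line source snippets, held in raw strings so the backslash survives.
plain_source = r'pattern = "\d+ reviews"'   # \d inside a normal string literal
raw_source = r'pattern = r"\d+ reviews"'    # same text inside a raw string literal


def escape_warnings(source: str) -> list:
    """Compile source and return the messages of any warnings the compiler raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [str(item.message) for item in caught]


print(escape_warnings(plain_source))  # at least one "invalid escape sequence" message
print(escape_warnings(raw_source))    # no warnings for the raw string
```

Depending on the Python version the category is DeprecationWarning or SyntaxWarning, but the raw-string fix is the same.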

$ poetry run pytest test.py -k test_hotel_scraping
$ poetry run pytest test.py -k test_search_scraping

These are also OK, but it would be nice to know what each of these test runs does.
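On the mechanics of the two commands: pytest's -k option runs only the tests whose names match the given expression, so each command runs a subset of test.py rather than the whole file. A much simplified sketch of that selection, assuming plain case-insensitive substring matching (the real option also accepts and, or and not, and matches more than function names). The two test names are the ones from the commands above; test.py may contain others:

```python
def select_by_keyword(test_names, keyword):
    """Keep the test names that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [name for name in test_names if needle in name.lower()]


tests = ["test_hotel_scraping", "test_search_scraping"]
print(select_by_keyword(tests, "test_hotel_scraping"))  # ['test_hotel_scraping']
print(select_by_keyword(tests, "scraping"))             # both names
```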