scrapy / scurl

Performance-focused replacement for Python urllib
Apache License 2.0

Installation instructions are wrong #58

Open · ddebernardy opened this issue 5 years ago

ddebernardy commented 5 years ago

I was following the install instructions from the README (macOS 10.14.5).

There was one warning about

s3fs 0.2.1 has requirement six>=1.12.0, but you'll have six 1.11.0 which is incompatible.

... which I ignored. And then this failed:

[...]
$ make build_ext
python setup.py build_ext --inplace
Compiling scurl/cgurl.pyx because it changed.
Compiling scurl/canonicalize.pyx because it changed.
[1/2] Cythonizing scurl/canonicalize.pyx
[2/2] Cythonizing scurl/cgurl.pyx
running build_ext
building 'scurl.cgurl' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
creating build/temp.macosx-10.14-x86_64-3.7/scurl
creating build/temp.macosx-10.14-x86_64-3.7/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/strings
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/third_party/icu
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url/third_party/mozilla
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I. -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scurl/cgurl.cpp -o build/temp.macosx-10.14-x86_64-3.7/scurl/cgurl.o -std=gnu++14 -I./third_party/chromium/ -fPIC -Ofast -pthread -w -DU_COMMON_IMPLEMENTATION
scurl/cgurl.cpp:638:10: fatal error: 
      '../third_party/chromium/url/third_party/mozilla/url_parse.h' file not
      found
#include "../third_party/chromium/url/third_party/mozilla/url_parse.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
make: *** [build_ext] Error 1

The offending folder exists but is empty.

nyov commented 5 years ago

This may be because that subfolder lies within a git submodule. Did you run git submodule init?
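
A quick way to check, for anyone landing here (a generic git command, not quoted in the thread):

$ git submodule status

Submodules that were never initialized are listed with a leading "-", which would explain the empty folder.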

ddebernardy commented 5 years ago

@nyov: not that I can recollect, and certainly not when it wasn't in the installation instructions...

ddebernardy commented 5 years ago

This did the trick before running pip install -r requirements.txt:

git submodule init
git submodule update --init --recursive
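
As an aside (not from the thread), git submodule update --init --recursive already performs the init step, so the first line is harmless but redundant. Cloning with submodules in one go also avoids the problem entirely:

git clone --recurse-submodules https://github.com/scrapy/scurl.git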

ddebernardy commented 5 years ago

But then it segfaults on one of the urls in my dataset. Oh well...

Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11

(The offending strings are buried in a file with millions of entries, so I'm afraid I can't locate them easily, but the UTF-8 encoding error is hopefully a good enough hint as to what the issue is.)

nyov commented 5 years ago

Glad you managed to figure it out; I forgot about the "update --init" step. The error sucks, but if you think it's a bug, it should be a new ticket.

You could throw in some logging to get the position in the file (dump part of the raw byte string of the offending line, something to grep for).
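
A minimal sketch of that kind of logging, reusing the record and urljoin names from the traceback above (the except/log wrapper itself is hypothetical):

import logging

try:
    link = urljoin(record.url, record.href)
except UnicodeDecodeError:
    # Dump the raw inputs so the offending row can be grepped for later.
    logging.error("urljoin failed for url=%r href=%r", record.url, record.href)
    raise

This only catches the UnicodeDecodeError; the segfault kills the process before Python can log anything, so to pin that down you would log the inputs before the call instead.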

(I haven't actually used scurl, or I might be able to help you with that error. But it looks obvious: wrong encoding on some of your text ➡ mojibake.)

ddebernardy commented 5 years ago

No no, that's totally a bug in the library, not the data. The library is supposed to join URLs from out in the wild (this being part of scrapy), so it cannot possibly expect valid data, let alone segfault when it encounters anything invalid.

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then use urljoin() and urlsplit() on each one.
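
A rough sketch of that reproduction loop, assuming the crawled links were already extracted into a two-column links.csv of (base URL, href) pairs; the file name and layout are hypothetical, and the scurl.cgurl import path is taken from the traceback above:

import csv

from scurl.cgurl import urljoin, urlsplit

with open("links.csv", newline="") as f:
    for base, href in csv.reader(f):
        # Some wild URLs raise UnicodeDecodeError here, and per the
        # report above at least one input segfaults the process.
        urlsplit(urljoin(base, href))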

lopuhin commented 5 years ago

Thanks for the reports @ddebernardy and for the help @nyov. I created a separate issue to track the segfault/encoding issue.