Open ddebernardy opened 5 years ago
This may be because that subfolder lies within a git submodule.
Did you run git submodule init?
@nyov: not that I can recollect - certainly not if this wasn't in the installation instructions...
This did the trick before running pip install -r requirements.txt:
git submodule init
git submodule update --init --recursive
But then it segfaults on one of the urls in my dataset. Oh well...
Traceback (most recent call last):
File "./bin/triage_links", line 34, in get_url_parts
link = urljoin(record.url, record.href)
File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./bin/triage_links", line 102, in <module>
main()
File "./bin/triage_links", line 13, in main
CSVPipeline(callback=process).execute()
File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
self.save_csv()
File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
df = df.compute()
File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11
(The offending strings are buried in a file with millions of entries, so I'm afraid I can't locate them easily, but the UTF-8 encoding error is hopefully a good enough hint as to what the issue is.)
Glad you managed to figure it out, I forgot about the "update init". The error sucks, but if you think it's a bug, it should be a new ticket.
You could throw in some logging, to get the position in the file (dump a part of the raw bytes string of the line, to grep for, or something).
(I haven't actually used scurl, or I might be able to help you with that error. But it looks obvious: wrong encoding on some of your text ➡ mojibake.)
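For what it's worth, the logging idea could look roughly like the sketch below. It is not taken from the crawler code: the helper name find_invalid_utf8 and the sample data are made up for illustration, and it assumes the input can be read line by line as raw bytes. It reports the line number, the byte position of the first bad byte, and a slice of the raw bytes to grep for.

```python
def find_invalid_utf8(lines, context=20):
    """Yield (line_number, byte_position, snippet) for lines that are not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in "rb" mode.
    This is a hypothetical helper, not part of scurl or the crawler.
    """
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Keep a few raw bytes around the failure so the line can be grepped for.
            start = max(exc.start - context, 0)
            yield number, exc.start, raw[start:exc.end + context]


# Illustrative in-memory data; a real run would iterate over open(path, "rb").
sample = [
    b"http://example.com/ok\n",
    b"http://example.com/\xf0bad\n",
]

for number, position, snippet in find_invalid_utf8(sample):
    print(f"line {number}, byte {position}: {snippet!r}")
```

Only the first invalid byte per line is reported, which should be enough to find the row and pull it out as a small test case.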
No no, that's totally a bug in the library; not the data. The library is supposed to join urls from out there in the wild (this being part of scrapy), so it cannot possibly expect valid data, let alone segfault when it encounters anything wrong.
To reproduce, run a broad crawl on this dataset and extract all links:
https://www.kaggle.com/cheedcheed/top1m
Use urljoin() and urlsplit() on each one.
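A minimal sketch of that loop, for anyone trying to narrow it down. It uses the standard library's urllib.parse as a stand-in; swapping the import for scurl's urljoin and urlsplit is an assumption based on the names in the traceback above. The function name check_links and the sample pairs are hypothetical.

```python
from urllib.parse import urljoin, urlsplit  # stand-in; swap for scurl to reproduce


def check_links(pairs):
    """Run urljoin() and urlsplit() on each (base, href) pair.

    Returns a list of (base, href, error) tuples for inputs that raised,
    so the offending values can be turned into a small test case.
    """
    failures = []
    for base, href in pairs:
        try:
            joined = urljoin(base, href)
            urlsplit(joined)
        except Exception as exc:
            failures.append((base, href, repr(exc)))
    return failures


# Illustrative inputs only; a real run would feed links extracted from the crawl.
pairs = [
    ("http://example.com/a/", "b"),
    ("http://example.com/", b"\xf0"),  # mixing str and bytes raises TypeError in urllib
]

for base, href, error in check_links(pairs):
    print(f"failed on base={base!r} href={href!r}: {error}")
```

A try/except only catches Python exceptions, not the segfault itself. Running the script with python -X faulthandler should at least print the Python traceback at the point of the crash.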
Thanks for the reports @ddebernardy, and for the help @nyov. I created a separate issue to track the segfault/encoding issue.
I was following the install instructions from the README (macOS 10.14.5).
There was one warning about
... which I ignored. And then this failed:
The offending folder exists but is empty.