scrapy / scurl

Performance-focused replacement for Python urllib
Apache License 2.0

Segfault or encoding error when parsing a URL #59

Open lopuhin opened 5 years ago

lopuhin commented 5 years ago

See https://github.com/scrapy/scurl/issues/58#issuecomment-513520254 and https://github.com/scrapy/scurl/issues/58#issuecomment-513583355

Also repeating the traceback here:

Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11
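
For context on the first error (an illustration, not taken from the issue's data): 0xf0 opens a 4-byte UTF-8 sequence, so an href carrying it without valid continuation bytes fails to decode, e.g.:

>>> b"https://example.com/\xf0page".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 20: invalid continuation byte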

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then call urljoin() and urlsplit() on each link.
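
A minimal sketch of that second step (not from the issue: the column names follow record.url / record.href from the traceback above, the CSV path is illustrative, and the imports assume scurl's cgurl module exposes urlsplit alongside urljoin):

import csv
from scurl.cgurl import urljoin, urlsplit

with open('links.csv', newline='') as f:  # hypothetical crawl output: one row per extracted link
    for record in csv.DictReader(f):
        link = urljoin(record['url'], record['href'])
        urlsplit(link)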

ddebernardy commented 5 years ago

For clarity, you only need to extract the links from each site's front page.

ddebernardy commented 5 years ago

scurl.csv.zip

^ Seems to be enough to reproduce on my system.

$ python
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from scurl.cgurl import urlparse
>>> df = pd.read_csv('data/scurl.csv')
>>> test = df.drop_duplicates()
>>> test.url.apply(lambda r: urlparse(r))
[... works fine...]
Name: url, Length: 2150750, dtype: object
>>> df.url.apply(lambda r: urlparse(r))
Segmentation fault: 11

ddebernardy commented 5 years ago

That it works fine when I drop duplicates is somewhat intriguing. Maybe the code is running out of memory, or perhaps there's a memory leak in there somewhere?
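
One way to probe the leak hypothesis (a sketch, not from the issue; the URL list is a stand-in for the CSV data, and ru_maxrss is reported in bytes on macOS, kilobytes on Linux):

import resource
from scurl.cgurl import urlparse

urls = ['https://example.com/%d' % i for i in range(100000)]  # placeholder data
for pass_no in range(10):
    for u in urls:
        urlparse(u)
    # peak RSS should level off across passes if urlparse isn't leaking
    print(pass_no, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)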

If you have more memory than I do and it doesn't choke on your system as a result, you can probably use df.append() a few times to make the frame large enough to segfault.
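
A sketch of that inflation step (not from the issue; pd.concat stands in for df.append(), which newer pandas versions removed, and the multiplier of 4 is arbitrary):

import pandas as pd
from scurl.cgurl import urlparse

df = pd.read_csv('data/scurl.csv')
big = pd.concat([df] * 4, ignore_index=True)  # duplicate the rows to grow the frame
big.url.apply(lambda r: urlparse(r))  # expected to segfault once the frame is large enough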