lopuhin opened 5 years ago
For clarity, you only need to extract all links from the front page.
^ This seems to be enough to reproduce it on my system:
$ python
Python 3.7.3 (default, Mar 27 2019, 09:23:15)
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from scurl.cgurl import urlparse
>>> df = pd.read_csv('data/scurl.csv')
>>> test = df.drop_duplicates()
>>> test.url.apply(lambda r: urlparse(r))
[... works fine...]
Name: url, Length: 2150750, dtype: object
>>> df.url.apply(lambda r: urlparse(r))
Segmentation fault: 11
That it works fine once I drop duplicates is somewhat intriguing; maybe the code is running out of memory or something (perhaps there's a memory leak in there somewhere?).
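One quick way to test the leak theory (just a sketch, not part of the report above; urls.txt is a hypothetical file with one URL per line) would be to parse the same URLs over and over and watch whether peak RSS keeps climbing between passes:

import resource

from scurl.cgurl import urlparse

# Hypothetical leak check: re-parse the same URLs repeatedly and see if
# peak memory keeps growing from one pass to the next.
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for i in range(20):
    for url in urls:
        urlparse(url)
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'pass {i}: max RSS {rss}')  # kilobytes on Linux, bytes on macOS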
If you have more memory than I do and it doesn't choke on your system as a result, you can probably use df.append() a few times to make the DataFrame large enough to segfault.
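Something along these lines should do it (a sketch of that doubling trick; pd.concat() stands in for the df.append() calls mentioned above, since df.append was removed in pandas 2.0):

import pandas as pd

from scurl.cgurl import urlparse

df = pd.read_csv('data/scurl.csv')

# Double the frame a few times; each pass doubles the number of rows.
for _ in range(3):
    df = pd.concat([df, df], ignore_index=True)

df.url.apply(urlparse)  # expected to segfault once the Series is large enough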
See https://github.com/scrapy/scurl/issues/58#issuecomment-513520254 and https://github.com/scrapy/scurl/issues/58#issuecomment-513583355
Also repeating it here:
To reproduce, run a broad crawl on this dataset and extract all links:
https://www.kaggle.com/cheedcheed/top1m
Use urljoin() and urlsplit() on each one.
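A minimal spider for that could look roughly like this (a sketch only: it assumes the Kaggle CSV is saved locally as top-1m.csv with one "rank,domain" row per site, and that scurl.cgurl exposes urljoin() and urlsplit() alongside the urlparse() shown above):

import csv

import scrapy
from scurl.cgurl import urljoin, urlsplit


class BroadSpider(scrapy.Spider):
    name = 'broad'

    def start_requests(self):
        # top-1m.csv: the Kaggle dataset linked above, one "rank,domain" per row.
        with open('top-1m.csv') as f:
            for rank, domain in csv.reader(f):
                yield scrapy.Request('http://' + domain, callback=self.parse)

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = urljoin(response.url, href)  # exercise scurl's urljoin
            urlsplit(url)                      # and urlsplit on every extracted link

Running it with scrapy runspider over even a slice of the list should exercise the same code paths as the broad crawl.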