scrapy / scrapely

A pure-python HTML screen-scraping library

ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long' #102

Open aceri opened 7 years ago

aceri commented 7 years ago

Hi, I am having the following problem. I am not sure whether I am following the right steps. This is the repro. Regards,


--------------------------------------
root
--------------------------------------
root@tex:/home/scraper# python --version
Python 3.4.3+
root@tex:/home/scraper# virtualenv venv_scrapely
Using base prefix '/usr'
New python executable in /home/scraper/venv_scrapely/bin/python3
Also creating executable in /home/scraper/venv_scrapely/bin/python
Installing setuptools, pip, wheel...done.
root@tex:/home/scraper# ls -lrt
total 4
drwxr-xr-x 5 root root 4096 Feb  6 18:23 venv_scrapely
root@tex:/home/scraper# source ./venv_scrapely/bin/activate
(venv_scrapely) root@tex:/home/scraper# pip install scrapely
Collecting scrapely
Collecting w3lib (from scrapely)
  Using cached w3lib-1.16.0-py2.py3-none-any.whl
Collecting numpy (from scrapely)
  Using cached numpy-1.12.0-cp34-cp34m-manylinux1_i686.whl
Requirement already satisfied: six in ./venv_scrapely/lib/python3.4/site-packages (from scrapely)
Installing collected packages: w3lib, numpy, scrapely
Successfully installed numpy-1.12.0 scrapely-0.13.3 w3lib-1.16.0
(venv_scrapely) root@tex:/home/scraper#
(venv_scrapely) root@tex:/home/scraper# pip list
(1.4.0)
numpy (1.12.0)
packaging (16.8)
pip (9.0.1)
pyparsing (2.1.10)
scrapely (0.13.3)
setuptools (34.1.1)
six (1.10.0)
w3lib (1.16.0)
wheel (0.29.0)
------------------------
with user scraper
------------------------
scraper@tex:$ source ./venv_scrapely/bin/activate
(venv_scrapely) scraper@tex:~$ python --version
Python 3.4.3+
(venv_scrapely) scraper@tex:~$ python
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapely import Scraper
>>> s=Scraper()
>>> url1='https://github.com/ripple/rippled'
>>> data={'name':'ripple/rippled','commits':'11,292','releases':'66','contributors':'56'}
>>> s.train(url1,data)
>>> url2='https://github.com/scrapy/scrapely/'
>>> s.scrape(url2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 53, in scrape
    return self.scrape_page(page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 59, in scrape_page
    return self._ex.extract(page)[0]
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/__init__.py", line 119, in extract
    extracted = extraction_tree.extract(extraction_page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 575, in extract
    items.extend(extractor.extract(page, start_index, end_index, self.template.ignored_regions))
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 351, in extract
    _, _, attributes = self._doextract(page, extractors, start_index, end_index, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 396, in _doextract
    labelled, start_index, end_index_exclusive, self.best_match, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 148, in similar_region
    data_length - range_end, data_length - range_start)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 85, in longest_unique_subsequence
    matches = naive_match_length(to_search, subsequence, range_start, range_end)
  File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
  File "scrapely/extraction/_similarity.pyx", line 87, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2802)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

aschi2 commented 7 years ago

I got the same error running the example code:

from scrapely import Scraper

s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)
url2 = 'http://pypi.python.org/pypi/Django/1.3'
s.scrape(url2)

Gives me the same error.

ruairif commented 7 years ago

@aceri, @aschi2 I'm unable to replicate the issue. I guess both of you are using 32-bit systems and that is causing the problem. If you can confirm that you are on 32-bit systems, I can add a fallback that just uses the Python implementation there.
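
A minimal sketch of what such a fallback could look like, assuming the choice is made by interpreter bitness. The names `pick_implementation`, `compiled_stub`, and `python_stub` are hypothetical stand-ins and are not part of scrapely:

```python
import struct


def is_32bit_python():
    """Return True when the running interpreter uses 32-bit pointers."""
    return struct.calcsize("P") * 8 == 32


def pick_implementation(compiled_impl, python_impl, force_python=None):
    """Choose between a compiled routine and its pure-Python equivalent.

    compiled_impl may be None when the extension module is unavailable.
    force_python overrides the platform check, which is handy for testing.
    """
    use_python = is_32bit_python() if force_python is None else force_python
    if use_python or compiled_impl is None:
        return python_impl
    return compiled_impl


# Hypothetical stand-ins for the two implementations (assumption, not scrapely code).
def compiled_stub(seq):
    return ("compiled", len(seq))


def python_stub(seq):
    return ("python", len(seq))


match_length = pick_implementation(compiled_stub, python_stub)
print(match_length([1, 2, 3])[0])
```

Selecting the implementation once at import time keeps the hot path free of per-call checks. Whether pointer size is the right criterion is itself an assumption of this sketch.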

aschi2 commented 7 years ago

I am using a 64bit system and 64bit Python 2.7.

pavelmalai commented 7 years ago

I get the exact same error, 64 bit system.

hackrush01 commented 7 years ago

I can't replicate the issue either. @ruairif, I have some doubts about the six library.

This is the code for finding the maxsize:

class X(object):

    def __len__(self):
        return 1 << 31
try:
    len(X())
except OverflowError:
    # 32-bit
    MAXSIZE = int((1 << 31) - 1)
else:
    # 64-bit
    MAXSIZE = int((1 << 63) - 1)
del X

In my opinion, the return value in def __len__(self) should be 1 << 63.

If this is valid, could it be a source of the problem?
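
For reference, a small sketch of how this probe behaves, wrapped in a hypothetical helper `probe_maxsize` (not part of six) so the probe value can be varied. `len()` raises `OverflowError` when `__len__` returns a value that does not fit in the interpreter's index type, so `1 << 31` only overflows on a 32-bit build, while `1 << 63` overflows on 64-bit builds as well:

```python
import sys


def probe_maxsize(probe_value=1 << 31):
    """Guess the maximum container size the way the snippet above does.

    probe_value is what __len__ returns; len() raises OverflowError when
    that value does not fit in the interpreter's index type.
    """
    class X(object):
        def __len__(self):
            return probe_value

    try:
        len(X())
    except OverflowError:
        # The probe value did not fit: treat the build as 32-bit.
        return (1 << 31) - 1
    # The probe value fit: treat the build as 64-bit.
    return (1 << 63) - 1


print(probe_maxsize() == sys.maxsize)
print(probe_maxsize(1 << 63) == (1 << 31) - 1)
```

With a probe value of `1 << 63` the helper reports the 32-bit limit on every build, which suggests the existing `1 << 31` is the intended value.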

bhavsarpratik commented 7 years ago

I am also facing the same problem on Python 3.5 64bit Windows!

andreylisovskiy commented 6 years ago

I have the same issue on Python 2.7.11 (MSC v.1500 64 bit (AMD64)) on win32, in a virtual environment. No answers yet?

Navid61 commented 6 years ago

I have the same problem with Python 3.6.3 32-bit on Windows 10 Enterprise x64.

hiadore commented 6 years ago

I get the same problem on Python 2.7.13 64-bit, both system-wide and in a virtual environment, on Windows 10 Home.

indywidualny commented 6 years ago

The same (similar?) bug here. Python 2.7.14 as venv, MacOS High Sierra.

ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'double'

@ruairif It may be hard to reproduce because the bug is fairly rare: it occurred in 5 of 203 trials, roughly 2% of my tests.

bitblomster commented 6 years ago

I am getting this error consistently, regardless of input data. Even the small example on the front page of scrapely's GitHub repository, which illustrates how to scrape PyPI, fails with this error.

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] Windows 7, 64-bit.

numpy (1.14.0)
pip (9.0.1)
scrapely (0.13.4)
setuptools (28.8.0)
six (1.11.0)
w3lib (1.18.0)

hiadore commented 6 years ago

Hi @bitblomster, me too, but only on Windows. I have no issues with scrapely on Ubuntu.

But something interesting happened. I copied the scrapely folder from my Ubuntu Python environment (in site-packages) to Windows, into the same folder as my project that uses scrapely. All the issues are gone and scrapely works properly after this. @ruairif, might something be missing from scrapely on Windows?

dbenitog commented 6 years ago

I keep getting the same error in Windows whenever I try to scrape a website (using the API as well as using the command line):

Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)] on win32
[...]
  File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
    cpdef naive_match_length(sequence, pattern, int start=0, int end=-1):
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
    return np_naive_match_length(sequence, pattern, start, end)
  File "scrapely/extraction/_similarity.pyx", line 87, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2802)
    cdef np_naive_match_length(np.ndarray[np.int64_t, ndim=1] sequence,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

I've managed to try it on Ubuntu with another computer: it works, no issue found when scraping. I tried to copy the Ubuntu scrapely folder to Windows, as @hiadore suggested, but I'm still finding the same exact error. I have no clue!

ramedey commented 6 years ago

I also have exactly the same problem on Windows 10. Any workarounds?

pawelkmiec commented 6 years ago

@ramedey same issue here, but I'm having initial success with running scrapely with https://docs.microsoft.com/en-us/windows/wsl/about (example from readme works :) )

ronaldgreeff commented 6 years ago

I have the same issue. The problem lies with numpy (a scrapely dependency) and how it treats int differently on 32-bit and 64-bit Windows systems.
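
A quick standard-library check can show whether a given machine is exposed to this. It is a sketch that assumes the mismatch arises when the platform's C `long` is narrower than 64 bits, as it is on Windows even in 64-bit builds. The helper names `platform_int_widths` and `long_is_int64` are hypothetical:

```python
import ctypes
import struct


def platform_int_widths():
    """Report the widths, in bits, of the types involved in the error."""
    return {
        "pointer_bits": struct.calcsize("P") * 8,         # interpreter bitness
        "c_long_bits": ctypes.sizeof(ctypes.c_long) * 8,  # C 'long' on this platform
        "int64_bits": ctypes.sizeof(ctypes.c_int64) * 8,  # always 64
    }


def long_is_int64(widths=None):
    """Return True when C 'long' has the same width as int64_t."""
    widths = platform_int_widths() if widths is None else widths
    return widths["c_long_bits"] == widths["int64_bits"]


widths = platform_int_widths()
print(widths)
print("long matches int64_t:", long_is_int64(widths))
```

If this reports a 64-bit pointer size but a 32-bit `long`, a 64-bit Python can still hand a `long`-typed array to code that expects `int64_t`, which would be consistent with the reports from 64-bit Windows users in this thread.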

maximeboun commented 5 years ago

Any workarounds on this issue?