pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

What data is used for the tutorial? #2854

Closed AllanLeanderRostockHansen closed 2 years ago

AllanLeanderRostockHansen commented 2 years ago

What data is used for the Python tutorial?

The congress dataset is imported from a dataset package (this section in the tutorial). While the data seems to be available in this repo, I'm not sure which of the available datasets to use.

Possible bug when reading a CSV from a web-address

In the issues for this repo, I've found a reference to the dataset at the URL below, but reading it from the web address using pl.read_csv results in an SSL error. Using requests.get to fetch the data works without hiccups:

This works

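import io

import polars as pl
import requests

# requests verifies TLS against its own CA bundle (certifi), not the system store
req = requests.get('https://theunitedstates.io/congress-legislators/legislators-current.csv')
req.raise_for_status()  # stop on an HTTP error instead of parsing an error page
# Wrap the downloaded text in a file-like object for pl.read_csv
filelike = io.StringIO(req.text)
df = pl.read_csv(filelike)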

This fails

Note that I'm using an Anaconda Python (3.9.10) distribution on Windows 10, which is known to have some SSL issues.

import polars as pl

df = pl.read_csv('https://theunitedstates.io/congress-legislators/legislators-current.csv')

And this is the error

---------------------------------------------------------------------------
SSLCertVerificationError                  Traceback (most recent call last)
File ~\Miniconda3\envs\py39\lib\urllib\request.py:1346, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1345 try:
-> 1346     h.request(req.get_method(), req.selector, req.data, headers,
   1347               encode_chunked=req.has_header('Transfer-encoding'))
   1348 except OSError as err: # timeout error

File ~\Miniconda3\envs\py39\lib\http\client.py:1285, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1284 """Send a complete request to the server."""
-> 1285 self._send_request(method, url, body, headers, encode_chunked)

File ~\Miniconda3\envs\py39\lib\http\client.py:1331, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1330     body = _encode(body, 'body')
-> 1331 self.endheaders(body, encode_chunked=encode_chunked)

File ~\Miniconda3\envs\py39\lib\http\client.py:1280, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1279     raise CannotSendHeader()
-> 1280 self._send_output(message_body, encode_chunked=encode_chunked)

File ~\Miniconda3\envs\py39\lib\http\client.py:1040, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1039 del self._buffer[:]
-> 1040 self.send(msg)
   1042 if message_body is not None:
   1043 
   1044     # create a consistent interface to message_body

File ~\Miniconda3\envs\py39\lib\http\client.py:980, in HTTPConnection.send(self, data)
    979 if self.auto_open:
--> 980     self.connect()
    981 else:

File ~\Miniconda3\envs\py39\lib\http\client.py:1454, in HTTPSConnection.connect(self)
   1452     server_hostname = self.host
-> 1454 self.sock = self._context.wrap_socket(self.sock,
   1455                                       server_hostname=server_hostname)

File ~\Miniconda3\envs\py39\lib\ssl.py:500, in SSLContext.wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    494 def wrap_socket(self, sock, server_side=False,
    495                 do_handshake_on_connect=True,
    496                 suppress_ragged_eofs=True,
    497                 server_hostname=None, session=None):
    498     # SSLSocket class handles server_hostname encoding before it calls
    499     # ctx._wrap_socket()
--> 500     return self.sslsocket_class._create(
    501         sock=sock,
    502         server_side=server_side,
    503         do_handshake_on_connect=do_handshake_on_connect,
    504         suppress_ragged_eofs=suppress_ragged_eofs,
    505         server_hostname=server_hostname,
    506         context=self,
    507         session=session
    508     )

File ~\Miniconda3\envs\py39\lib\ssl.py:1040, in SSLSocket._create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
   1039             raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
-> 1040         self.do_handshake()
   1041 except (OSError, ValueError):

File ~\Miniconda3\envs\py39\lib\ssl.py:1309, in SSLSocket.do_handshake(self, block)
   1308         self.settimeout(None)
-> 1309     self._sslobj.do_handshake()
   1310 finally:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
Input In [16], in <cell line: 2>()
      1 # df = pl.read_csv('legislators-current.csv')
----> 2 df = pl.read_csv("https://theunitedstates.io/congress-legislators/legislators-current.csv") 
      3 df.columns

File ~\src\polars-play\.venv\lib\site-packages\polars\io.py:395, in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, **kwargs)
    389         # Change new column names to current column names in dtype.
    390         dtypes = {
    391             new_to_current.get(column_name, column_name): column_dtype
    392             for column_name, column_dtype in dtypes.items()
    393         }
--> 395 with _prepare_file_arg(file, **storage_options) as data:
    396     df = DataFrame._read_csv(
    397         file=data,
    398         has_header=has_header,
   (...)
    417         row_count_offset=row_count_offset,
    418     )
    420 if new_columns:

File ~\src\polars-play\.venv\lib\site-packages\polars\io.py:120, in _prepare_file_arg(file, **kwargs)
    118         return fsspec.open(file, **kwargs)
    119     if file.startswith("http"):
--> 120         return _process_http_file(file)
    121 if isinstance(file, list) and bool(file) and all(isinstance(f, str) for f in file):
    122     if _WITH_FSSPEC:

File ~\src\polars-play\.venv\lib\site-packages\polars\io.py:60, in _process_http_file(path)
     59 def _process_http_file(path: str) -> BytesIO:
---> 60     with urlopen(path) as f:
     61         return BytesIO(f.read())

File ~\Miniconda3\envs\py39\lib\urllib\request.py:214, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    212 else:
    213     opener = _opener
--> 214 return opener.open(url, data, timeout)

File ~\Miniconda3\envs\py39\lib\urllib\request.py:517, in OpenerDirector.open(self, fullurl, data, timeout)
    514     req = meth(req)
    516 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 517 response = self._open(req, data)
    519 # post-process response
    520 meth_name = protocol+"_response"

File ~\Miniconda3\envs\py39\lib\urllib\request.py:534, in OpenerDirector._open(self, req, data)
    531     return result
    533 protocol = req.type
--> 534 result = self._call_chain(self.handle_open, protocol, protocol +
    535                           '_open', req)
    536 if result:
    537     return result

File ~\Miniconda3\envs\py39\lib\urllib\request.py:494, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    492 for handler in handlers:
    493     func = getattr(handler, meth_name)
--> 494     result = func(*args)
    495     if result is not None:
    496         return result

File ~\Miniconda3\envs\py39\lib\urllib\request.py:1389, in HTTPSHandler.https_open(self, req)
   1388 def https_open(self, req):
-> 1389     return self.do_open(http.client.HTTPSConnection, req,
   1390         context=self._context, check_hostname=self._check_hostname)

File ~\Miniconda3\envs\py39\lib\urllib\request.py:1349, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1346         h.request(req.get_method(), req.selector, req.data, headers,
   1347                   encode_chunked=req.has_header('Transfer-encoding'))
   1348     except OSError as err: # timeout error
-> 1349         raise URLError(err)
   1350     r = h.getresponse()
   1351 except:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>

ghuls commented 2 years ago

It looks like some of your root certificates are too old. Can you try updating them with conda install ca-certificates?

Or install the certificates manually: https://medium.com/@codedigger/api-call-over-internet-fails-with-certificate-expired-error-from-09-30-2021-windows-838d3a793e9f
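
Before changing anything, it can help to see which CA store the interpreter actually uses. The following is a minimal sketch using only the standard library ssl module; describe_trust_store is a hypothetical helper name, not part of polars. On systems that keep certificates in a directory (capath), the loaded count can be 0 because those files are only read on demand.

```python
import ssl


def describe_trust_store():
    """Report where this interpreter looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    # create_default_context() loads the default CA certificates for this install.
    stats = ssl.create_default_context().cert_store_stats()
    return {
        "openssl_version": ssl.OPENSSL_VERSION,
        "cafile": paths.cafile,  # CA bundle file in use, or None
        "capath": paths.capath,  # directory of CA files, or None
        "loaded_ca_certs": stats["x509_ca"],
    }


for key, value in describe_trust_store().items():
    print(f"{key}: {value}")
```

If cafile points into the conda environment, updating the ca-certificates package as suggested above should refresh that file.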

zundertj commented 2 years ago

The source code can be found here: https://github.com/pola-rs/polars-book/tree/master/user_guide/src/examples/groupby_dsl, which defines the dataset here: https://github.com/pola-rs/polars-book/blob/master/user_guide/src/examples/groupby_dsl/dataset.py. So the data being pulled is at this URL: https://theunitedstates.io/congress-legislators/legislators-historical.csv.

frankvgompel commented 2 years ago

I get a similar error with:

df = pl.read_csv( "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv")

And this message:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1454, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 512, in wrap_socket
    return self.sslsocket_class._create(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1070, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1341, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3369, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-101-c42ee15dec88>", line 1, in <cell line: 1>
    df = pl.read_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/polars/io.py", line 397, in read_csv
    with _prepare_file_arg(file, **storage_options) as data:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/polars/io.py", line 120, in _prepare_file_arg
    return _process_http_file(file)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/polars/io.py", line 60, in _process_http_file
    with urlopen(path) as f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

ritchie46 commented 2 years ago

Have you tried https://github.com/pola-rs/polars/issues/2854#issuecomment-1061749335?

frankvgompel commented 2 years ago

No, that is obviously for Windows; besides, this MacBook is only a week old.

zundertj commented 2 years ago

Please note that although the solution (https://github.com/pola-rs/polars/issues/2854#issuecomment-1061749335) has Windows-specific instructions, it is the same SSL certificate issue, as the last line of your stack trace says:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

There may have been an issue with the ssl certificate of the website at the time. I cannot verify this, as the current certificate (from Let's Encrypt) is valid from Sun, 10 Apr 2022 02:01:10 GMT onwards, and is currently valid on my W10 machine.
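
For anyone who wants to check a validity window like this themselves, here is a rough sketch using only the standard library. The function names are hypothetical, and the notAfter value in the sample is an assumed date for illustration only; just the notBefore value comes from the certificate mentioned above. fetch_peer_cert needs network access and raises SSLCertVerificationError itself when the local store rejects the certificate.

```python
import socket
import ssl
from datetime import datetime, timezone


def validity_window(cert):
    """Convert the notBefore/notAfter fields of a getpeercert() dict to UTC datetimes."""
    start = ssl.cert_time_to_seconds(cert["notBefore"])
    end = ssl.cert_time_to_seconds(cert["notAfter"])
    return (
        datetime.fromtimestamp(start, tz=timezone.utc),
        datetime.fromtimestamp(end, tz=timezone.utc),
    )


def is_valid_at(cert, moment=None):
    """Return True if `moment` (default: now) falls inside the validity window."""
    moment = moment or datetime.now(timezone.utc)
    start, end = validity_window(cert)
    return start <= moment <= end


def fetch_peer_cert(host, port=443, timeout=10):
    """Fetch the server certificate as a dict. Requires network access."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


# Sample in the format getpeercert() returns; notAfter is an ASSUMED value.
sample_cert = {
    "notBefore": "Apr 10 02:01:10 2022 GMT",
    "notAfter": "Jul 10 02:01:09 2022 GMT",
}
print(validity_window(sample_cert))
```

Against a live host this would be used as is_valid_at(fetch_peer_cert("theunitedstates.io")).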

Given that no update has been posted for a month, I am going to assume all has been resolved. Feel free to open a new issue if you run into any issues that cannot be resolved by updating the certificates.

braaannigan commented 11 months ago

I hit this issue recently on a MacBook and resolved it with

/Applications/Python\ 3.11/Install\ Certificates.command

where you might need to tab-complete to get your Python version
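
If changing the Python installation is not an option, another route is to download the bytes yourself with an explicit CA bundle and hand them to polars. This is only a sketch under some assumptions: certifi is a third-party package (it is installed alongside requests), fetch_bytes and read_csv_from_url are hypothetical helper names, and polars is imported lazily so the download helper works without it.

```python
import io
import ssl
from urllib.request import urlopen

try:
    import certifi  # third-party; assumed to be installed

    CA_BUNDLE = certifi.where()
except ImportError:
    CA_BUNDLE = None  # fall back to the interpreter's default trust store


def fetch_bytes(url, cafile=CA_BUNDLE, timeout=30):
    """Download `url`, verifying TLS against `cafile` (or the defaults if None)."""
    context = ssl.create_default_context(cafile=cafile)
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.read()


def read_csv_from_url(url):
    """Download a CSV and parse it with polars from an in-memory buffer."""
    import polars as pl  # lazy import: only needed for parsing

    return pl.read_csv(io.BytesIO(fetch_bytes(url)))
```

Calling read_csv_from_url with the pokemon.csv gist URL from earlier in this thread would then bypass the urlopen call inside polars that raised the error.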