Open paigem opened 3 years ago
Thanks for sharing Paige!
It's weird that Dask chokes on the file, since it is clearly using pandas under the hood! It actually seems like a Dask bug. I recommend raising a Dask issue. To do this, you will want to simplify your example even further into
import dask.dataframe as dd
url = "http://put the full url here"
df = dd.read_csv(url)
Thanks for your input @rabernat! I will make a dask issue about this now.
Yep, it's strange: I've tried downloading the file and opening it locally and it fails, yet it seems to look fine in Excel. I think it's something to do with return characters and escape sequences of quotes around that line.
Thanks for initially flagging this error @NickMortimer! We don't want our beginner-friendly tutorials to be broken!
Depending on how long it takes to fix this Dask bug, it might be worth making a PR with the pandas library fix for now. @NickMortimer - want to make a PR for your fix? 🙂
As Martin mentioned over in the upstream Dask issue (xref https://github.com/dask/dask/issues/7680#issuecomment-845285457), a quick fix for now is to pass sample=False to dask.dataframe.read_csv:
In [1]: import dask.dataframe as dd
   ...: url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
   ...: df = dd.read_csv(url, blocksize=None, sample=False)
In [2]: df
Out[2]:
Dask DataFrame Structure:
FID Volcano_Number Volcano_Name Primary_Volcano_Type Last_Eruption_Year Country Geological_Summary Region Subregion Latitude Longitude Elevation Tectonic_Setting Geologic_Epoch Evidence_Category Primary_Photo_Link Primary_Photo_Caption Primary_Photo_Credit Major_Rock_Type GeoLocation
npartitions=1
object int64 object object float64 object object object object float64 float64 int64 object object object object object object object object
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: read-csv, 1 tasks
Thanks @jrbourbeau! Good suggestion. This quick fix is cleaner than importing first through the pandas library.
I just forked the repo to prepare a pull request, and it all works in my environment on my local PC, so could this be a version issue in the Pangeo Binder session?
I have dask version=2.17.2 and pandas version=1.0.5 on my local machine and all is fine.
Hmm, locally I get the same pandas.errors.ParserError when using dask=2.17.2 and pandas=1.0.5. That is
import dask
import pandas as pd
import dask.dataframe as dd
print(f"{dask.__version__ = }")
print(f"{pd.__version__ = }")
url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = dd.read_csv(url, blocksize=None)
outputs
dask.__version__ = '2.17.2'
pd.__version__ = '1.0.5'
Traceback (most recent call last):
File "test.py", line 9, in <module>
df = dd.read_csv(url, blocksize=None)
File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 568, in read
return read_pandas(
File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 446, in read_pandas
head = reader(BytesIO(b_sample), **kwargs)
File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
For now, my guess is that adding sample=False might be the most robust quick fix.
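For anyone wondering why skipping the sample could matter: the traceback above shows Dask handing a byte sample (b_sample) to the pandas reader, and the error is "EOF inside string". One plausible reading is that the sample ends partway through a quoted field that contains line breaks. The sketch below illustrates that failure mode only. It uses Python's standard-library csv module in strict mode as a stand-in for the pandas tokenizer, and the data and the parses_cleanly helper are made up for illustration.

```python
import csv
import io


def parses_cleanly(text: str) -> bool:
    """Return True if `text` tokenizes as CSV with every quoted field closed."""
    try:
        # strict=True makes the stdlib reader raise csv.Error on malformed
        # input, including input that ends inside an open quoted field.
        for _ in csv.reader(io.StringIO(text, newline=""), strict=True):
            pass
    except csv.Error:
        return False
    return True


# Made-up data: the second row has a quoted field that spans two lines.
full_text = 'id,summary\n1,"first line\nsecond line"\n2,"plain"\n'

# A prefix of the data that stops in the middle of that quoted field,
# similar in spirit to a fixed-size byte sample taken from the start of a file.
sample = full_text[: full_text.index("second line")]

print(parses_cleanly(full_text))  # the complete text tokenizes
print(parses_cleanly(sample))     # the truncated prefix does not
```

If this is what happens with the volcano file, it would explain why the file can be valid as a whole while a parse of only its beginning fails.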
I've made a pull request for this: #14. I'm new to the whole pull request thing, so feedback is welcome...
The dask.ipynb notebook currently yields a ParserError when loading the volcano data. The line of code that breaks:
The error can be found below:
ParserError
--------------------------------------------------------------------------- ParserError Traceback (most recent call last)
This appears to be due to an irregularity in the data file itself. The data can be successfully loaded using the pandas library instead, as shown by @NickMortimer during a workshop at the Dask Distributed Summit. 🙂 If the above line of code is replaced with:
then the data loads just fine. So the above three lines of code are an easy fix, unless someone else has an idea how to load the data using dask.dataframe directly.