pangeo-data / pangeo-tutorial-gallery

Repo to house pangeo-tutorial notebooks for pangeo-gallery
MIT License
10 stars 13 forks

Loading data currently yields ParserError #11

Open paigem opened 3 years ago

paigem commented 3 years ago

The dask.ipynb notebook currently yields a ParserError when loading the volcano data. The line of code that breaks:

df = dd.read_csv(server+query, blocksize=None)

The error can be found below:

ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      6
      7 # blocksize=None means use a single partition
----> 8 df = dd.read_csv(server+query, blocksize=None)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    578         storage_options=storage_options,
    579         include_path_column=include_path_column,
--> 580         **kwargs,
    581     )
    582

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    444
    445     # Use sample to infer dtypes and check for presence of include_path_column
--> 446     head = reader(BytesIO(b_sample), **kwargs)
    447     if include_path_column and (include_path_column in head.columns):
    448         raise ValueError(

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675
--> 676         return _read(filepath_or_buffer, kwds)
    677
    678     parser_f.__name__ = name

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134
   1135         # May alter columns / col_dict

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: EOF inside string starting at row 172

This appears to be due to abnormal formatting in the data file itself. The data can be successfully loaded using the pandas library instead, as shown by @NickMortimer during a workshop at the Dask Distributed Summit. 🙂 If the above line of code is replaced with:

import pandas as pd
df = pd.read_csv(server + query)
df = dd.from_pandas(df, npartitions=1)

then the data loads just fine. So the above three lines of code are an easy fix, unless someone else has an idea how to load the data using dask.dataframe directly.

rabernat commented 3 years ago

Thanks for sharing Paige!

It's weird that Dask chokes on the file, since it is clearly using Pandas under the hood! It actually seems like a Dask bug. I recommend raising a Dask issue. To do this, you will want to simplify your example even further, into:

url = "http://put the full url here"
df = dd.read_csv(url)
paigem commented 3 years ago

Thanks for your input @rabernat! I will make a dask issue about this now.

NickMortimer commented 3 years ago

Yep, it's strange. I've tried to download the file and open it locally, and it fails, yet it seems to look fine in Excel. I think it's something to do with return characters and the escaping of quotes around that line.
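Nick's hunch about quoting can be reproduced with a tiny pandas-only example (the data string below is ours, not from the real file): a newline inside a quoted field parses fine when pandas sees the whole file, but raises exactly this ParserError when the byte stream is cut off mid-string, which is what Dask's sampling step effectively does.

```python
import io

import pandas as pd

# Minimal stand-in for the volcano CSV: one quoted field contains a newline.
data = 'id,summary\n1,"line one\nline two"\n2,"plain"\n'

# Parsing the complete file works: a newline inside quotes is legal CSV.
df = pd.read_csv(io.StringIO(data))
print(len(df))  # 2 rows

# Truncating the stream inside the quoted string reproduces the error.
try:
    pd.read_csv(io.StringIO(data[:20]))  # the cut lands inside "line o...
except pd.errors.ParserError as e:
    print(e)  # Error tokenizing data. C error: EOF inside string ...
```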


paigem commented 3 years ago

Thanks for initially flagging this error @NickMortimer! We don't want our beginner-friendly tutorials to be broken!

Depending on how long it takes to fix this Dask bug, it might be worth making a PR with the pandas library fix for now. @NickMortimer - want to make a PR for your fix? 🙂

jrbourbeau commented 3 years ago

As Martin mentioned over in the upstream Dask issue (xref https://github.com/dask/dask/issues/7680#issuecomment-845285457), a quick fix for now is to pass sample=False to dask.dataframe.read_csv:

In [1]: import dask.dataframe as dd
   ...: url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
   ...: df = dd.read_csv(url, blocksize=None, sample=False)

In [2]: df
Out[2]:
Dask DataFrame Structure:
                  FID Volcano_Number Volcano_Name Primary_Volcano_Type Last_Eruption_Year Country Geological_Summary  Region Subregion Latitude Longitude Elevation Tectonic_Setting Geologic_Epoch Evidence_Category Primary_Photo_Link Primary_Photo_Caption Primary_Photo_Credit Major_Rock_Type GeoLocation
npartitions=1
               object          int64       object               object            float64  object             object  object    object  float64   float64     int64           object         object            object             object                object               object          object      object
                  ...            ...          ...                  ...                ...     ...                ...     ...       ...      ...       ...       ...              ...            ...               ...                ...                   ...                  ...             ...         ...
Dask Name: read-csv, 1 tasks
paigem commented 3 years ago

Thanks @jrbourbeau! Good suggestion. This quick fix is cleaner than importing first through the pandas library.

NickMortimer commented 3 years ago

I just forked the repo to prepare a pull request, and it all works in my environment on my local PC, so could this be a version issue in the pangeo binder session?

With dask version=2.17.2 and pandas version=1.0.5 on my local machine, all is fine.

jrbourbeau commented 3 years ago

Hmm, locally I get the same pandas.errors.ParserError when using dask=2.17.2 and pandas=1.0.5. That is,

import dask
import pandas as pd
import dask.dataframe as dd

print(f"{dask.__version__ = }")
print(f"{pd.__version__ = }")

url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = dd.read_csv(url, blocksize=None)

outputs

dask.__version__ = '2.17.2'
pd.__version__ = '1.0.5'
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    df = dd.read_csv(url, blocksize=None)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 568, in read
    return read_pandas(
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 446, in read_pandas
    head = reader(BytesIO(b_sample), **kwargs)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 172

For now, my guess is that adding sample=False might be the most robust quick fix.

NickMortimer commented 3 years ago

I've made a pull request for this: #14. I'm new to the whole pull request thing, so feedback is welcome...