jupyterlab / jupyterlab

JupyterLab computational environment.
https://jupyterlab.readthedocs.io/

CSV parser skip lines and comment lines #4117

Open jzf2101 opened 6 years ago

jzf2101 commented 6 years ago

@jasongrout

[Screenshot, 2018-03-05 12:20:56]

FWIW, pandas says:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-3-52a6be6cb0f4> in <module>()
----> 1 binder = pd.read_csv('binder_analytics.csv')
      2 binder.head()

~/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

~/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    453 
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

~/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068 
-> 1069         ret = self._engine.read(nrows)
   1070 
   1071         if self.options.get('as_recarray'):

~/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 8

ian-r-rose commented 6 years ago

There are some comment lines at the start of the file that are confusing both parsers. Remove those and it works fine, as far as I can tell.

This is a consequence of CSV being an underspecified format. We could try to skip lines that start with one of a specified set of comment characters, but that is likely to be error-prone.

jasongrout commented 6 years ago

A lot of parsers out there have a setting for a comment line.

I agree that in the short term, you'll need to preprocess this file to get it to work. In the long term, we'll have a setting for a line comment character.

jzf2101 commented 6 years ago

Do we just want to close?

jasongrout commented 6 years ago

Perhaps we just convert this to a feature request for a comment character. We'll also need an option to skip empty lines (this has an empty line before the actual fields start).
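
As a rough illustration of those two options (not JupyterLab's actual parser), here is a minimal Python sketch of a line filter that drops comment lines and empty lines before the rows reach a CSV reader. The names `iter_data_lines` and `parse_csv` are hypothetical, and `#` as the comment character is an assumption; the thread does not say which character the file uses.

```python
import csv
import io


def iter_data_lines(lines, comment_char="#", skip_empty=True):
    """Yield only the lines that should reach the CSV reader.

    Both `comment_char` and `skip_empty` are illustrative settings,
    not options of any existing parser.
    """
    for line in lines:
        stripped = line.strip()
        if skip_empty and not stripped:
            continue  # drop blank lines
        if comment_char and stripped.startswith(comment_char):
            continue  # drop whole-line comments
        yield line


def parse_csv(text, comment_char="#", skip_empty=True):
    """Parse CSV text into a list of rows after filtering lines."""
    lines = io.StringIO(text, newline="")
    return list(csv.reader(iter_data_lines(lines, comment_char, skip_empty)))


# Made-up sample: two comment lines, one blank line, then the real fields.
sample = "# report\n# illustration only\n\nname,count\nalpha,1\nbeta,2\n"
rows = parse_csv(sample)
```

Because the filter runs line by line before tokenizing, it would also drop a line inside a quoted multi-line field if that line happened to start with the comment character. That is one concrete way this approach can be error-prone.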

jasongrout commented 6 years ago

(But these changes won't fix pandas. For pandas, you'll also need to either strip out those lines or figure out how to tell pandas about them.)
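
One way to tell pandas about such lines is the `comment` parameter of `pd.read_csv`, together with `skip_blank_lines` (both appear in the signature in the traceback above). The sketch below uses made-up sample data and assumes the comment lines start with `#`, which the thread does not confirm. The helper name `read_commented_csv` is hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for the reported file: comment lines, a blank line,
# then the real header and rows. The '#' prefix is an assumption.
sample = (
    "# exported report\n"
    "# illustration only\n"
    "\n"
    "date,page,sessions\n"
    "2018-03-01,/,10\n"
    "2018-03-02,/v2,12\n"
)


def read_commented_csv(source, comment_char="#"):
    """Read a CSV, ignoring lines that begin with `comment_char` and blank lines."""
    return pd.read_csv(source, comment=comment_char, skip_blank_lines=True)


frame = read_commented_csv(io.StringIO(sample))
```

Note that `comment` also truncates the rest of any data line in which the character appears, so this only suits files whose data never contains that character.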

jzf2101 commented 6 years ago

That's a pandas problem; I'm just thinking about our end.

jasongrout commented 6 years ago

Changing the issue title for the two items noted above.