great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Can't specify quoting method for CSV files #9956

Closed Laekda closed 1 month ago

Laekda commented 4 months ago

Describe the bug: Using a fluent datasource for CSV files, I can't specify the quoting method that was used to write the CSV file. The program raises a TypeError.

To Reproduce: Before I added the quoting element, everything worked, so I think any checkpoint linked to this datasource should reproduce the problem. In great_expectations.yml:

  fluent_datasources:
    datasource_name:
      type: pandas_filesystem
      assets:
        om:
          type: csv
          batching_regex: om_(?P<number>\d{2})\.csv
          quoting: 4

Error:

Traceback (most recent call last):
  File "C:\path\to\project\cmd\qdd.py", line 80, in <module>
    raise(e)
  File "C:\path\to\project\cmd\qdd.py", line 67, in <module>
    run_checkpoints(context_path)
  File "C:\path\to\project\.venv\Lib\site-packages\gx_management\great_expectations\__init__.py", line 69, in run_checkpoints
    raise e
  File "C:\path\to\project\.venv\Lib\site-packages\gx_management\great_expectations\__init__.py", line 61, in run_checkpoints
    res = context.run_checkpoint(checkpoint_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\core\usage_statistics\usage_statistics.py", line 266, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\data_context\data_context\abstract_data_context.py", line 2107, in run_checkpoint
    return self._run_checkpoint(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\data_context\data_context\abstract_data_context.py", line 2151, in _run_checkpoint
    result: CheckpointResult = checkpoint.run_with_runtime_args(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\checkpoint\checkpoint.py", line 915, in run_with_runtime_args
    return self.run(**checkpoint_run_arguments)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\core\usage_statistics\usage_statistics.py", line 266, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\checkpoint\checkpoint.py", line 306, in run
    self._run_validation(
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\checkpoint\checkpoint.py", line 481, in _run_validation
    validator: Validator = self._validator or self.data_context.get_validator(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\data_context\data_context\abstract_data_context.py", line 2374, in get_validator
    self.get_batch_list(
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\core\usage_statistics\usage_statistics.py", line 266, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\data_context\data_context\abstract_data_context.py", line 2545, in get_batch_list
    return self._get_batch_list(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\data_context\data_context\abstract_data_context.py", line 2626, in _get_batch_list
    return datasource.get_batch_list_from_batch_request(batch_request=result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\datasource\fluent\interfaces.py", line 474, in get_batch_list_from_batch_request
    return data_asset.get_batch_list_from_batch_request(batch_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\datasource\fluent\file_path_data_asset.py", line 275, in get_batch_list_from_batch_request
    batch_data, batch_markers = execution_engine.get_batch_data_and_markers(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\great_expectations\execution_engine\pandas_execution_engine.py", line 338, in get_batch_data_and_markers
    df = reader_fn(path, **reader_options)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\pandas\io\parsers\readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\path\to\project\.venv\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 463, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 604, in pandas._libs.parsers.TextReader._set_quoting
TypeError: bad "quoting" value

Expected behavior: A CSV column containing quoted digits should be read as a string datatype, not as an int or float datatype.

Environment (please complete the following information):

Additional context: Only string values are quoted in the file; some columns contain only digits but are expected to be read as strings.
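
To make the expected behavior concrete, here is a minimal sketch using only Python's standard `csv` module, not GX or pandas. The sample data and the `read_nonnumeric` helper are hypothetical. With `csv.QUOTE_NONNUMERIC`, the standard-library reader keeps quoted fields as `str` and converts unquoted fields to `float`, which is the distinction described above.

```python
import csv
import io

# Hypothetical sample: string fields (including digit-only codes) are quoted,
# numeric fields are not. The header is quoted too, because QUOTE_NONNUMERIC
# tries to convert every unquoted field to float.
SAMPLE = '"code","amount"\n"007",12.5\n"042",3\n'


def read_nonnumeric(text):
    """Parse CSV text, keeping quoted fields as str and unquoted ones as float."""
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONNUMERIC)
    return list(reader)


rows = read_nonnumeric(SAMPLE)
for row in rows:
    print(row)
```

This only illustrates the semantics of the standard-library reader. pandas may treat the same `quoting` value differently when it infers column types. `pandas.read_csv` also has a `dtype` parameter (for example `dtype={"code": str}`) that can force a column to text; whether that suits this asset configuration is not confirmed in the thread.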

Kilo59 commented 4 months ago

@Laekda All of the parameters that pandas.read_csv() accepts can be supplied for CSVAssets (the same holds for other asset types and their pandas readers), and it looks like GX is passing this one through to pandas. https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv

What version of pandas are you using? Can you read any of these CSVs with pandas directly using that quoting value?

https://github.com/great-expectations/great_expectations/blob/1e21c9b95f0eaed7511377ef54fc438f8482e063/great_expectations/datasource/fluent/pandas_filesystem_datasource.pyi#L104

Maybe that value is being read in as a string "4" instead of an int. 🤔
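
Both possibilities (wrong type, or a value pandas does not accept) can be checked without GX in the loop. The sketch below rests on an assumption drawn from the `_set_quoting` frame in the traceback, not on anything confirmed in this thread: that pandas' C parser only accepts integers from `csv.QUOTE_MINIMAL` (0) through `csv.QUOTE_NONE` (3). `diagnose_quoting` is a hypothetical helper name. Note that Python 3.12 added `csv.QUOTE_STRINGS`, whose value is 4, so the value in the config may be a valid `csv` constant that pandas still rejects.

```python
import csv

# Assumption: pandas' C parser accepts only QUOTE_MINIMAL (0) .. QUOTE_NONE (3).
PANDAS_MIN, PANDAS_MAX = csv.QUOTE_MINIMAL, csv.QUOTE_NONE


def diagnose_quoting(value):
    """Hypothetical helper: explain why a `quoting` value may be rejected."""
    if not isinstance(value, int):
        return f"{value!r} is a {type(value).__name__}, not an int"
    # Map integer values back to whatever csv.QUOTE_* names this Python defines.
    names = {getattr(csv, n): n for n in dir(csv) if n.startswith("QUOTE_")}
    label = names.get(value, "unnamed in this Python version")
    if PANDAS_MIN <= value <= PANDAS_MAX:
        return f"{value} ({label}) is inside the assumed accepted range"
    return f"{value} ({label}) is outside the assumed accepted range"


for candidate in (2, 4, "4"):
    print(diagnose_quoting(candidate))
```

If the assumption holds, `quoting: 4` would be rejected even when it arrives as a proper int, and the type question becomes secondary.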

molliemarie commented 1 month ago

Hello @Laekda. With the launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗