great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.7k stars 1.5k forks

No option to select encoding - UnicodeDecodeError #9998

Open IgorShcherbakov opened 1 month ago

IgorShcherbakov commented 1 month ago

UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 375: character maps to <undefined>

It is not possible to explicitly select the encoding! When files in the "expectations" folder are opened, for example test.json, the system default encoding is used. For example, with open("gx/expectations/test.json", "r") as file: ... the call falls back to encoding='cp1251' on my system, but I would like to use utf-8.
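
For illustration, the behaviour being requested looks roughly like the sketch below: `encoding="utf-8"` is passed to `open()` instead of relying on the platform default. The file name test.json comes from the report above. The suite content and the temporary directory are made up for the example, and this is not a claim about how Great Expectations itself reads the store.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical suite content with non-ASCII text; not taken from the report.
suite = {"expectation_suite_name": "тест", "expectations": []}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.json"

    # Write the file as UTF-8 regardless of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(suite, f, ensure_ascii=False)

    # Read it back with an explicit encoding instead of the locale default.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == suite)
```

With the encoding named on both the write and the read, the result no longer depends on the machine's locale.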

Kilo59 commented 1 month ago

@IgorShcherbakov I think utf-8 is the default encoding for Python. It's possible something in your environment is overriding this.

From this SO question, it looks like you can set an environment variable PYTHONIOENCODING.

export PYTHONIOENCODING=utf8

https://stackoverflow.com/a/27066059/6304433

Let me know if that addresses the issue.
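
One caveat worth checking: per the Python documentation, `PYTHONIOENCODING` only overrides the encoding of stdin, stdout and stderr. When `open()` is called without an `encoding` argument, the default comes from the locale, unless Python UTF-8 mode is enabled with `PYTHONUTF8=1` or `-X utf8`. A small diagnostic sketch that prints both settings side by side (the helper name `describe_text_defaults` is made up for this example):

```python
import locale
import os
import sys


def describe_text_defaults():
    """Collect the settings that influence text decoding (hypothetical helper)."""
    return {
        # Encoding of the standard streams; PYTHONIOENCODING affects this.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding open() uses when no encoding argument is passed.
        "open_default": locale.getpreferredencoding(False),
        # True when Python UTF-8 mode is on (PYTHONUTF8=1 or -X utf8).
        "utf8_mode": bool(sys.flags.utf8_mode),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "PYTHONUTF8": os.environ.get("PYTHONUTF8"),
    }


for name, value in describe_text_defaults().items():
    print(f"{name}: {value}")
```

If `open_default` reports something like cp1251 while `utf8_mode` is False, files opened without an explicit encoding will be decoded with that code page, whatever `PYTHONIOENCODING` says.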

IgorShcherbakov commented 1 month ago

@Kilo59 print(os.getenv("PYTHONIOENCODING")) returns UTF-8

"Exception type": "UnicodeDecodeError",
"Message": "'charmap' codec can't decode byte 0x98 in position 497: character maps to <undefined>"
"Details": "Traceback (most recent call last):
File \"C:\\...\\dwh-greatexpectations\\main.py\", line 188, in get_checkpoint_result    checkpoint_result = checkpoint.run(
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\core\\usage_statistics\\usage_statistics.py\", line 266, in usage_statistics_wrapped_method
result = func(*args, **kwargs)
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\checkpoint\\checkpoint.py\", line 305, in run
self._run_validation(
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\checkpoint\\checkpoint.py\", line 480, in _run_validation
validator: Validator = self._validator or self.data_context.get_validator(                                              
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\data_context\\data_context\\abstract_data_context.py\", line 2336, in get_validator
expectation_suite = self.get_expectation_suite(
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\data_context\\data_context\\abstract_data_context.py\", line 3022, in get_expectation_suite
dict, self.expectations_store.get(key)
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\data_context\\store\\expectations_store.py\", line 210, in get
return super().get(key)  # type: ignore[return-value]
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\data_context\\store\\store.py\", line 207, in get
value = self._store_backend.get(self.key_to_tuple(key))
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\data_context\\store\\_store_backend.py\", line 123, in get
value = self._get(key, **kwargs)  
File \"C:\\...\\dwh-greatexpectations\\venv\\Lib\\site-packages\\great_expectations\\data_context\\store\\tuple_store_backend.py\", line 321, in _get
contents: str = infile.read().rstrip(\"\\n\")
File \"C:\\Python311\\Lib\\encodings\\cp1251.py\", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]          
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 497: character maps to <undefined>"
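
The last frames show the read being decoded by encodings\cp1251.py, so the locale code page was in effect for that file. Byte 0x98 is unassigned in Python's cp1251 codec, yet it occurs in UTF-8 encoded Cyrillic text. The sketch below reproduces the same message under the assumption that the suite file contains UTF-8 Cyrillic text. That is a guess, since the file contents are not shown in the thread.

```python
# "И" (U+0418) is used as a sample character; it is an assumption,
# not something taken from the reporter's file.
text = "И"
raw = text.encode("utf-8")  # two bytes: 0xD0 0x98

error = None
try:
    # Decode the UTF-8 bytes with the Windows Cyrillic code page,
    # which is what happens when open() falls back to cp1251.
    raw.decode("cp1251")
except UnicodeDecodeError as exc:
    error = exc

print(raw.hex())
print(error)
```

Decoding the same bytes with `raw.decode("utf-8")` succeeds, which is consistent with the file being valid UTF-8 that was read with the wrong code page.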

Kilo59 commented 2 weeks ago

@IgorShcherbakov

AFAIK, there's nowhere in our codebase where we use something other than utf-8 for encoding. I think there's something going on with your particular environment.

If you can provide a minimal reproducible example, I'd be happy to run it and could debug from there.