modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

read_csv incorrect output with float data #2634

Open amyskov opened 3 years ago

amyskov commented 3 years ago

### System information

    import os

    # Configure Modin before it is imported.
    os.environ["MODIN_CPUS"] = "4"
    os.environ["MODIN_ENGINE"] = "ray"

    import pandas
    import modin.pandas as pd
    from modin.pandas.test.utils import df_equals
    import numpy as np
    import csv

    filename = "test_float.csv"
    float_precision = "round_trip"  # defined but not passed to read_csv below
    data_size = 5
    random_state = np.random.RandomState(seed=42)
    # One string row followed by float values written as text.
    data = ["col_name"] + random_state.uniform(
        low=0.0, high=10000.0, size=data_size
    ).astype(str).tolist()
    data = "\n".join(data)
    kwargs = {"filepath_or_buffer": filename, "header": None}

    try:
        with open(filename, "w") as f:
            f.write(data)

        df_pandas = pandas.read_csv(**kwargs)
        print("pandas.read_csv output:\n", df_pandas)
        df_pd = pd.read_csv(**kwargs)
        print("pd.read_csv output:\n", df_pd)
        df_equals(df_pandas, df_pd)
    finally:
        os.remove(filename)



### Describe the problem
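The problem occurs because partitions that contain only float data (all partitions except the first) are parsed as float values, while pandas reads the whole column as strings. The first row holds the string `col_name`, so pandas keeps every value as the original text. Modin's float-only partitions are converted to floats and back, and the result does not always match the text in the file (for example, `7319.939418114051` becomes `7319.9394181140515`).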
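
The dtype difference can be reproduced with pandas alone by parsing the float-only rows separately from the first row. This is only a minimal sketch: the two-way split below is a hypothetical stand-in for partitioning, not Modin's actual code, and the `float_precision="round_trip"` call is included just to show that the parser setting affects whether a parsed float maps back to the original text.

```python
import io

import pandas

# Same shape as the data in this report: one string row followed by floats.
text = "col_name\n3745.401188473625\n9507.143064099162\n7319.939418114051\n"

# Whole file parsed at once: the string row prevents numeric conversion,
# so every value is kept as the text that appeared in the file.
whole = pandas.read_csv(io.StringIO(text), header=None)

# Hypothetical two-way split (not Modin's actual partitioning code):
# the first row on its own, then the float-only remainder.
first, rest = text.split("\n", 1)
part0 = pandas.read_csv(io.StringIO(first + "\n"), header=None)
part1 = pandas.read_csv(io.StringIO(rest), header=None)

# Same float-only remainder, parsed with the round-trip float parser.
exact = pandas.read_csv(
    io.StringIO(rest), header=None, float_precision="round_trip"
)

print(whole[0].dtype)          # a non-numeric dtype
print(part1[0].dtype)          # float64
print(repr(whole[0].iloc[3]))  # the original text, as a string
print(repr(part1[0].iloc[2]))  # a parsed float
print(repr(exact[0].iloc[2]))  # a float parsed with round_trip
```

If the parts are later stitched together and converted to strings, the float-only part no longer carries the original text, which is consistent with the mismatch reported here.
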
### Source code / logs

    pandas.read_csv output:
                         0
    0            col_name
    1   3745.401188473625
    2   9507.143064099162
    3   7319.939418114051
    4   5986.584841970366
    5  1560.1864044243653
    UserWarning: Ray execution environment not yet initialized. Initializing...
    To remove this warning, run the following python code before doing dataframe operations:

        import ray
        ray.init()

    pd.read_csv output:
                        0
    0           col_name
    1  3745.401188473625
    2            9507.14
    3            7319.94
    4            5986.58
    5            1560.19
    Traceback (most recent call last):
      File "test.py", line 272, in <module>
        df_equals(df_pandas, df_pd)
      File "/modin/modin/pandas/test/utils.py", line 520, in df_equals
        assert_frame_equal(
      File "/miniconda3/envs/modin/lib/python3.8/site-packages/pandas/_testing.py", line 1611, in assert_frame_equal
        assert_series_equal(
      File "/miniconda3/envs/modin/lib/python3.8/site-packages/pandas/_testing.py", line 1394, in assert_series_equal
        _testing.assert_almost_equal(
      File "pandas/_libs/testing.pyx", line 67, in pandas._libs.testing.assert_almost_equal
      File "pandas/_libs/testing.pyx", line 182, in pandas._libs.testing.assert_almost_equal
      File "/miniconda3/envs/modin/lib/python3.8/site-packages/pandas/_testing.py", line 1036, in raise_assert_detail
        raise AssertionError(msg)
    AssertionError: DataFrame.iloc[:, 0] (column name="0") are different

    DataFrame.iloc[:, 0] (column name="0") values are different (66.66667 %)
    [index]: [0, 1, 2, 3, 4, 5]
    [left]:  [col_name, 3745.401188473625, 9507.143064099162, 7319.939418114051, 5986.584841970366, 1560.1864044243653]
    [right]: [col_name, 3745.401188473625, 9507.143064099162, 7319.9394181140515, 5986.584841970366, 1560.1864044243653]
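
To make the differing entry easier to spot, the `[left]` and `[right]` lists from the assertion message can be compared position by position. This is a small sketch that treats the printed values as plain strings, which is an assumption: the log does not show the element types on either side.

```python
# Values copied from the [left] and [right] lists in the assertion output,
# treated as strings (an assumption; the log does not show element types).
left = [
    "col_name", "3745.401188473625", "9507.143064099162",
    "7319.939418114051", "5986.584841970366", "1560.1864044243653",
]
right = [
    "col_name", "3745.401188473625", "9507.143064099162",
    "7319.9394181140515", "5986.584841970366", "1560.1864044243653",
]

# Collect every position where the printed text differs.
mismatches = [
    (i, l, r) for i, (l, r) in enumerate(zip(left, right)) if l != r
]

for index, expected, actual in mismatches:
    print(f"index {index}: pandas={expected} modin={actual}")
```

As text, only index 3 differs. The percentage in the assertion message is computed by pandas on the actual element values, so it need not match a purely textual comparison.
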

pyrito commented 2 years ago

I am able to replicate the error on the latest master.