modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

TEST: test_to_stata is flaky in comparing file contents #4716

Open mvashishtha opened 2 years ago

mvashishtha commented 2 years ago

modin/pandas/test/test_io.py::TestStata::test_to_stata failed in a run for test-compat-win (engine dask, python 3.6) on the assertion that the output files were exactly equal. A rerun with the same code passed.

As with #3706, I think we should check that the written dataframes are equal instead of checking that the file bytes are exactly the same.
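To make the proposal concrete, here is a minimal sketch of comparing the written dataframes rather than the raw bytes. The helper names `files_equal_bytes`, `assert_stata_frames_equal`, and `demo` are hypothetical and not part of Modin's test suite. The sketch also illustrates one way a byte-level check can fail on identical data: pandas' `to_stata` writes a minute-resolution timestamp into the file header (controllable through its `time_stamp` parameter). Whether that is what causes the flake in this issue is an assumption, not something the failing run confirms.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd
from pandas.testing import assert_frame_equal


def files_equal_bytes(path_a, path_b):
    """Byte-for-byte comparison, similar in spirit to the failing check."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()


def assert_stata_frames_equal(path_a, path_b):
    """Hypothetical helper: read both files back and compare the dataframes."""
    assert_frame_equal(pd.read_stata(path_a), pd.read_stata(path_b))


def demo():
    """Write the same frame twice and return whether the bytes match."""
    df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7]})
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "a.dta")
        path_b = os.path.join(tmp, "b.dta")
        # Assumption for illustration: force header timestamps one minute
        # apart to mimic two writes that straddle a minute boundary.
        df.to_stata(path_a, time_stamp=datetime(2022, 7, 26, 12, 0))
        df.to_stata(path_b, time_stamp=datetime(2022, 7, 26, 12, 1))
        same_bytes = files_equal_bytes(path_a, path_b)
        # Raises AssertionError if the round-tripped frames differ.
        assert_stata_frames_equal(path_a, path_b)
    return same_bytes
```

In this sketch the dataframe comparison passes even though the two files were written with different header timestamps, which is the property the test would need to be stable across reruns.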

mvashishtha commented 2 years ago

I was able to get 2 failures out of 10000 runs by running:

MODIN_ENGINE=dask pytest --count=10_000 modin/pandas/test/test_io.py::TestStata::test_to_stata

though pytest did not show the failure output for some reason. I then reran that command with the output redirected to a file and got 1 failure out of 10_000 runs, this time with the failure text:

```
=================================== FAILURES ===================================
_____________________ TestStata.test_to_stata[7013-10000] ______________________

self =

    def test_to_stata(self):
        modin_df, pandas_df = create_test_dfs(TEST_DATA)
        eval_to_file(
>           modin_obj=modin_df, pandas_obj=pandas_df, fn="to_stata", extension="stata"
        )

modin/pandas/test/test_io.py:2318:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

modin_obj =    col1  col2  col3  col4  col5
0     0     4     8    12     0
1     1     5     9    13     0
2     2     6    10    14     0
3     3     7    11    15     0
pandas_obj =    col1  col2  col3  col4  col5
0     0     4     8    12     0
1     1     5     9    13     0
2     2     6    10    14     0
3     3     7    11    15     0
fn = 'to_stata', extension = 'stata', fn_kwargs = {}
unique_filename_modin = '/Users/maheshvashishtha/software_sources/modin/modin/pandas/test/io_tests_data/6a200ca80c6b11edafe9acde48001122.stata'
unique_filename_pandas = '/Users/maheshvashishtha/software_sources/modin/modin/pandas/test/io_tests_data/6a200da20c6b11ed9c3cacde48001122.stata'
last_exception = None, _ = 0, @py_assert3 = False
@py_format5 = "assert False\n{False = assert_files_eq('/Users/maheshvashishtha/software_sources/modin/modin/pandas/test/io_tests_dat...rs/maheshvashishtha/software_sources/modin/modin/pandas/test/io_tests_data/6a200da20c6b11ed9c3cacde48001122.stata')\n}"

    def eval_to_file(modin_obj, pandas_obj, fn, extension, **fn_kwargs):
        """Helper function to test `to_` methods.

        Args:
            modin_obj: Modin DataFrame or Series to test `to_` method.
            pandas_obj: Pandas DataFrame or Series to test `to_` method.
            fn: name of the method, that should be tested.
            extension: Extension of the test file.
        """
        unique_filename_modin = get_unique_filename(extension=extension)
        unique_filename_pandas = get_unique_filename(extension=extension)

        try:
            # parameter `max_retries=0` is set for `to_csv` function on Ray engine,
            # in order to increase the stability of tests, we repeat the call of
            # the entire function manually
            last_exception = None
            for _ in range(3):
                try:
                    getattr(modin_obj, fn)(unique_filename_modin, **fn_kwargs)
                except EXCEPTIONS as exc:
                    last_exception = exc
                    continue
                break
            else:
                raise last_exception

            getattr(pandas_obj, fn)(unique_filename_pandas, **fn_kwargs)

>           assert assert_files_eq(unique_filename_modin, unique_filename_pandas)
E           AssertionError: assert False
E            +  where False = assert_files_eq('/Users/maheshvashishtha/software_sources/modin/modin/pandas/test/io_tests_data/6a200ca80c6b11edafe9acde48001122.stata', '/Users/maheshvashishtha/software_sources/modin/modin/pandas/test/io_tests_data/6a200da20c6b11ed9c3cacde48001122.stata')

modin/pandas/test/test_io.py:198: AssertionError

---------- coverage: platform darwin, python 3.6.8-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestStata::test_to_stata[7013-10000] - A...
========== 1 failed, 9999 passed, 30002 warnings in 381.91s (0:06:21) ==========
```

on: