ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Access violation on windows 11 when running modin workload #30493

Open gshimansky opened 2 years ago

gshimansky commented 2 years ago

What happened + What you expected to happen

I strongly suspect that this bug is specific to Windows 11: before I upgraded my workstation to Windows 11, I could run the same workload on Windows 10 without problems. The problem reproduces on 40-core Intel Cascade Lake and 64-core AMD Threadripper systems, both running Windows 11 22H2, build 22621.819.

Most often I get either no exception at all or a stack trace truncated to just a few frames, but I got lucky once, and here is a full exception stack trace:

Traceback (most recent call last):
  File "C:\Users\gshim\Documents\work\data-science-processing-workload\launcher.py", line 188, in <module>
    main()
  File "C:\Users\gshim\Documents\work\data-science-processing-workload\launcher.py", line 184, in main
    benchmark.run()
  File "C:\Users\gshim\Documents\work\data-science-processing-workload\launcher.py", line 61, in run
    res = census_run(self._datafile)
  File "C:\Users\gshim\Documents\work\data-science-processing-workload\benchmarks\census.py", line 248, in run
    (_, X, y), res["ETL"] = measure(etl, df)
  File "C:\Users\gshim\Documents\work\data-science-processing-workload\benchmarks\census.py", line 240, in measure
    res = func(*args, **kw)
  File "C:\Users\gshim\Documents\work\data-science-processing-workload\benchmarks\census.py", line 166, in etl
    df = df[df["EDUC"] != -1]
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\pandas\base.py", line 3200, in __getitem__
    return self._getitem(key)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\pandas\dataframe.py", line 2985, in _getitem
    return self._getitem_column(key)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\pandas\dataframe.py", line 2401, in _getitem_column
    s = DataFrame(
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\pandas\dataframe.py", line 2035, in squeeze
    return Series(query_compiler=self._query_compiler)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\pandas\series.py", line 130, in __init__
    self._query_compiler = query_compiler.columnarize()
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 708, in columnarize
    len(self.index) == 1 and self.index[0] == MODIN_UNNAMED_SERIES_LABEL
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 82, in <lambda>
    return lambda self: self._modin_frame.index
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 366, in _get_index
    self._index_cache, row_lengths = self._compute_axis_labels_and_lengths(0)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 458, in _compute_axis_labels_and_lengths
    new_index, internal_idx = self._partition_mgr_cls.get_indices(axis, partitions)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\dataframe\pandas\partitioning\partition_manager.py", line 865, in get_indices
    new_idx = cls.get_objects_from_partitions(new_idx)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition_manager.py", line 114, in get_objects_from_partitions
    return RayWrapper.materialize(
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\modin\core\execution\ray\common\engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\ray\_private\worker.py", line 2283, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\ray\_private\worker.py", line 668, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python\ray\_raylet.pyx", line 1445, in ray._raylet.CoreWorker.get_objects

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\gshim\AppData\Local\miniconda\envs\modin-0.17.0-ray-2.1.0\lib\site-packages\ray\_private\worker.py", line 1590, in shutdown
    time.sleep(0.5)
KeyboardInterrupt
Windows fatal exception: access violation
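
Since most crashes leave little or no traceback, one option is to have Python's standard `faulthandler` module write the stacks of all threads to a file when a fatal error occurs. The following is a minimal sketch, not something the reporter ran; the helper name `enable_crash_traces` and the log file name are hypothetical, and it only covers the process that calls it (for example the driver in `launcher.py`), not Ray worker processes.

```python
import faulthandler


def enable_crash_traces(log_path="crash_trace.log"):
    """Ask faulthandler to dump every thread's Python stack to log_path on a fatal error.

    The returned file object must stay open for as long as the handler is enabled,
    because faulthandler writes to its file descriptor at crash time.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage near the top of the driver script:
# crash_log = enable_crash_traces("crash_trace.log")
```

The file is a deliberate choice over stderr here: console output can be lost when the process dies abruptly, while a log file can be inspected afterwards.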

Versions / Dependencies

Python 3.9.13
Modin 0.17.0
Ray 2.0.1, 2.1.0

Modin 0.17.0 is currently incompatible with the latest version of Ray, 2.1.0. The reason is not that Modin fails to work with the most recent Ray, but a conflict over the redis version dependency. To test 2.1.0 I used the following commands:

  1. Create conda environment with Ray 2.0.1
    conda create -n modin-0.17.0-new -c conda-forge --experimental-solver=libmamba modin-all=0.17.0 modin-ray=0.17.0 scikit-learn-intelex xgboost ipython ray-core=2.0.1
  2. Activate conda environment
    conda activate modin-0.17.0-new
  3. Install Ray 2.1.0
    pip install ray==2.1.0

    This works because pip doesn't check for a version conflict with the already installed Modin version. The same behavior is observed with Ray 2.0.1, and I think with some older versions as well.
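
Because the steps above mix conda and pip installs, it can help to confirm which versions actually ended up in the environment. This is a small sketch using the standard `importlib.metadata` module; the helper name `installed_versions` is hypothetical, and the distribution names `modin` and `ray` are assumed to match what the installers registered.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Hypothetical usage inside the activated environment:
# print(installed_versions(["modin", "ray"]))
```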

Reproduction script

The workload is a benchmark script that uses Modin. Please clone it from my repo, https://github.com/gshimansky/data-science-processing-workload, and run it with python launcher.py -m census or python launcher.py -m taxi. The access violation happens at random points during execution.
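
Since the crash is intermittent, a small wrapper that reruns the workload until it fails may save time when reproducing it. This is only a sketch: the helper names and the `max_runs` default are hypothetical. The constant is the Windows NTSTATUS value for an access violation, but whether the driver process exits with that code in this scenario is an assumption, because the fault may occur in a Ray worker instead.

```python
import subprocess
import sys

# Windows NTSTATUS code for STATUS_ACCESS_VIOLATION.
ACCESS_VIOLATION = 0xC0000005


def run_until_failure(cmd, max_runs=20):
    """Run cmd up to max_runs times.

    Returns (run_number, returncode) for the first run that exits non-zero,
    or None if every run succeeded.
    """
    for run in range(1, max_runs + 1):
        code = subprocess.run(cmd).returncode
        if code != 0:
            return run, code
    return None


def is_access_violation(code):
    """Check an exit code against the access-violation status.

    Masking to 32 bits makes a signed (negative) report of the same status match too.
    """
    return (code & 0xFFFFFFFF) == ACCESS_VIOLATION


# Hypothetical usage from the cloned repository directory:
# failure = run_until_failure([sys.executable, "launcher.py", "-m", "census"])
# if failure is not None:
#     print(failure, is_access_violation(failure[1]))
```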

Issue Severity

High: It blocks me from completing my task.

gshimansky commented 1 year ago

I've reproduced this problem on Windows 10 too.

aregm commented 1 year ago

@sven1977 @rshin @robertnishihara - why is this issue marked as P2 priority? If there are no plans for supporting Ray as a first-class platform, please let me know. If there are, then please, let us know the adjusted priority and release plan. Thanks.