ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Windows fatal exception: access violation #22685

Closed babyblue1334 closed 2 years ago

babyblue1334 commented 2 years ago

Search before asking

Ray Component

Ray Core

What happened + What you expected to happen

(pid=None) [2022-02-28 09:19:28,849 C 12116 13028] dlmalloc.cc:121: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
(pid=None) *** StackTrace Information ***
(pid=None)
(pid=12232) edge_detect_time: top 5.417430400848389 id: M1_1646011148510_RO_R
(pid=12232) edge_ret {'img_id': 'M1_1646011148510_RO_R', 'img': 'H:/SmartScan/data/Img_edge/M1_1646011148510_RO_R_top.jpeg', 'direction': 'top', 'profile': 'H:/SmartScan/data/ProfileData/M1_1646011148510_RO_R_top.data'}
(pid=12124) yh_get_edge except
(pid=12232) Stack (most recent call first):
(pid=12232)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\_private\utils.py", line 117 in push_error_to_driver
(pid=12232)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\worker.py", line 432 in main_loop
(pid=12232)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\workers/default_worker.py", line 218 in
(pid=12124) edge_detect_time: bottom 7.907793283462524 id: M1_1646011148510_RO_R
(pid=12124) edge_ret {'img_id': 'M1_1646011148510_RO_R', 'img': 'H:/SmartScan/data/Img_edge/M1_1646011148510_RO_R_bottom.jpeg', 'direction': 'bottom', 'profile': 'H:/SmartScan/data/ProfileData/M1_1646011148510_RO_R_bottom.data'}
(pid=12124) Stack (most recent call first):
(pid=12124)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\_private\utils.py", line 117 in push_error_to_driver
(pid=12124)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\worker.py", line 432 in main_loop
(pid=12124)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\workers/default_worker.py", line 218 in
(pid=13860) img_detect_time 9.577348709106445 id: M1_1646011130684_RO_R
(pid=13860) Stack (most recent call first):
(pid=13860)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\_private\utils.py", line 117 in push_error_to_driver
(pid=13860)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\worker.py", line 432 in main_loop
(pid=13860)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\workers/default_worker.py", line 218 in
(pid=14784) img_detect_time 11.887166261672974 id: M1_1646011130684_RO_R
(pid=14784) Stack (most recent call first):
(pid=14784)   File "C:\Users\86177\Ap
(pid=14784) ption_2
(pid=14784) Dtion_2
(pid=14784) ation_2
(pid=14784) ttion_2
(pid=14784) a\ion_2
(pid=14784) Local\Programs\Python\Python38\lib\site-packages\ray\_private\utils.py", line 117 in push_error_to_driver
(pid=14784)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\worker.py", line 432 in main_loop
(pid=14784)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\workers/default_worker.py", line 218 in
(pid=8524) img_detect_time 118.62317728996277 id: M1_1646011060111_RO_R
(pid=8524) Stack (most recent call first):
(pid=8524)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\_private\utils.py", line 117 in push_error_to_driver
(pid=8524)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\worker.py", line 432 in main_loop
(pid=8524)   File "C:\Users\86177\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\workers/default_worker.py", line 218 in
2022-02-28 09:19:57,744 WARNING worker.py:1257 -- The node with node id: b6ee2039e7fe0eb0a19fd13266454e317beaa0566b4c36d2475c4b58 and ip: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Windows fatal exception: access violation

Versions / Dependencies

Python 3.7, Windows 10, Ray 1.9.2

Reproduction script

We are using Ray for an object defect detection project. We send a defect detection command every 10 s, and the program executes the defect identification code asynchronously; each defect detection takes 10-20 s. The run always ends up reporting an error: WARNING worker.py:1257 -- The node with node id: b6ee2039e7fe0eb0a19fd13266454e317beaa0566b4c36d2475c4b58 and ip: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats. Windows fatal exception: access violation
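A rough sketch of the workflow described above (the detect_defects function, its 15 s runtime, and the image paths are placeholders, not the project's actual code):

import time

import ray

ray.init()

@ray.remote
def detect_defects(img_path):
    # Placeholder for the 10-20 s defect-identification step.
    time.sleep(15)
    return {"img": img_path, "defects": []}

# A detection command is issued every 10 s; the tasks run asynchronously,
# so results are collected after the submit loop instead of blocking it.
pending = []
for i in range(20):
    pending.append(detect_defects.remote(f"H:/SmartScan/data/img_{i}.jpeg"))
    time.sleep(10)

results = ray.get(pending)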

Anything else

No response

Are you willing to submit a PR?

rkooo567 commented 2 years ago

cc @pcmoritz @wuisawesome? Is this error fixed in the latest version?

pcmoritz commented 2 years ago

@babyblue1334 Do you have a reproduction for the issue (e.g. a minimal example that reproduces this behavior)?

mattip commented 2 years ago

From the traceback, it seems like ray is trying to recover from an error around the use of edge_detect_time, which is, as far as I can tell, not a ray function. @babyblue1334 is that a correct analysis? If so, maybe the problem is not in ray.

alexsaurber commented 2 years ago

I have the same error as @babyblue1334 when trying to read images. I have sets (of varying sizes) of images that get read in parallel and combined. I get GetLastError() = 1455 or GetLastError() = 1450 with a similar traceback: consistently on sets with 600+ images, inconsistently on sets with 400-600 images, and not at all (yet) on sets with fewer than 400 images.

I have also confirmed that it is not the images themselves being corrupted, and that it is in fact a problem with the number of images, by rerunning my script with duplicated data: the same set with twice the number of images (each one copied once) will also hit this error. Although the traceback suggests a memory problem, manually allocating memory via the ray.remote decorator does not fix this. My machine has 32 GB of RAM and usually has around 11 GB of standby memory left when reading images. It crashes on the ray.get() call. This can sometimes be mitigated with a ray.wait(), but not always.
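For reference, one way to bound how many results are materialized at once is to fetch them in small batches with ray.wait() before calling ray.get(); a rough sketch (the helper name and batch size are illustrative, not part of the reproducer below):

import ray

def get_in_batches(refs, batch_size=50):
    # Fetch results a few at a time instead of one large ray.get() over all refs.
    results = []
    remaining = list(refs)
    while remaining:
        ready, remaining = ray.wait(remaining, num_returns=min(batch_size, len(remaining)))
        results.extend(ray.get(ready))
    return results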

mattip commented 2 years ago

@alexsaurber would it be possible to get a reproducer and the traceback?

alexsaurber commented 2 years ago

@mattip Sure thing. See below

I had to add extra images to make it crash, but I got it. My work computer has a 12-core CPU and 32 GB RAM. My home computer has an 8-core CPU and 32 GB RAM.

It took about 800-1000 images to crash my home computer.

The first part generates bogus images. The second part reads them and finds the max. I know the np.maximum section is not optimized, but I am not worried about that yet.

import cv2
import numpy as np
import ray
import glob
import os

# Set up params
image_resolution = (4024,3036)

## Comment out this block if you already have data ##
# Generate images to use as examples.
ray.init()
os.makedirs('test', exist_ok=True)  # make sure the output folder exists before workers write to it
@ray.remote
def make_data(i):
    img = np.uint8(np.random.randint(255, size=image_resolution))
    imgStr = f'img{str(i).zfill(3)}.tiff'
    saved = cv2.imwrite(f'test/{imgStr}', img)
    return saved

print('Generating Data')
save_ids = []
for i in range(1000):
    save_ids.append(make_data.remote(i))

# Block until data is saved
ray.get(save_ids)
## -- ##

# Set up function to open images
# ray.init()
@ray.remote
def open_images(path):
    img = np.empty(image_resolution,dtype='uint8')
    img = cv2.imread(path,cv2.IMREAD_GRAYSCALE)
    return img

# Get list of images in test folder that follow the format: img*.tiff
print('Generating Image List')
regex_list = glob.glob('test/img*.tiff')
num_images = len(regex_list)
# Read each file in the regex_list
print('Reading Images')
image_ids = []
for file in regex_list:
    image_ids.append(open_images.remote(file))

# Block until (nearly) all image tasks have finished, then fetch the images from the object store
ray.wait(image_ids, num_returns=(num_images - 1))
print('ray.get ...')
image_list = ray.get(image_ids)

# Find max across each pixel location
print('Processing images')
result = np.zeros(image_resolution)
for image in image_list:
    result = np.maximum(result,image)

# Save result image
# cv2.imshow('result',result)
# cv2.waitKey(0)
cv2.imwrite('result.tiff',result)
print('Complete')

This is the stack trace I received on error:

(pid=) [2022-04-05 19:24:06,125 C 16288 16700] (raylet.exe) dlmalloc.cc:125:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450
(pid=) *** StackTrace Information ***
(pid=)     configthreadlocale
(pid=)     BaseThreadInitThunk
(pid=)     RtlUserThreadStart
(pid=)
2022-04-05 19:24:34,964 WARNING worker.py:1326 -- The node with node id: 524d895f17658997724bc4681ec690f5b9b0c65a53b391a5c7b0b252 and ip: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Windows fatal exception: access violation

From all of my research, I have found that this is a Windows-specific issue. If I can find the issue that discussed a similar problem, I will amend my post.

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

DanielAtt2000 commented 1 year ago

Having a similar issue. Using Ray 2.1.0, 16 GB RAM.

As inputs to my custom environment I have a Box of shape (7, 1) of floats and another Box of shape (14000, 6), combined together in a Dict.
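For reference, a rough sketch of that observation space (the field names are illustrative; this uses gymnasium.spaces, while Ray 2.1.0 would likely still have used gym.spaces):

import numpy as np
from gymnasium import spaces

# Dict observation space combining the two Boxes described above.
observation_space = spaces.Dict({
    "params": spaces.Box(low=-np.inf, high=np.inf, shape=(7, 1), dtype=np.float32),
    "points": spaces.Box(low=-np.inf, high=np.inf, shape=(14000, 6), dtype=np.float32),
})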

After around 28K timesteps, the following error crops up.

What actually causes the problem? Is there a possible solution to it?

(pid=) [2022-11-15 23:42:49,107 C 17284 3760] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450
(pid=) *** StackTrace Information ***
(pid=)     unknown
(pid=)     ... (the "unknown" frame repeats many more times) ...
(pid=)     configthreadlocale
(pid=)     BaseThreadInitThunk
(pid=)     RtlUserThreadStart

jonar-ch commented 1 year ago

Same issue here. Python 3.9.13, Ray 2.3.1.

yolking commented 12 months ago

I was also getting the 1450 error for specific large data passed to ray.put, while having a lot of free RAM and without any obvious indication of why. After looking through the logs, I got the idea that Ray was writing objects to a temp directory on disk anyway, and that was the issue: if you don't have enough free space on disk, it will fail to cache objects to temp and throw the 1450 error no matter how much free RAM you have. I freed space on disk and the problem was gone. However, if the input object is even bigger, I get the error again. Since there is not enough memory in the plasma object store, the solution is to increase its size. By default in my case Ray was using a 17 GB store, but I had a 25 GB object. Adding object_store_memory=26*1000*1000*1000 to ray.init solved it for me again.
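A minimal sketch of that workaround (the 26 GB figure matches the 25 GB object mentioned above; the payload here is only a placeholder):

import numpy as np
import ray

# Make the plasma object store larger than the biggest object you plan to ray.put().
ray.init(object_store_memory=26 * 1000 * 1000 * 1000)

ref = ray.put(np.zeros((1000, 1000), dtype=np.float32))  # placeholder payload
value = ray.get(ref)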

Liquidmasl commented 3 months ago

I also get this error A LOT. I don't see why this is happening at all.

mattip commented 3 months ago

@Liquidmasl this issue is closed. It can have many causes: a lack of RAM, a lack of disk space, thread oversubscription. Please open a new issue with a minimum reproducer and a description of what resources your machine has.

Liquidmasl commented 3 months ago

> @Liquidmasl this issue is closed. It can have many causes: a lack of RAM, a lack of disk space, thread oversubscription. Please open a new issue with a minimum reproducer and a description of what resources your machine has.

Yes, sorry, I got desperate as I did not find a solution or even a clear cause on Google or in the documentation.

I did make a new issue here: https://github.com/ray-project/ray/issues/46990