google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0
Non-deterministic segfault #2482

Status: Closed (orf closed 3 years ago)

orf commented 3 years ago

Describe the current behavior:

We have a custom face analysis pipeline (attached at the end), but have no custom C/C++ code. We are building MediaPipe on Debian Stretch and have made no customizations to the build process - we use the stock docker container to run setup.py bdist_wheel.

We read a video with pyav and pass it into the model, as below. We do this with a single thread, iterating over batches of 25 videos:

import numpy as np

# face_analysis, videos and get_frames come from our own code base.
calculator = face_analysis.SingleFaceAnalysisCpu(
    origin_point_location=face_analysis.OriginPointLocation.TOP_LEFT_CORNER,
    vertical_fov_degrees=63.0,  # 63 degrees
    near=1.0,  # 1cm
    far=1000.0,  # 10m
)

for video in videos:
    frames = get_frames(video)
    for frame in frames:
        calculator.process(np.asarray(frame.content))

The graph (and the .pbtxt single_face_analysis_cpu.txt):

mediapipe_simple_subgraph(
    name = "single_face_analysis_cpu",
    graph = "single_face_analysis_cpu.pbtxt",
    register_as = "SingleFaceAnalysisCpu",
    deps = [
        "//mediapipe/calculators/core:concatenate_vector_calculator",
        "//mediapipe/calculators/core:constant_side_packet_calculator",
        "//mediapipe/calculators/core:split_landmarks_calculator",
        "//mediapipe/calculators/core:split_vector_calculator",
        "//mediapipe/calculators/image:image_properties_calculator",
        "//mediapipe/modules/face_detection:face_detection_short_range_cpu",
        "//mediapipe/modules/face_geometry:face_geometry_from_landmarks",
        "//mediapipe/modules/face_landmark:face_detection_front_detection_to_roi",
        "//mediapipe/modules/face_landmark:face_landmark_cpu",
        "//mediapipe/modules/iris_landmark:iris_landmark_left_and_right_cpu",
        "//mediapipe/graphs/iris_tracking/calculators:update_face_landmarks_calculator",
    ],
)

When we run this over a large set of videos on our cluster we are seeing spurious segfaults:

Fatal Python error: Segmentation fault

Thread 0x00007f3c6f5fe700 (most recent call first):
File "/root/.cache/pypoetry/virtualenvs/src-LSA000g6-py3.8/lib/python3.8/site-packages/mediapipe/python/solution_base.py", line 334 in process
...
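The trace above has the shape of output from Python's standard faulthandler module. As a minimal sketch (not taken from the original report), the snippet below shows one way to send those tracebacks to a log file and to record which input was being processed just before a crash. The helper names enable_crash_tracebacks and note_current_input, and the log path, are hypothetical.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Enable faulthandler for all threads.

    Tracebacks go to log_path if given, otherwise to stderr.
    The returned file object must stay open for as long as
    faulthandler is enabled.
    """
    log_file = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


def note_current_input(name, log_file):
    """Write and flush a marker line so it survives a hard crash."""
    log_file.write(f"processing {name}\n")
    log_file.flush()
```

Calling note_current_input(video, log_file) at the top of the per-video loop means the last marker line in the log names the input that was in flight when the process died.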

I am still trying to correlate these segfaults to particular inputs, but I cannot seem to do so. The segfault is not reproducible at all. The only hint I have is that there is sometimes a Python garbage collection before the segfault:

INFO - full garbage collection released 42.54 MiB from 751 reference cycles (threshold: 9.54 MiB)

It does seem that the same set of 25 videos, when passed through process() in sequence, will sometimes cause a segfault, whereas other sets of 25 videos will not. The exact video that triggers the segfault is non-deterministic, though, and it is sometimes the first video of the set.
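One way to narrow this down, sketched here under assumptions rather than taken from the report, is to process each input in its own child process and check whether that process was killed by SIGSEGV. The function names and the worker_script argument are hypothetical. The worker script is assumed to take one input path as its only argument, and the exit-code convention shown applies to POSIX systems.

```python
import signal
import subprocess
import sys


def run_isolated(argv):
    """Run a command in a child process and return its exit code."""
    return subprocess.run(argv).returncode


def died_from_segfault(returncode):
    """True if the child was killed by SIGSEGV.

    On POSIX, subprocess reports death by signal as the negated
    signal number.
    """
    return returncode == -signal.SIGSEGV


def find_crashing_inputs(worker_script, inputs):
    """Return the inputs whose worker process was killed by SIGSEGV."""
    crashed = []
    for item in inputs:
        code = run_isolated([sys.executable, worker_script, str(item)])
        if died_from_segfault(code):
            crashed.append(item)
    return crashed
```

Because the crash is non-deterministic, a single pass may not flag anything. Repeating the run over the same batch and counting how often each input is flagged gives a rough picture of whether particular videos are involved at all.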

Standalone code to reproduce the issue:

I cannot provide this. The videos we are processing are also of a personal nature and cannot be uploaded here. However I am willing to provide any additional information that might be required, including core-dumps if I can send them privately.

sgowroji commented 3 years ago

Hi @orf, could you please share more details about the solution you are referring to in the use case above? Thanks!

orf commented 3 years ago

Can you explain exactly what you want? I did not write the custom solution, so my knowledge is limited; I am only observing the segfaults. I've attached all the customisations to the original post.

It seems most of the segfaults come from the "wait for idle" method, but some occur when feeding data into the graph.

Re-creating the graph for every inference does not reduce the segfaults.

orf commented 3 years ago

This appears to be related to https://github.com/google/mediapipe/issues/2250, but we are using version 0.8.7.1.

sgowroji commented 3 years ago

Hi @orf, are you trying to build the Selfie Segmentation Python solution?

jiuqiant commented 3 years ago

The GIL-related fix is currently only in the official Python v0.8.7.1 binaries, which are available on PyPI. The source code will go out in the next release.

orf commented 3 years ago

Thank you! That makes sense. Sorry, I thought the fix was included in the current code on GitHub. Can you share an ETA for when the fix will be pushed?

jiuqiant commented 3 years ago

The fix is released in https://github.com/google/mediapipe/commit/6abec128edd6d037e1a988605a59957c22f1e967. Please pull the latest version and build the python package again. Thanks.

sgowroji commented 3 years ago

Hi @orf, did you get a chance to pull the latest version as mentioned in the comment above? Thanks!

google-ml-butler[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 3 years ago

Closing as stale. Please reopen if you'd like to work on this further.
