Open dnandakumar-nv opened 3 months ago
Hi @dnandakumar-nv!
Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.
The failed check (the code referenced above) is here: https://github.com/nv-morpheus/MRC/blob/branch-24.10/cpp/mrc/src/internal/system/partitions.cpp#L83
Version
24.10
Which installation method(s) does this occur on?
Docker, Source
Describe the bug.
Unable to run the DFP Duo Training Pipeline (https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/examples/digital_fingerprinting/production/morpheus/dfp_duo_pipeline.py) on an AWS virtual machine with the following config:
Seeing the following error:
This suggests to me that Morpheus is throwing an error because the GPUs could be connected to more than one NUMA node.
I am unable to replicate the error on a DGX with one or multiple A100s attached to the container. The issue persists with a container built from source as well as with NVAIE container versions 24.06 and 24.03.
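To compare the AWS VM against the DGX, it may help to look at the NUMA node the Linux kernel reports for each GPU. Below is a minimal diagnostic sketch, not part of Morpheus or MRC, that reads the standard sysfs attributes `vendor`, `class`, and `numa_node` under `/sys/bus/pci/devices`. The helper names (`read_text`, `gpu_numa_nodes`) are made up for illustration, and I have not confirmed that these values are what the failed check actually inspects. A value of `-1` means the kernel has no NUMA affinity recorded for that device.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA display-class PCI address to its reported NUMA node."""
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for dev in sorted(root.iterdir()):
        vendor = read_text(dev / "vendor")
        pci_class = read_text(dev / "class") or ""
        # PCI class codes beginning with 0x03 are display controllers
        # (for example 0x0300 VGA and 0x0302 3D controller).
        if vendor != NVIDIA_VENDOR_ID or not pci_class.startswith("0x03"):
            continue
        numa = read_text(dev / "numa_node")
        nodes[dev.name] = int(numa) if numa is not None else None
    return nodes


if __name__ == "__main__":
    for address, node in gpu_numa_nodes().items():
        print(f"{address}: numa_node={node}")
```

Running this inside the container on both machines and comparing the output (together with `nvidia-smi topo -m`) would show whether the two hosts report GPU NUMA affinity differently.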
Minimum reproducible example
No response
Relevant log output
Click here to see error details
Full env printout
Click here to see environment details
Other/Misc.
No response