nv-morpheus / Morpheus

Morpheus SDK
Apache License 2.0

[BUG]: Error running DFP Duo training pipeline in AWS p3.16xlarge instance #1805

Open dnandakumar-nv opened 3 months ago

dnandakumar-nv commented 3 months ago

Version

24.10

Which installation method(s) does this occur on?

Docker, Source

Describe the bug.

Unable to run the DFP Duo training pipeline (https://github.com/nv-morpheus/Morpheus/blob/branch-24.10/examples/digital_fingerprinting/production/morpheus/dfp_duo_pipeline.py) on an AWS virtual machine with the following config:

Seeing the following error:

[Screenshot of the error: IMG_7545]

This suggests to me that Morpheus is throwing an error because a GPU could be associated with more than one NUMA node. The relevant code:

    // for each gpu in topology determine which numa node the gpu belongs
    // the number of entries in the SharedResourcesBitMap denotes the number of NUMA nodes that have at least one device
    for (const auto& [gpu_id, info] : topology.gpu_info())
    {
        NumaSet node_set;
        auto rc = hwloc_cpuset_to_nodeset(topology.handle(), &info.cpu_set().bitmap(), &node_set.bitmap());
        CHECK_NE(rc, -1);
        if (node_set.weight() != 0)
        {
            CHECK_EQ(node_set.weight(), 1);
            gpus_per_numa_node.insert(node_set, gpu_id);
        }
    }
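
To make the suspected failure mode concrete, here is a minimal sketch of the same idea without hwloc. It is a simplified model, not the MRC implementation: `CpuSet`, `NodeSet`, `cpuset_to_nodeset`, `gpus_spanning_multiple_nodes`, and the `cpu_to_node` table are all hypothetical names introduced for illustration. It assumes each GPU has a set of local CPUs and flags any GPU whose CPUs fall on more than one NUMA node, which is the condition under which a check like `CHECK_EQ(node_set.weight(), 1)` would fail.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-ins for the hwloc-backed types in the excerpt above.
using CpuSet  = std::set<int>;  // logical CPU ids local to one GPU
using NodeSet = std::set<int>;  // NUMA node ids

// Simplified analogue of hwloc_cpuset_to_nodeset: collect the NUMA node of
// every CPU in the set. cpu_to_node[i] is the NUMA node that owns logical
// CPU i; CPU ids outside the table are ignored.
NodeSet cpuset_to_nodeset(const CpuSet& cpus, const std::vector<int>& cpu_to_node)
{
    NodeSet nodes;
    for (int cpu : cpus)
    {
        if (cpu >= 0 && static_cast<std::size_t>(cpu) < cpu_to_node.size())
        {
            nodes.insert(cpu_to_node[static_cast<std::size_t>(cpu)]);
        }
    }
    return nodes;
}

// Returns the ids of GPUs whose CPU affinity spans more than one NUMA node,
// i.e. the GPUs for which an "exactly one node" check would fail.
std::vector<int> gpus_spanning_multiple_nodes(const std::map<int, CpuSet>& gpu_cpu_sets,
                                              const std::vector<int>& cpu_to_node)
{
    std::vector<int> offenders;
    for (const auto& entry : gpu_cpu_sets)
    {
        if (cpuset_to_nodeset(entry.second, cpu_to_node).size() > 1)
        {
            offenders.push_back(entry.first);
        }
    }
    return offenders;
}
```

For example, with a made-up table where CPUs 0-3 sit on node 0 and CPUs 4-7 on node 1, a GPU whose local CPUs are 0-3 passes, while a GPU whose local CPUs are 0-7 is reported as spanning two nodes. Whether the p3.16xlarge actually reports GPU affinities like this is an assumption that would need to be confirmed on the instance.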

I am unable to replicate the error on a DGX with one or multiple A100s attached to the container. The issue persists on a container built from source as well as on NVAIE container versions 24.06 and 24.03.

Minimum reproducible example

No response

Relevant log output

 [Paste the error here, it will be hidden by default]

Full env printout

 [Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

No response

morpheus-bot-test[bot] commented 3 months ago

Hi @dnandakumar-nv!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.

efajardo-nv commented 2 months ago

The failed check (in the code referenced above) is here: https://github.com/nv-morpheus/MRC/blob/branch-24.10/cpp/mrc/src/internal/system/partitions.cpp#L83