microsoft / microxcaling

PyTorch emulation library for Microscaling (MX)-compatible data formats
MIT License

Activations unit tests (gelu, tanh) hang on both A10 and A100 gpus... #8

Closed lessw2020 closed 1 year ago

lessw2020 commented 1 year ago

I ran the unit test suites, but they hung. Debugging a bit, I isolated the hang to test_activations.py, specifically the gelu and tanh unit tests.

(screenshot: a100_test_gelu_hang)

and

(screenshot: mx_tanh_hang)

I thought perhaps it was due to running on A10 GPUs, so I switched to an A100 but hit the exact same hangs.

I'm running on the latest PyTorch nightlies; not sure if that has an impact here or not. Manually running the other tests relevant for LLMs, they all passed with no issues.

rizhao-msft commented 1 year ago

It looks like the first GPU unit test that uses the custom CUDA code is hanging. Usually the first such test takes a minute because it builds the CUDA code as a PyTorch extension; if it hangs for more than a minute, this could be an error building that code.

Try doing Ctrl+C when it hangs, then scroll all the way up (there will be a bunch of pytest messages, but the first message will have the CUDA build error if it is a build issue). There are likely a bunch of CUDA warnings too.

Can you make sure you have the ninja package installed?
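For reference, a minimal way to surface a build error outside of pytest is to trigger the import directly; this is just a sketch (it assumes the repo root is on PYTHONPATH so that mx is importable):

# Check that ninja is visible to PyTorch, then trigger the JIT build directly
# so any compiler error is printed to the console instead of hiding behind pytest.
from torch.utils.cpp_extension import is_ninja_available

print("ninja available:", is_ninja_available())

# The first import builds the extension; taking around a minute is normal.
import mx.custom_extensions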

lessw2020 commented 1 year ago

Hi @rizhao-msft - thanks for the info above.

I found that when running with a single GPU (A100), the tests all pass with just some warnings about pkg namespaces:

test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging  # type: ignore[attr-defined]

test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.cloud')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2350: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(parent)

test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 186 passed, 16 warnings in 33.01s
lessw2020 commented 1 year ago

I do have ninja installed on both servers. Running with 4 or 8 GPUs, I get the hang. Using the Ctrl+C break you mentioned results in this trace:


(pytorch) ubuntu@ip-172-31-66-198:~/microxcaling/mx/tests$ pytest test_activations.py -vs --full-trace
================================================================== test session starts ==================================================================
platform linux -- Python 3.9.16, pytest-7.4.2, pluggy-1.2.0 -- /opt/conda/envs/pytorch/bin/python3.9
cachedir: .pytest_cache
rootdir: /home/ubuntu/microxcaling/mx/tests
plugins: anyio-3.6.2, mock-3.8.2, cov-4.1.0
collected 144 items                                                                                                                                     

test_activations.py::test_activation[cpu-False-True-10-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-True-10-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-True-10-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-True-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-True-10-silu-silu] PASSED
test_activations.py::test_activation[cpu-False-True-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-True-10000-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-True-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-True-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-True-10000-silu-silu] PASSED
test_activations.py::test_activation[cpu-False-False-10-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-False-10-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-False-10-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-False-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-False-10-silu-silu] PASSED
test_activations.py::test_activation[cpu-False-False-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-False-10000-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-False-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-False-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-False-10000-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-True-10-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-True-10-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-True-10-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-True-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-True-10-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-True-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-True-10000-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-True-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-True-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-True-10000-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-False-10-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-False-10-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-False-10-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-False-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-False-10-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-False-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-False-10000-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-False-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-False-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-False-10000-silu-silu] PASSED
test_activations.py::test_activation[cuda-True-True-10-tanh-tanh] ^C

=================================================================== warnings summary ====================================================================
test_activations.py::test_activation[cuda-True-True-10-tanh-tanh]
  /opt/conda/envs/pytorch/lib/python3.9/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

test_activations.py::test_activation[cuda-True-True-10-tanh-tanh]
  /opt/conda/envs/pytorch/lib/python3.9/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

test_activations.py::test_activation[cuda-True-True-10-tanh-tanh]
  /opt/conda/envs/pytorch/lib/python3.9/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

config = <_pytest.config.Config object at 0x7f0dd3a880d0>, doit = <function _main at 0x7f0dd340bca0>

    def wrap_session(
        config: Config, doit: Callable[[Config, "Session"], Optional[Union[int, ExitCode]]]
    ) -> Union[int, ExitCode]:
        """Skeleton command line program."""
        session = Session.from_config(config)
        session.exitstatus = ExitCode.OK
        initstate = 0
        try:
            try:
                config._do_configure()
                initstate = 1
                config.hook.pytest_sessionstart(session=session)
                initstate = 2
>               session.exitstatus = doit(config, session) or 0

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/main.py:271: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = <_pytest.config.Config object at 0x7f0dd3a880d0>, session = <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>

    def _main(config: Config, session: "Session") -> Optional[Union[int, ExitCode]]:
        """Default command line protocol for initialization, session,
        running tests and reporting."""
        config.hook.pytest_collection(session=session)
>       config.hook.pytest_runtestloop(session=session)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/main.py:325: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtestloop'>, kwargs = {'session': <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>}
firstresult = True

    def __call__(self, **kwargs: object) -> Any:
        assert (
            not self.is_historic()
        ), "Cannot directly call a historic hook - use call_historic instead."
        self._verify_all_args_are_provided(kwargs)
        firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
>       return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_runtestloop'
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages...t/main.py'>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f0dd2fd4070>>]
kwargs = {'session': <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>}, firstresult = True

    def _hookexec(
        self,
        hook_name: str,
        methods: Sequence[HookImpl],
        kwargs: Mapping[str, object],
        firstresult: bool,
    ) -> object | list[object]:
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook_name, methods, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

session = <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>

    def pytest_runtestloop(session: "Session") -> bool:
        if session.testsfailed and not session.config.option.continue_on_collection_errors:
            raise session.Interrupted(
                "%d error%s during collection"
                % (session.testsfailed, "s" if session.testsfailed != 1 else "")
            )

        if session.config.option.collectonly:
            return True

        for i, item in enumerate(session.items):
            nextitem = session.items[i + 1] if i + 1 < len(session.items) else None
>           item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/main.py:350: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtest_protocol'>
kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>, 'nextitem': <Function test_activation[cuda-True-True-10-relu-relu]>}
firstresult = True

    def __call__(self, **kwargs: object) -> Any:
        assert (
            not self.is_historic()
        ), "Cannot directly call a historic hook - use call_historic instead."
        self._verify_all_args_are_provided(kwargs)
        firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
>       return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_runtest_protocol'
methods = [<HookImpl plugin_name='runner', plugin=<module '_pytest.runner' from '/opt/conda/envs/pytorch/lib/python3.9/site-pack...s', plugin=<module '_pytest.warnings' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/warnings.py'>>]
kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>, 'nextitem': <Function test_activation[cuda-True-True-10-relu-relu]>}
firstresult = True

    def _hookexec(
        self,
        hook_name: str,
        methods: Sequence[HookImpl],
        kwargs: Mapping[str, object],
        firstresult: bool,
    ) -> object | list[object]:
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook_name, methods, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, nextitem = <Function test_activation[cuda-True-True-10-relu-relu]>

    def pytest_runtest_protocol(item: Item, nextitem: Optional[Item]) -> bool:
        ihook = item.ihook
        ihook.pytest_runtest_logstart(nodeid=item.nodeid, location=item.location)
>       runtestprotocol(item, nextitem=nextitem)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, log = True, nextitem = <Function test_activation[cuda-True-True-10-relu-relu]>

    def runtestprotocol(
        item: Item, log: bool = True, nextitem: Optional[Item] = None
    ) -> List[TestReport]:
        hasrequest = hasattr(item, "_request")
        if hasrequest and not item._request:  # type: ignore[attr-defined]
            # This only happens if the item is re-run, as is done by
            # pytest-rerunfailures.
            item._initrequest()  # type: ignore[attr-defined]
        rep = call_and_report(item, "setup", log)
        reports = [rep]
        if rep.passed:
            if item.config.getoption("setupshow", False):
                show_test_item(item)
            if not item.config.getoption("setuponly", False):
>               reports.append(call_and_report(item, "call", log))

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:133: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, when = 'call', log = True, kwds = {}

    def call_and_report(
        item: Item, when: "Literal['setup', 'call', 'teardown']", log: bool = True, **kwds
    ) -> TestReport:
>       call = call_runtest_hook(item, when, **kwds)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:222: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, when = 'call', kwds = {}
reraise = (<class '_pytest.outcomes.Exit'>, <class 'KeyboardInterrupt'>)

    def call_runtest_hook(
        item: Item, when: "Literal['setup', 'call', 'teardown']", **kwds
    ) -> "CallInfo[None]":
        if when == "setup":
            ihook: Callable[..., None] = item.ihook.pytest_runtest_setup
        elif when == "call":
            ihook = item.ihook.pytest_runtest_call
        elif when == "teardown":
            ihook = item.ihook.pytest_runtest_teardown
        else:
            assert False, f"Unhandled runtest hook case: {when}"
        reraise: Tuple[Type[BaseException], ...] = (Exit,)
        if not item.config.getoption("usepdb", False):
            reraise += (KeyboardInterrupt,)
>       return CallInfo.from_call(
            lambda: ihook(item=item, **kwds), when=when, reraise=reraise
        )

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:261: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class '_pytest.runner.CallInfo'>, func = <function call_runtest_hook.<locals>.<lambda> at 0x7f0d312a2940>, when = 'call'
reraise = (<class '_pytest.outcomes.Exit'>, <class 'KeyboardInterrupt'>)

    @classmethod
    def from_call(
        cls,
        func: "Callable[[], TResult]",
        when: "Literal['collect', 'setup', 'call', 'teardown']",
        reraise: Optional[
            Union[Type[BaseException], Tuple[Type[BaseException], ...]]
        ] = None,
    ) -> "CallInfo[TResult]":
        """Call func, wrapping the result in a CallInfo.

        :param func:
            The function to call. Called without arguments.
        :param when:
            The phase in which the function is called.
        :param reraise:
            Exception or exceptions that shall propagate if raised by the
            function, instead of being wrapped in the CallInfo.
        """
        excinfo = None
        start = timing.time()
        precise_start = timing.perf_counter()
        try:
>           result: Optional[TResult] = func()

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:341: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>       lambda: ihook(item=item, **kwds), when=when, reraise=reraise
    )

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:262: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtest_call'>, kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = False

    def __call__(self, **kwargs: object) -> Any:
        assert (
            not self.is_historic()
        ), "Cannot directly call a historic hook - use call_historic instead."
        self._verify_all_args_are_provided(kwargs)
        firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
>       return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_runtest_call'
methods = [<HookImpl plugin_name='runner', plugin=<module '_pytest.runner' from '/opt/conda/envs/pytorch/lib/python3.9/site-pack...dule '_pytest.threadexception' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/threadexception.py'>>]
kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = False

    def _hookexec(
        self,
        hook_name: str,
        methods: Sequence[HookImpl],
        kwargs: Mapping[str, object],
        firstresult: bool,
    ) -> object | list[object]:
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook_name, methods, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = <Function test_activation[cuda-True-True-10-tanh-tanh]>

    def pytest_runtest_call(item: Item) -> None:
        _update_current_test_var(item, "call")
        try:
            del sys.last_type
            del sys.last_value
            del sys.last_traceback
        except AttributeError:
            pass
        try:
>           item.runtest()

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:169: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Function test_activation[cuda-True-True-10-tanh-tanh]>

    def runtest(self) -> None:
        """Execute the underlying test function."""
>       self.ihook.pytest_pyfunc_call(pyfuncitem=self)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/python.py:1792: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_pyfunc_call'>, kwargs = {'pyfuncitem': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = True

    def __call__(self, **kwargs: object) -> Any:
        assert (
            not self.is_historic()
        ), "Cannot directly call a historic hook - use call_historic instead."
        self._verify_all_args_are_provided(kwargs)
        firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
>       return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_pyfunc_call'
methods = [<HookImpl plugin_name='python', plugin=<module '_pytest.python' from '/opt/conda/envs/pytorch/lib/python3.9/site-pack...ugin=<module 'anyio.pytest_plugin' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages/anyio/pytest_plugin.py'>>]
kwargs = {'pyfuncitem': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = True

    def _hookexec(
        self,
        hook_name: str,
        methods: Sequence[HookImpl],
        kwargs: Mapping[str, object],
        firstresult: bool,
    ) -> object | list[object]:
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook_name, methods, kwargs, firstresult)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

pyfuncitem = <Function test_activation[cuda-True-True-10-tanh-tanh]>

    @hookimpl(trylast=True)
    def pytest_pyfunc_call(pyfuncitem: "Function") -> Optional[object]:
        testfunction = pyfuncitem.obj
        if is_async_function(testfunction):
            async_warn_and_skip(pyfuncitem.nodeid)
        funcargs = pyfuncitem.funcargs
        testargs = {arg: funcargs[arg] for arg in pyfuncitem._fixtureinfo.argnames}
>       result = testfunction(**testargs)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/python.py:194: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

f1 = <built-in method tanh of type object at 0x7f0dd1982040>, f2 = <function tanh at 0x7f0d312cbee0>, size = 10, quantize_backprop = True
device = 'cuda', custom_cuda = True

    @pytest.mark.parametrize("f1, f2", [
        # (torch.sigmoid, sigmoid),
        (torch.tanh,    tanh),
        (F.relu,        relu),
        (F.relu6,       relu6),
        (F.leaky_relu,  leaky_relu),
        (F.silu,        silu),
    ])
    @pytest.mark.parametrize("size", SIZE)
    @pytest.mark.parametrize("quantize_backprop", [True, False])
    @pytest.mark.parametrize("device, custom_cuda", DEVICE__CUSTOM_CUDA)
    def test_activation(f1, f2, size, quantize_backprop, device, custom_cuda):
        # mx specs. Use a large bitwidth since we're testing
        # algorithmic correctness, not precision
        mx_specs = apply_mx_specs(None)
        mx_specs['bfloat'] = 30
        mx_specs['quantize_backprop'] = quantize_backprop
        mx_specs['custom_cuda'] = custom_cuda
        kwargs = {'negative_slope': 0.4} if f2 is leaky_relu else {}

        # Create shared input for two networks
        m_ = np.random.randn(size)

        m1 = torch.tensor(m_, dtype=torch.float32, device=device, requires_grad=True)
        m2 = torch.tensor(m_, dtype=torch.float32, device=device, requires_grad=True)

        q1 = f1(m1, **kwargs)
        loss1 = (q1**2).sum()
        loss1.backward()
        torch.cuda.synchronize()

>       q2 = f2(m2, mx_specs=mx_specs, **kwargs)

test_activations.py:106: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

input = tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
name = None

    def tanh(input, mx_specs=None, name=None):
        mx_assert_test(mx_specs)
        if mx_specs is None:
            return torch.tanh(input)

        mx_specs = apply_mx_specs(mx_specs)
>       return TanhFunction.apply(input, mx_specs, name)

../activations.py:32: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'mx.activations.TanhFunction'>
args = (tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cu...d_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}, None)
kwargs = {}, bind_default_args = <function Function.apply.<locals>.bind_default_args at 0x7f0d312a2dc0>, is_setup_ctx_defined = False

    @classmethod
    def apply(cls, *args, **kwargs):
        def bind_default_args(func, *args, **kwargs):
            signature = inspect.signature(func)
            bound_args = signature.bind(*args, **kwargs)
            bound_args.apply_defaults()

            return bound_args.args

        is_setup_ctx_defined = cls.setup_context != _SingleLevelFunction.setup_context
        if is_setup_ctx_defined:
            args = bind_default_args(cls.forward, *args, **kwargs)

        if not torch._C._are_functorch_transforms_active():
            # See NOTE: [functorch vjp and autograd interaction]
            args = _functorch.utils.unwrap_dead_wrappers(args)
>           return super().apply(*args, **kwargs)  # type: ignore[misc]

/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:551: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ctx = <torch.autograd.function.TanhFunctionBackward object at 0x7f0d312a4740>
input = tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
name = None

    @staticmethod
    def forward(ctx, input, mx_specs=None, name=None):
        ctx.name = name

>       input       = vec_quantize(input, mx_specs=mx_specs)

../activations.py:257: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

input = tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
round = None

    def vec_quantize(input, mx_specs=None, round=None):
>       return quantize_elemwise_op(input, mx_specs=mx_specs,
                                    round=round)

../vector_ops.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
round = 'nearest'

    def quantize_elemwise_op(A, mx_specs, round=None):
        """A function used for element-wise quantization with mx_specs
        Arguments:
          A          {PyTorch tensor} -- a tensor that needs to be quantized
          mx_specs {dictionary}     -- dictionary to specify mx_specs
          round      {str}            -- Rounding mode, choose from (floor, nearest, even)
                                         (default: "nearest")
        Returns:
          quantized value {PyTorch tensor} -- a tensor that has been quantized
        """
        if mx_specs is None:
            return A
        elif round is None:
            round = mx_specs['round']

        if mx_specs['bfloat'] > 0 and mx_specs['fp'] > 0:
            raise ValueError("Cannot set both [bfloat] and [fp] in mx_specs.")
        elif mx_specs['bfloat'] > 9:
>           A = _quantize_bfloat(A, bfloat=mx_specs['bfloat'], round=round,
                                 custom_cuda=mx_specs['custom_cuda'],

../elemwise_ops.py:253: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cuda:0', requires_grad=True)
bfloat = 30, round = 'nearest', custom_cuda = True, allow_denorm = True

    def _quantize_bfloat(A, bfloat, round='nearest', custom_cuda=False, allow_denorm=True):
        """ Quantize values to bfloatX format
        Arguments:
          bfloat      {int}       -- Total number of bits for bfloatX format,
                                     Includes 1 sign, 8 exp bits, and variable
                                     mantissa bits. Must be >= 9.
        """
        # Shortcut for no quantization
        if bfloat == 0 or bfloat == 32:
            return A

        max_norm = _get_max_norm(8, bfloat-7)

>       return _quantize_elemwise_core(
                A, bits=bfloat-7, exp_bits=8, max_norm=max_norm, round=round,
                allow_denorm=allow_denorm, custom_cuda=custom_cuda)

../elemwise_ops.py:206: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([-0.0311, -0.0383,  1.2802,  1.1709,  0.1558,  0.3154, -0.3951,  0.4267,
        -0.8888,  0.4220], device='cuda:0', requires_grad=True)
bits = 23, exp_bits = 8, max_norm = 3.4028228579130005e+38, round = 'nearest', saturate_normals = False, allow_denorm = True, custom_cuda = True

    def _quantize_elemwise_core(A, bits, exp_bits, max_norm, round='nearest',
                                saturate_normals=False, allow_denorm=True,
                                custom_cuda=False):
        """ Core function used for element-wise quantization
        Arguments:
          A         {PyTorch tensor} -- A tensor to be quantized
          bits      {int}            -- Number of mantissa bits. Includes
                                        sign bit and implicit one for floats
          exp_bits  {int}            -- Number of exponent bits, 0 for ints
          max_norm  {float}          -- Largest representable normal number
          round     {str}            -- Rounding mode: (floor, nearest, even)
          saturate_normals {bool}    -- If True, normal numbers (i.e., not NaN/Inf)
                                        that exceed max norm are clamped.
                                        Must be True for correct MX conversion.
          allow_denorm     {bool}    -- If False, flush denorm numbers in the
                                        elem_format to zero.
          custom_cuda      {str}     -- If True, use custom CUDA kernels
        Returns:
          quantized tensor {PyTorch tensor} -- A tensor that has been quantized
        """
        A_is_sparse = A.is_sparse
        if A_is_sparse:
            if A.layout != torch.sparse_coo:
                raise NotImplementedError("Only COO layout sparse tensors are currently supported.")

            sparse_A = A.coalesce()
            A = sparse_A.values().clone()

        # custom cuda only support floor and nearest rounding modes
        custom_cuda = custom_cuda and round in RoundingMode.string_enums()

        if custom_cuda:
            A = A.contiguous()

>           from . import custom_extensions

../elemwise_ops.py:118: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """
    Copyright (c) Microsoft Corporation.
    Licensed under the MIT License.

    Python interface for custom CUDA implementations of functions.
    """

    import os
    from torch.utils.cpp_extension import load

    sources = [
        "funcs.cpp",
        "mx.cu",
        "elemwise.cu",
        "reduce.cu",
    ]
    file_dir = os.path.dirname(__file__)
    sources = [os.path.join(file_dir, "cpp", x) for x in sources]
>   funcs = load(name="custom_extensions", sources=sources)

../custom_extensions.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

name = 'custom_extensions'
sources = ['/home/ubuntu/microxcaling/mx/cpp/funcs.cpp', '/home/ubuntu/microxcaling/mx/cpp/mx.cu', '/home/ubuntu/microxcaling/mx/cpp/elemwise.cu', '/home/ubuntu/microxcaling/mx/cpp/reduce.cu']
extra_cflags = None, extra_cuda_cflags = None, extra_ldflags = None, extra_include_paths = None, build_directory = None, verbose = False
with_cuda = None, is_python_module = True, is_standalone = False, keep_intermediates = True

    def load(name,
             sources: Union[str, List[str]],
             extra_cflags=None,
             extra_cuda_cflags=None,
             extra_ldflags=None,
             extra_include_paths=None,
             build_directory=None,
             verbose=False,
             with_cuda: Optional[bool] = None,
             is_python_module=True,
             is_standalone=False,
             keep_intermediates=True):
        r'''
        Loads a PyTorch C++ extension just-in-time (JIT).

        To load an extension, a Ninja build file is emitted, which is used to
        compile the given sources into a dynamic library. This library is
        subsequently loaded into the current Python process as a module and
        returned from this function, ready for use.

        By default, the directory to which the build file is emitted and the
        resulting library compiled to is ``<tmp>/torch_extensions/<name>``, where
        ``<tmp>`` is the temporary folder on the current platform and ``<name>``
        the name of the extension. This location can be overridden in two ways.
        First, if the ``TORCH_EXTENSIONS_DIR`` environment variable is set, it
        replaces ``<tmp>/torch_extensions`` and all extensions will be compiled
        into subfolders of this directory. Second, if the ``build_directory``
        argument to this function is supplied, it overrides the entire path, i.e.
        the library will be compiled into that folder directly.

        To compile the sources, the default system compiler (``c++``) is used,
        which can be overridden by setting the ``CXX`` environment variable. To pass
        additional arguments to the compilation process, ``extra_cflags`` or
        ``extra_ldflags`` can be provided. For example, to compile your extension
        with optimizations, pass ``extra_cflags=['-O3']``. You can also use
        ``extra_cflags`` to pass further include directories.

        CUDA support with mixed compilation is provided. Simply pass CUDA source
        files (``.cu`` or ``.cuh``) along with other sources. Such files will be
        detected and compiled with nvcc rather than the C++ compiler. This includes
        passing the CUDA lib64 directory as a library directory, and linking
        ``cudart``. You can pass additional flags to nvcc via
        ``extra_cuda_cflags``, just like with ``extra_cflags`` for C++. Various
        heuristics for finding the CUDA install directory are used, which usually
        work fine. If not, setting the ``CUDA_HOME`` environment variable is the
        safest option.

        Args:
            name: The name of the extension to build. This MUST be the same as the
                name of the pybind11 module!
            sources: A list of relative or absolute paths to C++ source files.
            extra_cflags: optional list of compiler flags to forward to the build.
            extra_cuda_cflags: optional list of compiler flags to forward to nvcc
                when building CUDA sources.
            extra_ldflags: optional list of linker flags to forward to the build.
            extra_include_paths: optional list of include directories to forward
                to the build.
            build_directory: optional path to use as build workspace.
            verbose: If ``True``, turns on verbose logging of load steps.
            with_cuda: Determines whether CUDA headers and libraries are added to
                the build. If set to ``None`` (default), this value is
                automatically determined based on the existence of ``.cu`` or
                ``.cuh`` in ``sources``. Set it to `True`` to force CUDA headers
                and libraries to be included.
            is_python_module: If ``True`` (default), imports the produced shared
                library as a Python module. If ``False``, behavior depends on
                ``is_standalone``.
            is_standalone: If ``False`` (default) loads the constructed extension
                into the process as a plain dynamic library. If ``True``, build a
                standalone executable.

        Returns:
            If ``is_python_module`` is ``True``:
                Returns the loaded PyTorch extension as a Python module.

            If ``is_python_module`` is ``False`` and ``is_standalone`` is ``False``:
                Returns nothing. (The shared library is loaded into the process as
                a side effect.)

            If ``is_standalone`` is ``True``.
                Return the path to the executable. (On Windows, TORCH_LIB_PATH is
                added to the PATH environment variable as a side effect.)

        Example:
            >>> # xdoctest: +SKIP
            >>> from torch.utils.cpp_extension import load
            >>> module = load(
            ...     name='extension',
            ...     sources=['extension.cpp', 'extension_kernel.cu'],
            ...     extra_cflags=['-O2'],
            ...     verbose=True)
        '''
>       return _jit_compile(
            name,
            [sources] if isinstance(sources, str) else sources,
            extra_cflags,
            extra_cuda_cflags,
            extra_ldflags,
            extra_include_paths,
            build_directory or _get_build_directory(name, verbose),
            verbose,
            with_cuda,
            is_python_module,
            is_standalone,
            keep_intermediates=keep_intermediates)

/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1308: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

name = 'custom_extensions'
sources = ['/home/ubuntu/microxcaling/mx/cpp/funcs.cpp', '/home/ubuntu/microxcaling/mx/cpp/mx.cu', '/home/ubuntu/microxcaling/mx/cpp/elemwise.cu', '/home/ubuntu/microxcaling/mx/cpp/reduce.cu']
extra_cflags = None, extra_cuda_cflags = None, extra_ldflags = None, extra_include_paths = None
build_directory = '/home/ubuntu/.cache/torch_extensions/py39_cu121/custom_extensions', verbose = False, with_cuda = True, is_python_module = True
is_standalone = False, keep_intermediates = True

    def _jit_compile(name,
                     sources,
                     extra_cflags,
                     extra_cuda_cflags,
                     extra_ldflags,
                     extra_include_paths,
                     build_directory: str,
                     verbose: bool,
                     with_cuda: Optional[bool],
                     is_python_module,
                     is_standalone,
                     keep_intermediates=True) -> None:
        if is_python_module and is_standalone:
            raise ValueError("`is_python_module` and `is_standalone` are mutually exclusive.")

        if with_cuda is None:
            with_cuda = any(map(_is_cuda_file, sources))
        with_cudnn = any('cudnn' in f for f in extra_ldflags or [])
        old_version = JIT_EXTENSION_VERSIONER.get_version(name)
        version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
            name,
            sources,
            build_arguments=[extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths],
            build_directory=build_directory,
            with_cuda=with_cuda,
            is_python_module=is_python_module,
            is_standalone=is_standalone,
        )
        if version > 0:
            if version != old_version and verbose:
                print(f'The input conditions for extension module {name} have changed. ' +
                      f'Bumping to version {version} and re-building as {name}_v{version}...',
                      file=sys.stderr)
            name = f'{name}_v{version}'

        if version != old_version:
            baton = FileBaton(os.path.join(build_directory, 'lock'))
            if baton.try_acquire():
                try:
                    with GeneratedFileCleaner(keep_intermediates=keep_intermediates) as clean_ctx:
                        if IS_HIP_EXTENSION and (with_cuda or with_cudnn):
                            hipify_result = hipify_python.hipify(
                                project_directory=build_directory,
                                output_directory=build_directory,
                                header_include_dirs=(extra_include_paths if extra_include_paths is not None else []),
                                extra_files=[os.path.abspath(s) for s in sources],
                                ignores=[_join_rocm_home('*'), os.path.join(_TORCH_PATH, '*')],  # no need to hipify ROCm or PyTorch headers
                                show_detailed=verbose,
                                show_progress=verbose,
                                is_pytorch_extension=True,
                                clean_ctx=clean_ctx
                            )

                            hipified_sources = set()
                            for source in sources:
                                s_abs = os.path.abspath(source)
                                hipified_sources.add(hipify_result[s_abs].hipified_path if s_abs in hipify_result else s_abs)

                            sources = list(hipified_sources)

                        _write_ninja_file_and_build_library(
                            name=name,
                            sources=sources,
                            extra_cflags=extra_cflags or [],
                            extra_cuda_cflags=extra_cuda_cflags or [],
                            extra_ldflags=extra_ldflags or [],
                            extra_include_paths=extra_include_paths or [],
                            build_directory=build_directory,
                            verbose=verbose,
                            with_cuda=with_cuda,
                            is_standalone=is_standalone)
                finally:
                    baton.release()
            else:
>               baton.wait()

/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1724: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.utils.file_baton.FileBaton object at 0x7f0c763a4b20>

    def wait(self):
        '''
        Periodically sleeps for a certain amount until the baton is released.

        The amount of time slept depends on the ``wait_seconds`` parameter
        passed to the constructor.
        '''
        while os.path.exists(self.lock_file_path):
>           time.sleep(self.wait_seconds)
E           KeyboardInterrupt

/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py:42: KeyboardInterrupt
============================================================ 40 passed, 3 warnings in 13.45s
lessw2020 commented 1 year ago

Just in case, here's the ninja version on both servers: Requirement already satisfied: ninja in /opt/conda/envs/pytorch/lib/python3.9/site-packages (1.11.1)

rizhao-msft commented 1 year ago

The multi-GPU failure is inside PyTorch when it tries to build the custom extensions. Are multiple processes trying to run the test simultaneously? It looks like multiple processes are trying to build the CUDA code at once and there's a race condition. I don't think PyTorch supports multiple processes calling cpp_extension.load() to build the CUDA code simultaneously. For thread safety in our code, we've always had one process execute the following to build the CUDA code first:

if torch.distributed.get_rank() == 0:
    import mx.custom_extensions

torch.distributed.barrier()

If that's the problem, I don't think there's a solution to the tests hanging in a multi-GPU environment. I will document the thread-safety issue above.
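A slightly fuller sketch of that guard, assuming the process group has already been initialized (e.g. via torchrun); a single-process run falls through to a plain import:

import torch.distributed as dist

# Only rank 0 JIT-builds the CUDA extension; the other ranks wait at the
# barrier until the build has finished.
if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
    import mx.custom_extensions
if dist.is_available() and dist.is_initialized():
    dist.barrier()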

lessw2020 commented 1 year ago

I'm not running anything related to torch.distributed here (so without it there is no concept of a rank to do the check you mentioned above), and only GPU 0 should be active. I'm just firing the test, which should only activate gpu:0 regardless of how many GPUs are present. Wondering if there's a need to spec how items are moving to cuda.

rizhao-msft commented 1 year ago

Wondering if there's a need to spec how items are moving to cuda.

Do you mean how tensors are moving to the GPU?

In the multi-GPU setup, if you do: import mx.custom_extensions

Does it hang?

lessw2020 commented 1 year ago

Hi @rizhao-msft - Yes, simply importing will generate the hang:

(screenshot: import hang)

Gave it about 5 mins to make sure... it's the exact same deadlock:

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/microxcaling/examples/../mx/custom_extensions.py", line 19, in <module>
    funcs = load(name="custom_extensions", sources=sources)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1724, in _jit_compile
    baton.wait()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py", line 42, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
rizhao-msft commented 1 year ago

Interesting, we've tested on multi-GPU machines before and never had this issue. The hang can happen when using torch.distributed to do multi-GPU training, but never from just importing it in a single process.

I'll take a look to see if I can reproduce.

lessw2020 commented 1 year ago

Cool - for reference, I verified that the same simple import above has no issue on a single-GPU machine (maybe a 30-second delay and then it's ready).

mgolub2-ms commented 1 year ago

Hi @lessw2020

A couple of questions - what PyTorch version are you using? Can you try 2.1.0?

Does it still happen when you set CUDA_VISIBLE_DEVICES=0 before the command? Or if you use PYTHONPATH=PATH_TO_MX_FOLDER?

If you paste the output of pip list --format=freeze, that might help us track down what is happening too.
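For the plain import check, a small sketch of how those settings can be applied in-process (the path below is a placeholder, and CUDA_VISIBLE_DEVICES must be set before CUDA is initialized):

import os, sys

os.environ["CUDA_VISIBLE_DEVICES"] = "0"     # expose only GPU 0 to this process
sys.path.insert(0, "/path/to/microxcaling")  # placeholder for PYTHONPATH=PATH_TO_MX_FOLDER

import mx.custom_extensions                  # triggers the JIT extension build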

mgolub2-ms commented 1 year ago

Wondering if there's a need to spec how items are moving to cuda.

The underlying tensors are still normal PyTorch tensors, so nothing special should be needed there.

lessw2020 commented 1 year ago

Hi @mgolub2-ms: I'm running with the latest PyTorch nightlies (1106). Installing 2.1.0 now to see if that fixes it.

lessw2020 commented 1 year ago

The deadlock is the same with 2.1.0 (a bit relieved that's the case... going back to nightlies).

>>> import mx.custom_extensions
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/microxcaling/examples/../mx/custom_extensions.py", line 19, in <module>
    funcs = load(name="custom_extensions", sources=sources)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1724, in _jit_compile
    baton.wait()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py", line 42, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
lessw2020 commented 1 year ago

Does it still happen when you set CUDA_VISIBLE_DEVICES=0 before the command?

Yes, no change, same deadlock.

lessw2020 commented 1 year ago

Here's pip freeze for reference on this server. But I can repro this on multiple servers (A10 on EC2, A100 on premises), so I don't think there's anything specific here. You can spin up a $5/hr G5 A10 (4 GPU) instance on EC2 and repro this, which might be helpful at this point since you could then investigate directly. One common aspect is that I'm running CUDA 12.1 on all servers. I also have the Triton nightly build (OpenAI Triton) installed on both servers... not sure if either could be an issue/delta, but if you are running only CUDA 11.8, perhaps that might explain the difference?

Package                            Version                 Editable project location
---------------------------------- ----------------------- --------------------------------
absl-py                            1.4.0
accelerate                         0.24.1
aiohttp                            3.8.4
aiosignal                          1.3.1
aniso8601                          9.0.1
ansi2html                          1.8.0
anyio                              3.6.2
appdirs                            1.4.4
argon2-cffi                        21.3.0
argon2-cffi-bindings               21.2.0
arrow                              1.2.3
asttokens                          2.2.1
async-timeout                      4.0.2
attrs                              23.1.0
awscli                             1.22.101
Babel                              2.12.1
backcall                           0.2.0
backports.functools-lru-cache      1.6.4
beautifulsoup4                     4.12.2
bitsandbytes                       0.41.0
black                              22.3.0
bleach                             6.0.0
blinker                            1.6.2
blis                               0.7.9
blobfile                           2.0.2
bokeh                              2.4.3
boto3                              1.26.142
botocore                           1.29.142
Brotli                             1.0.9
brotlipy                           0.7.0
cachetools                         5.3.1
captum                             0.5.0
catalogue                          2.0.8
cbor2                              5.4.6
certifi                            2023.5.7
cffi                               1.15.1
cfgv                               3.4.0
charset-normalizer                 3.1.0
click                              8.1.3
cloudpickle                        2.2.1
cmake                              3.25.0
colorama                           0.4.3
comm                               0.1.3
commonmark                         0.9.1
confection                         0.0.4
contextlib2                        21.6.0
coverage                           7.3.0
cryptography                       40.0.2
cycler                             0.11.0
cymem                              2.0.7
DALL-E                             0.1
dataclasses                        0.8
datasets                           2.13.1
debugpy                            1.6.7
decorator                          5.1.1
deepspeed                          0.9.5
defusedxml                         0.7.1
diffusion                          6.9.1
diffusion-core                     0.0.28
dill                               0.3.6
distlib                            0.3.7
docker-pycreds                     0.4.0
docutils                           0.15.2
dparse                             0.6.2
e                                  1.4.5
einops                             0.6.1
entrypoints                        0.4
evaluate                           0.4.0
exceptiongroup                     1.1.2
executing                          1.2.0
fairscale                          0.4.5
fastai                             2.1.10
fastcore                           1.5.29
fastjsonschema                     2.17.1
fastprogress                       1.0.3
filelock                           3.12.2
flake8                             4.0.1
flake8-bugbear                     22.4.25
flake8-polyfill                    1.0.2
Flask                              2.3.2
Flask-RESTful                      0.3.10
flit_core                          3.9.0
fonttools                          4.39.4
frozenlist                         1.3.3
fsspec                             2023.5.0
future                             0.18.3
gekko                              1.0.6
gitdb                              4.0.10
GitPython                          3.1.31
google-auth                        2.22.0
google-auth-oauthlib               1.0.0
google-pasta                       0.2.0
grpcio                             1.56.0
gym                                0.26.2
gym-notices                        0.0.8
h5py                               3.6.0
hjson                              3.1.0
horovod                            0.28.0
huggingface-hub                    0.17.3
identify                           2.5.26
idna                               3.4
imageio                            2.16.2
importlib-metadata                 4.13.0
importlib-resources                5.12.0
inflate64                          0.3.1
iniconfig                          2.0.0
install                            1.3.5
iopath                             0.1.10
ipykernel                          6.23.1
ipython                            8.13.2
ipython-genutils                   0.2.0
ipywidgets                         8.0.6
isort                              5.12.0
itsdangerous                       2.1.2
jedi                               0.18.2
Jinja2                             3.1.2
jmespath                           1.0.1
joblib                             1.2.0
json5                              0.9.5
jsonschema                         4.17.3
jupyter_client                     8.2.0
jupyter_core                       5.3.0
jupyter-server                     1.23.6
jupyterlab                         3.3.4
jupyterlab-pygments                0.2.2
jupyterlab_server                  2.22.1
jupyterlab-widgets                 3.0.7
kiwisolver                         1.4.4
langcodes                          3.3.0
libcst                             1.0.1
lit                                15.0.7
llvmlite                           0.40.0
lxml                               4.9.3
Markdown                           3.4.3
MarkupSafe                         2.1.2
matplotlib                         3.5.3
matplotlib-inline                  0.1.6
mccabe                             0.6.1
memory-efficient-attention-pytorch 0.1.6
mistune                            2.0.5
moreorless                         0.4.0
mpmath                             1.2.1
multidict                          6.0.4
multiprocess                       0.70.14
multivolumefile                    0.2.3
munkres                            1.1.4
murmurhash                         1.0.9
mypy                               1.0.1
mypy-extensions                    1.0.0
nbclassic                          0.5.6
nbclient                           0.8.0
nbconvert                          7.4.0
nbformat                           5.8.0
nest-asyncio                       1.5.6
networkx                           3.0
ninja                              1.11.1
nltk                               3.8.1
nodeenv                            1.8.0
notebook                           6.4.12
notebook_shim                      0.2.3
numba                              0.57.0
numpy                              1.24.3
nvgpu                              0.9.0
nvidia-cublas-cu11                 11.10.3.66
nvidia-cublas-cu12                 12.1.3.1
nvidia-cuda-cupti-cu11             11.7.101
nvidia-cuda-cupti-cu12             12.1.105
nvidia-cuda-nvrtc-cu11             11.7.99
nvidia-cuda-nvrtc-cu12             12.1.105
nvidia-cuda-runtime-cu11           11.7.99
nvidia-cuda-runtime-cu12           12.1.105
nvidia-cudnn-cu11                  8.5.0.96
nvidia-cudnn-cu12                  8.9.2.26
nvidia-cufft-cu11                  10.9.0.58
nvidia-cufft-cu12                  11.0.2.54
nvidia-curand-cu11                 10.2.10.91
nvidia-curand-cu12                 10.3.2.106
nvidia-cusolver-cu11               11.4.0.1
nvidia-cusolver-cu12               11.4.5.107
nvidia-cusparse-cu11               11.7.4.91
nvidia-cusparse-cu12               12.1.0.106
nvidia-nccl-cu11                   2.14.3
nvidia-nccl-cu12                   2.18.1
nvidia-nvjitlink-cu12              12.3.52
nvidia-nvtx-cu11                   11.7.91
nvidia-nvtx-cu12                   12.1.105
oauthlib                           3.2.2
opt-einsum                         3.3.0
packaging                          23.1
pandas                             1.4.4
pandocfilters                      1.5.0
parso                              0.8.3
pathos                             0.3.0
pathspec                           0.11.1
pathtools                          0.1.2
pathy                              0.10.1
patsy                              0.5.3
peft                               0.6.0
pep8-naming                        0.12.1
pexpect                            4.8.0
pickleshare                        0.7.5
Pillow                             9.5.0
pip                                22.3.1
pkgutil_resolve_name               1.3.10
platformdirs                       3.10.0
pluggy                             1.2.0
ply                                3.11
portalocker                        2.7.0
pox                                0.3.2
ppft                               1.7.6.6
pre-commit                         3.3.3
pre-commit-hooks                   4.4.0
preshed                            3.0.8
prometheus-client                  0.17.0
prompt-toolkit                     3.0.38
protobuf                           3.20.2
protobuf3-to-dict                  0.1.5
psutil                             5.9.5
ptyprocess                         0.7.0
pure-eval                          0.2.2
py-cpuinfo                         9.0.0
py7zr                              0.20.5
pyarrow                            12.0.0
pyasn1                             0.4.8
pyasn1-modules                     0.3.0
pybcj                              1.0.1
pybind11                           2.9.2
pybind11-global                    2.9.2
pycodestyle                        2.8.0
pycparser                          2.21
pycryptodomex                      3.18.0
pydantic                           1.10.8
pyflakes                           2.4.0
pyfunctional                       1.4.3
pygame                             2.1.3
Pygments                           2.15.1
pynvml                             11.5.0
pyOpenSSL                          23.1.1
pyparsing                          3.0.9
pyppmd                             1.0.0
PyQt5                              5.15.7
PyQt5-sip                          12.11.0
pyre-extensions                    0.0.29
pyrsistent                         0.19.3
PySocks                            1.7.1
pytesseract                        0.3.10
pytest                             7.4.2
pytest-cov                         4.1.0
pytest-mock                        3.8.2
python-dateutil                    2.8.2
pytorch-triton                     2.1.0+6e4932cda8
pytz                               2023.3
PyYAML                             6.0
pyzmq                              25.0.2
pyzstd                             0.15.9
regex                              2023.5.5
requests                           2.31.0
requests-oauthlib                  1.3.1
responses                          0.18.0
rich                               12.6.0
rouge                              1.0.1
rouge-score                        0.1.2
rsa                                4.7.2
ruamel.yaml                        0.17.28
ruamel.yaml.clib                   0.2.7
s3fs                               0.4.2
s3transfer                         0.6.1
sacremoses                         0.0.53
safetensors                        0.3.1
sagemaker                          2.159.0
schema                             0.7.5
scikit-learn                       1.0
scipy                              1.11.3
seaborn                            0.12.2
Send2Trash                         1.8.2
sentencepiece                      0.1.99
sentry-sdk                         1.26.0
setproctitle                       1.3.2
setuptools                         67.7.2
shap                               0.40.0
shellingham                        1.5.1
sip                                6.7.9
six                                1.16.0
slicer                             0.0.7
smart-open                         5.2.1
smclarify                          0.5
smdebug-rulesconfig                1.0.1
smmap                              5.0.0
sniffio                            1.3.0
soupsieve                          2.3.2.post1
spacy                              3.5.3
spacy-legacy                       3.0.12
spacy-loggers                      1.0.4
srsly                              2.4.6
stack-data                         0.6.2
statsmodels                        0.14.0
stdlibs                            2022.10.9
stringcase                         1.2.0
structlog                          21.5.0
sympy                              1.11.1
tabulate                           0.9.0
tblib                              1.7.0
tensorboard                        2.13.0
tensorboard-data-server            0.7.1
termcolor                          2.3.0
terminado                          0.17.1
texttable                          1.6.7
thinc                              8.1.10
threadpoolctl                      3.1.0
tiktoken                           0.4.0
timm                               0.9.2
tinycss2                           1.2.1
tokenizers                         0.14.1
toml                               0.10.2
tomli                              2.0.1
tomlkit                            0.12.1
torch                              2.1.0
torch-model-archiver               0.5.3b20220226
torch-workflow-archiver            0.2.8b20230512
torchaudio                         2.1.0
torchmultimodal                    0.1.0b0                 /home/ubuntu/PyTorch_MultiModal
torchserve                         0.6.0b20220513
torchtext                          0.14.1
torchvision                        0.16.0
tornado                            6.3.2
tqdm                               4.63.2
trailrunner                        1.4.0
traitlets                          5.9.0
transformer_nuggets                0.0.1                   /home/ubuntu/transformer_nuggets
transformers                       4.35.0
triton                             2.1.0
triton-nightly                     2.1.0.dev20231012235740
typer                              0.7.0
typing_extensions                  4.6.2
typing-inspect                     0.9.0
ufmt                               1.3.0
unicodedata2                       15.0.0
urllib3                            1.26.15
usort                              1.0.2
virtualenv                         20.24.3
vit-pytorch                        1.2.2
wandb                              0.15.4
wasabi                             1.1.1
wcwidth                            0.2.6
webencodings                       0.5.1
websocket-client                   1.5.2
Werkzeug                           2.3.4
wget                               3.2
wheel                              0.40.0
widgetsnbextension                 4.0.7
xformers                           0.0.20
xxhash                             3.2.0
yarl                               1.9.2
zipp                               3.15.0
mgolub2-ms commented 1 year ago

> Here's pip freeze from this server for reference, but I can repro this on multiple servers (A10 on EC2, A100 on premise), so I don't think there's anything machine-specific here. You can spin up a $5/hr G5 A10 (4 GPU) instance on EC2 and repro it yourself, which might be helpful at this point so you could investigate directly. One common aspect is that I'm running CUDA 12.1 on all servers. I also have the triton nightly build (OpenAI Triton) installed on both servers...not sure if either of those could be the delta, but if you are only running CUDA 11.8 then perhaps that would explain the difference?

@rizhao-msft Spun up a multi-GPU (2x) Azure VM and was unable to reproduce the hang - we are continuing to investigate.

mgolub2-ms commented 1 year ago

Hmm, this is odd - I've been unable to reproduce the issue either, using CUDA 12.1 and torch 2.1.0 on a 4-GPU A100 Azure VM.

I tried configuring accelerate and installing triton/pytorch-triton; no issues running the tests or the examples. I've also tried setting the path using both sys.path.append and PYTHONPATH to test whether there was an issue there - both work for me.

One thing I noticed that is a bit odd - you have some CUDA 11 packages and some CUDA 12 packages:

nvidia-cublas-cu11                 11.10.3.66
nvidia-cublas-cu12                 12.1.3.1
nvidia-cuda-cupti-cu11             11.7.101
nvidia-cuda-cupti-cu12             12.1.105
nvidia-cuda-nvrtc-cu11             11.7.99
nvidia-cuda-nvrtc-cu12             12.1.105
nvidia-cuda-runtime-cu11           11.7.99
nvidia-cuda-runtime-cu12           12.1.105
nvidia-cudnn-cu11                  8.5.0.96
nvidia-cudnn-cu12                  8.9.2.26
nvidia-cufft-cu11                  10.9.0.58
nvidia-cufft-cu12                  11.0.2.54
nvidia-curand-cu11                 10.2.10.91
nvidia-curand-cu12                 10.3.2.106
nvidia-cusolver-cu11               11.4.0.1
nvidia-cusolver-cu12               11.4.5.107
nvidia-cusparse-cu11               11.7.4.91
nvidia-cusparse-cu12               12.1.0.106
nvidia-nccl-cu11                   2.14.3
nvidia-nccl-cu12                   2.18.1
nvidia-nvjitlink-cu12              12.3.52
nvidia-nvtx-cu11                   11.7.91
nvidia-nvtx-cu12                   12.1.105

For comparison, my environment only has cuda 12 packages:

nvidia-cublas-cu12                      12.1.3.1
nvidia-cuda-cupti-cu12                  12.1.105
nvidia-cuda-nvrtc-cu12                  12.1.105
nvidia-cuda-runtime-cu12                12.1.105
nvidia-cudnn-cu12                       8.9.2.26
nvidia-cufft-cu12                       11.0.2.54
nvidia-curand-cu12                      10.3.2.106
nvidia-cusolver-cu12                    11.4.5.107
nvidia-cusparse-cu12                    12.1.0.106
nvidia-ml-py                            11.525.112
nvidia-ml-py3                           7.352.0
nvidia-nccl-cu12                        2.18.1
nvidia-nvjitlink-cu12                   12.3.52
nvidia-nvtx-cu12                        12.1.105

Is it possible to try an environment with a single set of CUDA packages? Or even a brand new virtualenv, installing the requirements with pip install -r requirements.txt?

Not sure why multi-GPU would be affected by that vs. a single GPU. It's really interesting that CUDA_VISIBLE_DEVICES does not help - to my knowledge, that basically emulates a single-GPU system for any command.
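
If it helps narrow this down, here is a hypothetical stdlib-only check (not part of the repo) to see at a glance whether an environment carries a mixed set of CUDA wheels:

# List every installed nvidia-* wheel; a clean environment should show only
# one CUDA major version (all -cu11 or all -cu12 suffixes).
from importlib.metadata import distributions

nvidia = sorted(
    (dist.metadata["Name"], dist.version)
    for dist in distributions()
    if (dist.metadata["Name"] or "").startswith("nvidia-")
)
for name, version in nvidia:
    print(f"{name:30s} {version}")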

lessw2020 commented 1 year ago

Thanks for the updates @rizhao-msft and @mgolub2-ms. I'll spin up a brand new EC2 instance this afternoon to help isolate this.
I believe the mixed CUDA (11 and 12) packages come from the starting AWS AMI (i.e. they bundle them in), not something I overtly installed. Either way, starting with a brand new image will help further reduce the search space.

lessw2020 commented 1 year ago

Hi @rizhao-msft and @mgolub2-ms
Some good news here: a clean new machine works great (with PyTorch 2.1.0). I'll upgrade to nightlies next, but at least for now the hang issue is localized to something in my configs.

Here's the pip freeze from the new AWS instance after pip install -r requirements.txt, so there's a snapshot to diff against:

>>> import sys
>>> sys.path.append('..')
>>> import mx.custom_extensions
>>> exit()
(pytorch) ubuntu@ip-172-31-27-89:~/microxcaling/mx$ pip list format=freeze

Package Version
aniso8601 9.0.1
annotated-types 0.6.0
ansi2html 1.8.0
anyio 4.0.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
attrs 23.1.0
awscli 1.29.76
Babel 2.13.1
backports.functools-lru-cache 1.6.5
beautifulsoup4 4.12.2
bleach 6.1.0
blinker 1.7.0
bokeh 3.3.0
boto3 1.28.76
botocore 1.31.76
Brotli 1.1.0
cached-property 1.5.2
captum 0.6.0
certifi 2023.7.22
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 2.2.1
colorama 0.4.4
comm 0.1.4
contextlib2 21.6.0
contourpy 1.1.1
cryptography 41.0.5
cycler 0.12.1
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
docutils 0.15.2
dparse 0.6.3
entrypoints 0.4
exceptiongroup 1.1.3
executing 2.0.1
fastjsonschema 2.18.1
filelock 3.13.1
Flask 3.0.0
Flask-RESTful 0.3.10
fonttools 4.43.1
fqdn 1.5.1
fsspec 2023.10.0
gmpy2 2.1.2
google-pasta 0.2.0
gym 0.26.2
gym-notices 0.0.8
idna 3.4
imageio 2.31.5
importlib-metadata 6.8.0
importlib-resources 6.1.0
iniconfig 2.0.0
ipykernel 6.26.0
ipython 8.17.2
ipython-genutils 0.2.0
ipywidgets 8.1.1
isoduration 20.11.0
itsdangerous 2.1.2
jedi 0.19.1
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
json5 0.9.14
jsonpointer 2.4
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 8.5.0
jupyter_core 5.5.0
jupyter-events 0.8.0
jupyter-lsp 2.2.0
jupyter_server 2.9.1
jupyter_server_terminals 0.4.4
jupyterlab 4.0.8
jupyterlab-pygments 0.2.2
jupyterlab_server 2.25.0
jupyterlab-widgets 3.0.9
kiwisolver 1.4.5
llvmlite 0.41.1
MarkupSafe 2.1.3
matplotlib 3.8.1
matplotlib-inline 0.1.6
mistune 3.0.2
mpmath 1.3.0
multiprocess 0.70.15
munkres 1.1.4
nbclassic 1.0.0
nbclient 0.8.0
nbconvert 7.10.0
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 3.2.1
ninja 1.11.1.1
notebook 6.5.4
notebook_shim 0.2.3
numba 0.58.1
numpy 1.26.0
nvgpu 0.10.0
overrides 7.4.0
packaging 21.3
pandas 2.1.2
pandocfilters 1.5.0
parso 0.8.3
pathos 0.3.1
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.11.0
pluggy 1.3.0
pox 0.3.3
ppft 1.7.6.7
prometheus-client 0.18.0
prompt-toolkit 3.0.39
protobuf 4.25.0
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.0
pyasn1 0.5.0
pybind11 2.11.1
pybind11-global 2.11.1
pycparser 2.21
pydantic 2.4.2
pydantic_core 2.10.1
pyfunctional 1.4.3
pygame 2.5.2
Pygments 2.16.1
pynvml 11.5.0
pyparsing 3.1.1
PySocks 1.7.1
pytest 7.4.3
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
referencing 0.30.2
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.10.6
rsa 4.7.2
ruamel.yaml 0.18.4
ruamel.yaml.clib 0.2.8
s3fs 0.4.2
s3transfer 0.7.0
sagemaker 2.196.0
schema 0.7.5
scikit-learn 1.3.2
scipy 1.11.3
seaborn 0.13.0
Send2Trash 1.8.2
setuptools 68.2.2
shap 0.43.0
six 1.16.0
slicer 0.0.7
smclarify 0.5
smdebug-rulesconfig 1.0.1
sniffio 1.3.0
soupsieve 2.5
stack-data 0.6.2
statsmodels 0.14.0
sympy 1.12
tabulate 0.9.0
tblib 1.7.0
termcolor 2.3.0
terminado 0.17.1
threadpoolctl 3.2.0
tinycss2 1.2.1
tomli 2.0.1
torch 2.1.0
torch-model-archiver 0.7.1b20230208
torch-workflow-archiver 0.2.11b20231012
torchaudio 2.1.0
torchdata 0.7.0
torchserve 0.9.0b20231012
torchtext 0.16.0
torchvision 0.16.0
tornado 6.3.3
tqdm 4.66.1
traitlets 5.9.0
triton 2.1.0
types-python-dateutil 2.8.19.14
typing_extensions 4.8.0
typing-utils 0.1.0
tzdata 2023.3
unicodedata2 15.1.0
uri-template 1.3.0
urllib3 1.26.18
wcwidth 0.2.9
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.4
Werkzeug 3.0.1
wheel 0.41.3
widgetsnbextension 4.0.9
xyzservices 2023.10.1
zipp 3.17.0

lessw2020 commented 1 year ago

OK, working great with the latest PT nightly (1108) as well. I also don't find any of the loose nvidia CUDA packages on this new machine, so it seems that having the mixed CUDA 11/12 set may have been the issue (those may have come from building PyTorch from source). Anyway, at least for EC2, this issue is resolved. I'll set up a new env on the on-prem machine and see if that resolves it there as well.

rizhao-msft commented 1 year ago

Good! We also have an alternative way to build the CUDA extensions on the branch dev/rizhao/prebuilt_extensions. You can switch to that branch, go to mx/cpp, and run python setup.py install. This will install the CUDA extension as a Python package and avoid the load() call that was hanging.

This could help if you run into the problem again.
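
One quick way to sanity-check that the prebuilt package is the one being picked up (a rough sketch - exact timings will vary):

# After installing from mx/cpp, importing the extension module should return
# almost immediately instead of spending up to a minute in the JIT build path.
import time

start = time.time()
import mx.custom_extensions  # noqa: F401
print(f"mx.custom_extensions imported in {time.time() - start:.2f}s")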

lessw2020 commented 1 year ago

That sounds great re: the prebuilt extensions - I'll switch to that branch. I'll start converting a toy model tomorrow so I can actively run with MX FP6 now that I'm past the hang. I'll go ahead and close this issue since it's clearly not the generic problem I initially assumed it was from seeing it on 2 different servers. Thanks for the help getting past the hanging issue!