It looks like the first GPU unit test that uses the custom CUDA code is hanging. The first such test usually takes about a minute because it builds the CUDA code as a PyTorch extension, so if it hangs for much longer than that, it could be an error building that code.
Try pressing Ctrl+C when it hangs, then scroll all the way up (there will be a lot of pytest messages, but the first message will contain the CUDA build error if it is a build issue). There will likely be a bunch of CUDA warnings too.
Can you make sure you have the ninja package installed?
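If it's easier, you can also check both of these outside pytest. A minimal sketch (the source file list and the cpp directory are the ones mx/custom_extensions.py uses; the repo path is an assumption, adjust it to your checkout):

# Sketch: confirm ninja is visible to PyTorch, then trigger the extension build
# directly with verbose logging so any compiler error prints immediately.
import os
from torch.utils.cpp_extension import is_ninja_available, load

print("ninja available:", is_ninja_available())

repo = os.path.expanduser("~/microxcaling")  # assumed checkout location
sources = [os.path.join(repo, "mx", "cpp", f)
           for f in ("funcs.cpp", "mx.cu", "elemwise.cu", "reduce.cu")]
funcs = load(name="custom_extensions", sources=sources, verbose=True)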
Hi @rizhao-msft - thanks for the info above.
I found that running with a single GPU (A100), the tests all pass with just some warnings about pkg namespaces:
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import packaging # type: ignore[attr-defined]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.cloud')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2350: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(parent)
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
test_activations.py::test_gelu[cuda-True-False-10-False]
/data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 186 passed, 16 warnings in 33.01s
I do have ninja installed on both servers. Running with 4 or 8 GPUs, I get the hang. Using the Ctrl+C break you mentioned results in this trace:
(pytorch) ubuntu@ip-172-31-66-198:~/microxcaling/mx/tests$ pytest test_activations.py -vs --full-trace
================================================================== test session starts ==================================================================
platform linux -- Python 3.9.16, pytest-7.4.2, pluggy-1.2.0 -- /opt/conda/envs/pytorch/bin/python3.9
cachedir: .pytest_cache
rootdir: /home/ubuntu/microxcaling/mx/tests
plugins: anyio-3.6.2, mock-3.8.2, cov-4.1.0
collected 144 items
test_activations.py::test_activation[cpu-False-True-10-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-True-10-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-True-10-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-True-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-True-10-silu-silu] PASSED
test_activations.py::test_activation[cpu-False-True-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-True-10000-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-True-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-True-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-True-10000-silu-silu] PASSED
test_activations.py::test_activation[cpu-False-False-10-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-False-10-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-False-10-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-False-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-False-10-silu-silu] PASSED
test_activations.py::test_activation[cpu-False-False-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cpu-False-False-10000-relu-relu] PASSED
test_activations.py::test_activation[cpu-False-False-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cpu-False-False-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cpu-False-False-10000-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-True-10-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-True-10-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-True-10-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-True-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-True-10-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-True-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-True-10000-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-True-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-True-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-True-10000-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-False-10-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-False-10-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-False-10-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-False-10-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-False-10-silu-silu] PASSED
test_activations.py::test_activation[cuda-False-False-10000-tanh-tanh] PASSED
test_activations.py::test_activation[cuda-False-False-10000-relu-relu] PASSED
test_activations.py::test_activation[cuda-False-False-10000-relu6-relu6] PASSED
test_activations.py::test_activation[cuda-False-False-10000-leaky_relu-leaky_relu] PASSED
test_activations.py::test_activation[cuda-False-False-10000-silu-silu] PASSED
test_activations.py::test_activation[cuda-True-True-10-tanh-tanh] ^C
=================================================================== warnings summary ====================================================================
test_activations.py::test_activation[cuda-True-True-10-tanh-tanh]
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
test_activations.py::test_activation[cuda-True-True-10-tanh-tanh]
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
test_activations.py::test_activation[cuda-True-True-10-tanh-tanh]
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
config = <_pytest.config.Config object at 0x7f0dd3a880d0>, doit = <function _main at 0x7f0dd340bca0>
def wrap_session(
config: Config, doit: Callable[[Config, "Session"], Optional[Union[int, ExitCode]]]
) -> Union[int, ExitCode]:
"""Skeleton command line program."""
session = Session.from_config(config)
session.exitstatus = ExitCode.OK
initstate = 0
try:
try:
config._do_configure()
initstate = 1
config.hook.pytest_sessionstart(session=session)
initstate = 2
> session.exitstatus = doit(config, session) or 0
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/main.py:271:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
config = <_pytest.config.Config object at 0x7f0dd3a880d0>, session = <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>
def _main(config: Config, session: "Session") -> Optional[Union[int, ExitCode]]:
"""Default command line protocol for initialization, session,
running tests and reporting."""
config.hook.pytest_collection(session=session)
> config.hook.pytest_runtestloop(session=session)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/main.py:325:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_HookCaller 'pytest_runtestloop'>, kwargs = {'session': <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>}
firstresult = True
def __call__(self, **kwargs: object) -> Any:
assert (
not self.is_historic()
), "Cannot directly call a historic hook - use call_historic instead."
self._verify_all_args_are_provided(kwargs)
firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
> return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_runtestloop'
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages...t/main.py'>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f0dd2fd4070>>]
kwargs = {'session': <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>}, firstresult = True
def _hookexec(
self,
hook_name: str,
methods: Sequence[HookImpl],
kwargs: Mapping[str, object],
firstresult: bool,
) -> object | list[object]:
# called from all hookcaller instances.
# enable_tracing will set its own wrapping function at self._inner_hookexec
> return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
session = <Session tests exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=144>
def pytest_runtestloop(session: "Session") -> bool:
if session.testsfailed and not session.config.option.continue_on_collection_errors:
raise session.Interrupted(
"%d error%s during collection"
% (session.testsfailed, "s" if session.testsfailed != 1 else "")
)
if session.config.option.collectonly:
return True
for i, item in enumerate(session.items):
nextitem = session.items[i + 1] if i + 1 < len(session.items) else None
> item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/main.py:350:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_HookCaller 'pytest_runtest_protocol'>
kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>, 'nextitem': <Function test_activation[cuda-True-True-10-relu-relu]>}
firstresult = True
def __call__(self, **kwargs: object) -> Any:
assert (
not self.is_historic()
), "Cannot directly call a historic hook - use call_historic instead."
self._verify_all_args_are_provided(kwargs)
firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
> return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_runtest_protocol'
methods = [<HookImpl plugin_name='runner', plugin=<module '_pytest.runner' from '/opt/conda/envs/pytorch/lib/python3.9/site-pack...s', plugin=<module '_pytest.warnings' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/warnings.py'>>]
kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>, 'nextitem': <Function test_activation[cuda-True-True-10-relu-relu]>}
firstresult = True
def _hookexec(
self,
hook_name: str,
methods: Sequence[HookImpl],
kwargs: Mapping[str, object],
firstresult: bool,
) -> object | list[object]:
# called from all hookcaller instances.
# enable_tracing will set its own wrapping function at self._inner_hookexec
> return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, nextitem = <Function test_activation[cuda-True-True-10-relu-relu]>
def pytest_runtest_protocol(item: Item, nextitem: Optional[Item]) -> bool:
ihook = item.ihook
ihook.pytest_runtest_logstart(nodeid=item.nodeid, location=item.location)
> runtestprotocol(item, nextitem=nextitem)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, log = True, nextitem = <Function test_activation[cuda-True-True-10-relu-relu]>
def runtestprotocol(
item: Item, log: bool = True, nextitem: Optional[Item] = None
) -> List[TestReport]:
hasrequest = hasattr(item, "_request")
if hasrequest and not item._request: # type: ignore[attr-defined]
# This only happens if the item is re-run, as is done by
# pytest-rerunfailures.
item._initrequest() # type: ignore[attr-defined]
rep = call_and_report(item, "setup", log)
reports = [rep]
if rep.passed:
if item.config.getoption("setupshow", False):
show_test_item(item)
if not item.config.getoption("setuponly", False):
> reports.append(call_and_report(item, "call", log))
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:133:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, when = 'call', log = True, kwds = {}
def call_and_report(
item: Item, when: "Literal['setup', 'call', 'teardown']", log: bool = True, **kwds
) -> TestReport:
> call = call_runtest_hook(item, when, **kwds)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:222:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
item = <Function test_activation[cuda-True-True-10-tanh-tanh]>, when = 'call', kwds = {}
reraise = (<class '_pytest.outcomes.Exit'>, <class 'KeyboardInterrupt'>)
def call_runtest_hook(
item: Item, when: "Literal['setup', 'call', 'teardown']", **kwds
) -> "CallInfo[None]":
if when == "setup":
ihook: Callable[..., None] = item.ihook.pytest_runtest_setup
elif when == "call":
ihook = item.ihook.pytest_runtest_call
elif when == "teardown":
ihook = item.ihook.pytest_runtest_teardown
else:
assert False, f"Unhandled runtest hook case: {when}"
reraise: Tuple[Type[BaseException], ...] = (Exit,)
if not item.config.getoption("usepdb", False):
reraise += (KeyboardInterrupt,)
> return CallInfo.from_call(
lambda: ihook(item=item, **kwds), when=when, reraise=reraise
)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:261:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = <class '_pytest.runner.CallInfo'>, func = <function call_runtest_hook.<locals>.<lambda> at 0x7f0d312a2940>, when = 'call'
reraise = (<class '_pytest.outcomes.Exit'>, <class 'KeyboardInterrupt'>)
@classmethod
def from_call(
cls,
func: "Callable[[], TResult]",
when: "Literal['collect', 'setup', 'call', 'teardown']",
reraise: Optional[
Union[Type[BaseException], Tuple[Type[BaseException], ...]]
] = None,
) -> "CallInfo[TResult]":
"""Call func, wrapping the result in a CallInfo.
:param func:
The function to call. Called without arguments.
:param when:
The phase in which the function is called.
:param reraise:
Exception or exceptions that shall propagate if raised by the
function, instead of being wrapped in the CallInfo.
"""
excinfo = None
start = timing.time()
precise_start = timing.perf_counter()
try:
> result: Optional[TResult] = func()
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:341:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> lambda: ihook(item=item, **kwds), when=when, reraise=reraise
)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:262:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_HookCaller 'pytest_runtest_call'>, kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = False
def __call__(self, **kwargs: object) -> Any:
assert (
not self.is_historic()
), "Cannot directly call a historic hook - use call_historic instead."
self._verify_all_args_are_provided(kwargs)
firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
> return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_runtest_call'
methods = [<HookImpl plugin_name='runner', plugin=<module '_pytest.runner' from '/opt/conda/envs/pytorch/lib/python3.9/site-pack...dule '_pytest.threadexception' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/threadexception.py'>>]
kwargs = {'item': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = False
def _hookexec(
self,
hook_name: str,
methods: Sequence[HookImpl],
kwargs: Mapping[str, object],
firstresult: bool,
) -> object | list[object]:
# called from all hookcaller instances.
# enable_tracing will set its own wrapping function at self._inner_hookexec
> return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
item = <Function test_activation[cuda-True-True-10-tanh-tanh]>
def pytest_runtest_call(item: Item) -> None:
_update_current_test_var(item, "call")
try:
del sys.last_type
del sys.last_value
del sys.last_traceback
except AttributeError:
pass
try:
> item.runtest()
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/runner.py:169:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Function test_activation[cuda-True-True-10-tanh-tanh]>
def runtest(self) -> None:
"""Execute the underlying test function."""
> self.ihook.pytest_pyfunc_call(pyfuncitem=self)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/python.py:1792:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_HookCaller 'pytest_pyfunc_call'>, kwargs = {'pyfuncitem': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = True
def __call__(self, **kwargs: object) -> Any:
assert (
not self.is_historic()
), "Cannot directly call a historic hook - use call_historic instead."
self._verify_all_args_are_provided(kwargs)
firstresult = self.spec.opts.get("firstresult", False) if self.spec else False
> return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_hooks.py:433:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <_pytest.config.PytestPluginManager object at 0x7f0dd31240a0>, hook_name = 'pytest_pyfunc_call'
methods = [<HookImpl plugin_name='python', plugin=<module '_pytest.python' from '/opt/conda/envs/pytorch/lib/python3.9/site-pack...ugin=<module 'anyio.pytest_plugin' from '/opt/conda/envs/pytorch/lib/python3.9/site-packages/anyio/pytest_plugin.py'>>]
kwargs = {'pyfuncitem': <Function test_activation[cuda-True-True-10-tanh-tanh]>}, firstresult = True
def _hookexec(
self,
hook_name: str,
methods: Sequence[HookImpl],
kwargs: Mapping[str, object],
firstresult: bool,
) -> object | list[object]:
# called from all hookcaller instances.
# enable_tracing will set its own wrapping function at self._inner_hookexec
> return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/pluggy/_manager.py:112:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyfuncitem = <Function test_activation[cuda-True-True-10-tanh-tanh]>
@hookimpl(trylast=True)
def pytest_pyfunc_call(pyfuncitem: "Function") -> Optional[object]:
testfunction = pyfuncitem.obj
if is_async_function(testfunction):
async_warn_and_skip(pyfuncitem.nodeid)
funcargs = pyfuncitem.funcargs
testargs = {arg: funcargs[arg] for arg in pyfuncitem._fixtureinfo.argnames}
> result = testfunction(**testargs)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/_pytest/python.py:194:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
f1 = <built-in method tanh of type object at 0x7f0dd1982040>, f2 = <function tanh at 0x7f0d312cbee0>, size = 10, quantize_backprop = True
device = 'cuda', custom_cuda = True
@pytest.mark.parametrize("f1, f2", [
# (torch.sigmoid, sigmoid),
(torch.tanh, tanh),
(F.relu, relu),
(F.relu6, relu6),
(F.leaky_relu, leaky_relu),
(F.silu, silu),
])
@pytest.mark.parametrize("size", SIZE)
@pytest.mark.parametrize("quantize_backprop", [True, False])
@pytest.mark.parametrize("device, custom_cuda", DEVICE__CUSTOM_CUDA)
def test_activation(f1, f2, size, quantize_backprop, device, custom_cuda):
# mx specs. Use a large bitwidth since we're testing
# algorithmic correctness, not precision
mx_specs = apply_mx_specs(None)
mx_specs['bfloat'] = 30
mx_specs['quantize_backprop'] = quantize_backprop
mx_specs['custom_cuda'] = custom_cuda
kwargs = {'negative_slope': 0.4} if f2 is leaky_relu else {}
# Create shared input for two networks
m_ = np.random.randn(size)
m1 = torch.tensor(m_, dtype=torch.float32, device=device, requires_grad=True)
m2 = torch.tensor(m_, dtype=torch.float32, device=device, requires_grad=True)
q1 = f1(m1, **kwargs)
loss1 = (q1**2).sum()
loss1.backward()
torch.cuda.synchronize()
> q2 = f2(m2, mx_specs=mx_specs, **kwargs)
test_activations.py:106:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input = tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
name = None
def tanh(input, mx_specs=None, name=None):
mx_assert_test(mx_specs)
if mx_specs is None:
return torch.tanh(input)
mx_specs = apply_mx_specs(mx_specs)
> return TanhFunction.apply(input, mx_specs, name)
../activations.py:32:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = <class 'mx.activations.TanhFunction'>
args = (tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cu...d_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}, None)
kwargs = {}, bind_default_args = <function Function.apply.<locals>.bind_default_args at 0x7f0d312a2dc0>, is_setup_ctx_defined = False
@classmethod
def apply(cls, *args, **kwargs):
def bind_default_args(func, *args, **kwargs):
signature = inspect.signature(func)
bound_args = signature.bind(*args, **kwargs)
bound_args.apply_defaults()
return bound_args.args
is_setup_ctx_defined = cls.setup_context != _SingleLevelFunction.setup_context
if is_setup_ctx_defined:
args = bind_default_args(cls.forward, *args, **kwargs)
if not torch._C._are_functorch_transforms_active():
# See NOTE: [functorch vjp and autograd interaction]
args = _functorch.utils.unwrap_dead_wrappers(args)
> return super().apply(*args, **kwargs) # type: ignore[misc]
/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:551:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ctx = <torch.autograd.function.TanhFunctionBackward object at 0x7f0d312a4740>
input = tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
name = None
@staticmethod
def forward(ctx, input, mx_specs=None, name=None):
ctx.name = name
> input = vec_quantize(input, mx_specs=mx_specs)
../activations.py:257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input = tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
round = None
def vec_quantize(input, mx_specs=None, round=None):
> return quantize_elemwise_op(input, mx_specs=mx_specs,
round=round)
../vector_ops.py:35:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cuda:0', requires_grad=True)
mx_specs = {'scale_bits': 0, 'w_elem_format': None, 'a_elem_format': None, 'w_elem_format_bp': None, 'a_elem_format_bp_ex': None,...put_grad_weight': 'nearest', 'softmax_exp2': False, 'vec_use_exp2': False, 'vec_use_recip': False, 'custom_cuda': True}
round = 'nearest'
def quantize_elemwise_op(A, mx_specs, round=None):
"""A function used for element-wise quantization with mx_specs
Arguments:
A {PyTorch tensor} -- a tensor that needs to be quantized
mx_specs {dictionary} -- dictionary to specify mx_specs
round {str} -- Rounding mode, choose from (floor, nearest, even)
(default: "nearest")
Returns:
quantized value {PyTorch tensor} -- a tensor that has been quantized
"""
if mx_specs is None:
return A
elif round is None:
round = mx_specs['round']
if mx_specs['bfloat'] > 0 and mx_specs['fp'] > 0:
raise ValueError("Cannot set both [bfloat] and [fp] in mx_specs.")
elif mx_specs['bfloat'] > 9:
> A = _quantize_bfloat(A, bfloat=mx_specs['bfloat'], round=round,
custom_cuda=mx_specs['custom_cuda'],
../elemwise_ops.py:253:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cuda:0', requires_grad=True)
bfloat = 30, round = 'nearest', custom_cuda = True, allow_denorm = True
def _quantize_bfloat(A, bfloat, round='nearest', custom_cuda=False, allow_denorm=True):
""" Quantize values to bfloatX format
Arguments:
bfloat {int} -- Total number of bits for bfloatX format,
Includes 1 sign, 8 exp bits, and variable
mantissa bits. Must be >= 9.
"""
# Shortcut for no quantization
if bfloat == 0 or bfloat == 32:
return A
max_norm = _get_max_norm(8, bfloat-7)
> return _quantize_elemwise_core(
A, bits=bfloat-7, exp_bits=8, max_norm=max_norm, round=round,
allow_denorm=allow_denorm, custom_cuda=custom_cuda)
../elemwise_ops.py:206:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
A = tensor([-0.0311, -0.0383, 1.2802, 1.1709, 0.1558, 0.3154, -0.3951, 0.4267,
-0.8888, 0.4220], device='cuda:0', requires_grad=True)
bits = 23, exp_bits = 8, max_norm = 3.4028228579130005e+38, round = 'nearest', saturate_normals = False, allow_denorm = True, custom_cuda = True
def _quantize_elemwise_core(A, bits, exp_bits, max_norm, round='nearest',
saturate_normals=False, allow_denorm=True,
custom_cuda=False):
""" Core function used for element-wise quantization
Arguments:
A {PyTorch tensor} -- A tensor to be quantized
bits {int} -- Number of mantissa bits. Includes
sign bit and implicit one for floats
exp_bits {int} -- Number of exponent bits, 0 for ints
max_norm {float} -- Largest representable normal number
round {str} -- Rounding mode: (floor, nearest, even)
saturate_normals {bool} -- If True, normal numbers (i.e., not NaN/Inf)
that exceed max norm are clamped.
Must be True for correct MX conversion.
allow_denorm {bool} -- If False, flush denorm numbers in the
elem_format to zero.
custom_cuda {str} -- If True, use custom CUDA kernels
Returns:
quantized tensor {PyTorch tensor} -- A tensor that has been quantized
"""
A_is_sparse = A.is_sparse
if A_is_sparse:
if A.layout != torch.sparse_coo:
raise NotImplementedError("Only COO layout sparse tensors are currently supported.")
sparse_A = A.coalesce()
A = sparse_A.values().clone()
# custom cuda only support floor and nearest rounding modes
custom_cuda = custom_cuda and round in RoundingMode.string_enums()
if custom_cuda:
A = A.contiguous()
> from . import custom_extensions
../elemwise_ops.py:118:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
"""
Copyright (c) Microsoft Corporation.
Licensed under the MIT License.
Python interface for custom CUDA implementations of functions.
"""
import os
from torch.utils.cpp_extension import load
sources = [
"funcs.cpp",
"mx.cu",
"elemwise.cu",
"reduce.cu",
]
file_dir = os.path.dirname(__file__)
sources = [os.path.join(file_dir, "cpp", x) for x in sources]
> funcs = load(name="custom_extensions", sources=sources)
../custom_extensions.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'custom_extensions'
sources = ['/home/ubuntu/microxcaling/mx/cpp/funcs.cpp', '/home/ubuntu/microxcaling/mx/cpp/mx.cu', '/home/ubuntu/microxcaling/mx/cpp/elemwise.cu', '/home/ubuntu/microxcaling/mx/cpp/reduce.cu']
extra_cflags = None, extra_cuda_cflags = None, extra_ldflags = None, extra_include_paths = None, build_directory = None, verbose = False
with_cuda = None, is_python_module = True, is_standalone = False, keep_intermediates = True
def load(name,
sources: Union[str, List[str]],
extra_cflags=None,
extra_cuda_cflags=None,
extra_ldflags=None,
extra_include_paths=None,
build_directory=None,
verbose=False,
with_cuda: Optional[bool] = None,
is_python_module=True,
is_standalone=False,
keep_intermediates=True):
r'''
Loads a PyTorch C++ extension just-in-time (JIT).
To load an extension, a Ninja build file is emitted, which is used to
compile the given sources into a dynamic library. This library is
subsequently loaded into the current Python process as a module and
returned from this function, ready for use.
By default, the directory to which the build file is emitted and the
resulting library compiled to is ``<tmp>/torch_extensions/<name>``, where
``<tmp>`` is the temporary folder on the current platform and ``<name>``
the name of the extension. This location can be overridden in two ways.
First, if the ``TORCH_EXTENSIONS_DIR`` environment variable is set, it
replaces ``<tmp>/torch_extensions`` and all extensions will be compiled
into subfolders of this directory. Second, if the ``build_directory``
argument to this function is supplied, it overrides the entire path, i.e.
the library will be compiled into that folder directly.
To compile the sources, the default system compiler (``c++``) is used,
which can be overridden by setting the ``CXX`` environment variable. To pass
additional arguments to the compilation process, ``extra_cflags`` or
``extra_ldflags`` can be provided. For example, to compile your extension
with optimizations, pass ``extra_cflags=['-O3']``. You can also use
``extra_cflags`` to pass further include directories.
CUDA support with mixed compilation is provided. Simply pass CUDA source
files (``.cu`` or ``.cuh``) along with other sources. Such files will be
detected and compiled with nvcc rather than the C++ compiler. This includes
passing the CUDA lib64 directory as a library directory, and linking
``cudart``. You can pass additional flags to nvcc via
``extra_cuda_cflags``, just like with ``extra_cflags`` for C++. Various
heuristics for finding the CUDA install directory are used, which usually
work fine. If not, setting the ``CUDA_HOME`` environment variable is the
safest option.
Args:
name: The name of the extension to build. This MUST be the same as the
name of the pybind11 module!
sources: A list of relative or absolute paths to C++ source files.
extra_cflags: optional list of compiler flags to forward to the build.
extra_cuda_cflags: optional list of compiler flags to forward to nvcc
when building CUDA sources.
extra_ldflags: optional list of linker flags to forward to the build.
extra_include_paths: optional list of include directories to forward
to the build.
build_directory: optional path to use as build workspace.
verbose: If ``True``, turns on verbose logging of load steps.
with_cuda: Determines whether CUDA headers and libraries are added to
the build. If set to ``None`` (default), this value is
automatically determined based on the existence of ``.cu`` or
``.cuh`` in ``sources``. Set it to `True`` to force CUDA headers
and libraries to be included.
is_python_module: If ``True`` (default), imports the produced shared
library as a Python module. If ``False``, behavior depends on
``is_standalone``.
is_standalone: If ``False`` (default) loads the constructed extension
into the process as a plain dynamic library. If ``True``, build a
standalone executable.
Returns:
If ``is_python_module`` is ``True``:
Returns the loaded PyTorch extension as a Python module.
If ``is_python_module`` is ``False`` and ``is_standalone`` is ``False``:
Returns nothing. (The shared library is loaded into the process as
a side effect.)
If ``is_standalone`` is ``True``.
Return the path to the executable. (On Windows, TORCH_LIB_PATH is
added to the PATH environment variable as a side effect.)
Example:
>>> # xdoctest: +SKIP
>>> from torch.utils.cpp_extension import load
>>> module = load(
... name='extension',
... sources=['extension.cpp', 'extension_kernel.cu'],
... extra_cflags=['-O2'],
... verbose=True)
'''
> return _jit_compile(
name,
[sources] if isinstance(sources, str) else sources,
extra_cflags,
extra_cuda_cflags,
extra_ldflags,
extra_include_paths,
build_directory or _get_build_directory(name, verbose),
verbose,
with_cuda,
is_python_module,
is_standalone,
keep_intermediates=keep_intermediates)
/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1308:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'custom_extensions'
sources = ['/home/ubuntu/microxcaling/mx/cpp/funcs.cpp', '/home/ubuntu/microxcaling/mx/cpp/mx.cu', '/home/ubuntu/microxcaling/mx/cpp/elemwise.cu', '/home/ubuntu/microxcaling/mx/cpp/reduce.cu']
extra_cflags = None, extra_cuda_cflags = None, extra_ldflags = None, extra_include_paths = None
build_directory = '/home/ubuntu/.cache/torch_extensions/py39_cu121/custom_extensions', verbose = False, with_cuda = True, is_python_module = True
is_standalone = False, keep_intermediates = True
def _jit_compile(name,
sources,
extra_cflags,
extra_cuda_cflags,
extra_ldflags,
extra_include_paths,
build_directory: str,
verbose: bool,
with_cuda: Optional[bool],
is_python_module,
is_standalone,
keep_intermediates=True) -> None:
if is_python_module and is_standalone:
raise ValueError("`is_python_module` and `is_standalone` are mutually exclusive.")
if with_cuda is None:
with_cuda = any(map(_is_cuda_file, sources))
with_cudnn = any('cudnn' in f for f in extra_ldflags or [])
old_version = JIT_EXTENSION_VERSIONER.get_version(name)
version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
name,
sources,
build_arguments=[extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths],
build_directory=build_directory,
with_cuda=with_cuda,
is_python_module=is_python_module,
is_standalone=is_standalone,
)
if version > 0:
if version != old_version and verbose:
print(f'The input conditions for extension module {name} have changed. ' +
f'Bumping to version {version} and re-building as {name}_v{version}...',
file=sys.stderr)
name = f'{name}_v{version}'
if version != old_version:
baton = FileBaton(os.path.join(build_directory, 'lock'))
if baton.try_acquire():
try:
with GeneratedFileCleaner(keep_intermediates=keep_intermediates) as clean_ctx:
if IS_HIP_EXTENSION and (with_cuda or with_cudnn):
hipify_result = hipify_python.hipify(
project_directory=build_directory,
output_directory=build_directory,
header_include_dirs=(extra_include_paths if extra_include_paths is not None else []),
extra_files=[os.path.abspath(s) for s in sources],
ignores=[_join_rocm_home('*'), os.path.join(_TORCH_PATH, '*')], # no need to hipify ROCm or PyTorch headers
show_detailed=verbose,
show_progress=verbose,
is_pytorch_extension=True,
clean_ctx=clean_ctx
)
hipified_sources = set()
for source in sources:
s_abs = os.path.abspath(source)
hipified_sources.add(hipify_result[s_abs].hipified_path if s_abs in hipify_result else s_abs)
sources = list(hipified_sources)
_write_ninja_file_and_build_library(
name=name,
sources=sources,
extra_cflags=extra_cflags or [],
extra_cuda_cflags=extra_cuda_cflags or [],
extra_ldflags=extra_ldflags or [],
extra_include_paths=extra_include_paths or [],
build_directory=build_directory,
verbose=verbose,
with_cuda=with_cuda,
is_standalone=is_standalone)
finally:
baton.release()
else:
> baton.wait()
/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1724:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <torch.utils.file_baton.FileBaton object at 0x7f0c763a4b20>
def wait(self):
'''
Periodically sleeps for a certain amount until the baton is released.
The amount of time slept depends on the ``wait_seconds`` parameter
passed to the constructor.
'''
while os.path.exists(self.lock_file_path):
> time.sleep(self.wait_seconds)
E KeyboardInterrupt
/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py:42: KeyboardInterrupt
============================================================ 40 passed, 3 warnings in 13.45s
Just in case - the ninja version on both servers is: Requirement already satisfied: ninja in /opt/conda/envs/pytorch/lib/python3.9/site-packages (1.11.1)
The multi-GPU failure is inside PyTorch when it tries to build the custom extensions. Are multiple processes trying to run the test simultaneously? It looks like multiple processes are trying to build the CUDA code at once, and there's a race condition. I don't think PyTorch supports multiple processes calling cpp_extension.load() to build the CUDA code simultaneously. For thread safety in our code, we've always had one process execute the following to build the CUDA code first:
if torch.distributed.get_rank() == 0:
    import mx.custom_extensions
torch.distributed.barrier()
If that's the problem, I don't think there's a solution to the tests hanging in a multi-GPU environment. I will document the thread-safety issue above.
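For reference, a slightly fuller sketch of that pattern (this assumes torch.distributed has already been initialized via init_process_group; the helper name is just for illustration):

import torch.distributed as dist

def build_custom_extensions_once():
    # Rank 0 triggers the JIT compile first; the other ranks wait at the
    # barrier and then import the already-built library from the cache.
    if dist.get_rank() == 0:
        import mx.custom_extensions
    dist.barrier()
    import mx.custom_extensions  # no-op on rank 0, cached build on the others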
I'm not running anything related to torch.distributed here (so without that there is no concept of a local rank to do the check you mentioned above), and only GPU 0 should be active. I'm just firing the test, which should only activate cuda:0 regardless of how many GPUs are present. Wondering if there's a need to spec how items are moving to cuda.
Wondering if there's a need to spec how items are moving to cuda.
Do you mean how tensors are moving to the GPU?
In the multi-GPU setup, if you do:
import mx.custom_extensions
Does it hang?
Hi @rizhao-msft - yes, simply importing it will generate the hang.
I gave it about 5 minutes to make sure... it's the exact same deadlock:
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/microxcaling/examples/../mx/custom_extensions.py", line 19, in <module>
funcs = load(name="custom_extensions", sources=sources)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
return _jit_compile(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1724, in _jit_compile
baton.wait()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py", line 42, in wait
time.sleep(self.wait_seconds)
KeyboardInterrupt
Interesting - we've tested on multi-GPU machines before and never had this issue. The hang can happen when using torch.distributed to do multi-GPU training, but never when just importing it in a single process.
I'll take a look to see if I can reproduce.
Cool - for reference, I verified that the same simple import above has no issue on a single-GPU machine (maybe a 30-second delay and then it's ready).
Hi @lessw2020
A couple of questions - what PyTorch version are you using? Can you try 2.1.0?
Does it still happen when you set CUDA_VISIBLE_DEVICES=0 before the command? Or if you use PYTHONPATH=PATH_TO_MX_FOLDER?
If you paste the output of pip list --format=freeze, that might help us track down what is happening too.
Wondering if there's a need to spec how items are moving to cuda.
The underlying tensors are still normal pytorch tensors, so nothing special should be needed there.
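For the CUDA_VISIBLE_DEVICES check, something like the following should be equivalent to setting it on the shell command line (just a sketch - the sys.path line is a placeholder for however you put the repo on the path):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before torch/CUDA is initialized

import sys
sys.path.append("..")          # assumes the current directory is inside the repo
import mx.custom_extensions    # the import that hangs in baton.wait()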
Hi @mgolub2-ms: I'm running with the latest PyTorch nightlies (1106). Installing 2.1.0 now to see if that fixes it.
The deadlock is the same with 2.1.0 (a bit relieved that's the case... going back to nightlies).
>>> import mx.custom_extensions
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/microxcaling/examples/../mx/custom_extensions.py", line 19, in <module>
funcs = load(name="custom_extensions", sources=sources)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
return _jit_compile(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1724, in _jit_compile
baton.wait()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py", line 42, in wait
time.sleep(self.wait_seconds)
KeyboardInterrupt
Does it still happen when you set CUDA_VISIBLE_DEVICES=0 before the command?
Yes, no change, same deadlock.
Here's the pip freeze for reference on this server. But I can repro this on multiple servers, so I don't think there's anything specific here (A10 on EC2, A100 on premise). You can spin up a $5/hr G5 A10 (4 GPU) instance on EC2 and repro this - that might be helpful at this point, since you could then investigate directly. One common aspect is that I'm running CUDA 12.1 on all servers. I also have the Triton nightly build (OpenAI Triton) installed on both servers... not sure if either could be an issue/delta, but if you are running only CUDA 11.8, perhaps that might explain the difference?
Package Version Editable project location
---------------------------------- ----------------------- --------------------------------
absl-py 1.4.0
accelerate 0.24.1
aiohttp 3.8.4
aiosignal 1.3.1
aniso8601 9.0.1
ansi2html 1.8.0
anyio 3.6.2
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
awscli 1.22.101
Babel 2.12.1
backcall 0.2.0
backports.functools-lru-cache 1.6.4
beautifulsoup4 4.12.2
bitsandbytes 0.41.0
black 22.3.0
bleach 6.0.0
blinker 1.6.2
blis 0.7.9
blobfile 2.0.2
bokeh 2.4.3
boto3 1.26.142
botocore 1.29.142
Brotli 1.0.9
brotlipy 0.7.0
cachetools 5.3.1
captum 0.5.0
catalogue 2.0.8
cbor2 5.4.6
certifi 2023.5.7
cffi 1.15.1
cfgv 3.4.0
charset-normalizer 3.1.0
click 8.1.3
cloudpickle 2.2.1
cmake 3.25.0
colorama 0.4.3
comm 0.1.3
commonmark 0.9.1
confection 0.0.4
contextlib2 21.6.0
coverage 7.3.0
cryptography 40.0.2
cycler 0.11.0
cymem 2.0.7
DALL-E 0.1
dataclasses 0.8
datasets 2.13.1
debugpy 1.6.7
decorator 5.1.1
deepspeed 0.9.5
defusedxml 0.7.1
diffusion 6.9.1
diffusion-core 0.0.28
dill 0.3.6
distlib 0.3.7
docker-pycreds 0.4.0
docutils 0.15.2
dparse 0.6.2
e 1.4.5
einops 0.6.1
entrypoints 0.4
evaluate 0.4.0
exceptiongroup 1.1.2
executing 1.2.0
fairscale 0.4.5
fastai 2.1.10
fastcore 1.5.29
fastjsonschema 2.17.1
fastprogress 1.0.3
filelock 3.12.2
flake8 4.0.1
flake8-bugbear 22.4.25
flake8-polyfill 1.0.2
Flask 2.3.2
Flask-RESTful 0.3.10
flit_core 3.9.0
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
future 0.18.3
gekko 1.0.6
gitdb 4.0.10
GitPython 3.1.31
google-auth 2.22.0
google-auth-oauthlib 1.0.0
google-pasta 0.2.0
grpcio 1.56.0
gym 0.26.2
gym-notices 0.0.8
h5py 3.6.0
hjson 3.1.0
horovod 0.28.0
huggingface-hub 0.17.3
identify 2.5.26
idna 3.4
imageio 2.16.2
importlib-metadata 4.13.0
importlib-resources 5.12.0
inflate64 0.3.1
iniconfig 2.0.0
install 1.3.5
iopath 0.1.10
ipykernel 6.23.1
ipython 8.13.2
ipython-genutils 0.2.0
ipywidgets 8.0.6
isort 5.12.0
itsdangerous 2.1.2
jedi 0.18.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.2.0
json5 0.9.5
jsonschema 4.17.3
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyter-server 1.23.6
jupyterlab 3.3.4
jupyterlab-pygments 0.2.2
jupyterlab_server 2.22.1
jupyterlab-widgets 3.0.7
kiwisolver 1.4.4
langcodes 3.3.0
libcst 1.0.1
lit 15.0.7
llvmlite 0.40.0
lxml 4.9.3
Markdown 3.4.3
MarkupSafe 2.1.2
matplotlib 3.5.3
matplotlib-inline 0.1.6
mccabe 0.6.1
memory-efficient-attention-pytorch 0.1.6
mistune 2.0.5
moreorless 0.4.0
mpmath 1.2.1
multidict 6.0.4
multiprocess 0.70.14
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.9
mypy 1.0.1
mypy-extensions 1.0.0
nbclassic 0.5.6
nbclient 0.8.0
nbconvert 7.4.0
nbformat 5.8.0
nest-asyncio 1.5.6
networkx 3.0
ninja 1.11.1
nltk 3.8.1
nodeenv 1.8.0
notebook 6.4.12
notebook_shim 0.2.3
numba 0.57.0
numpy 1.24.3
nvgpu 0.9.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
opt-einsum 3.3.0
packaging 23.1
pandas 1.4.4
pandocfilters 1.5.0
parso 0.8.3
pathos 0.3.0
pathspec 0.11.1
pathtools 0.1.2
pathy 0.10.1
patsy 0.5.3
peft 0.6.0
pep8-naming 0.12.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.5.0
pip 22.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.10.0
pluggy 1.2.0
ply 3.11
portalocker 2.7.0
pox 0.3.2
ppft 1.7.6.6
pre-commit 3.3.3
pre-commit-hooks 4.4.0
preshed 3.0.8
prometheus-client 0.17.0
prompt-toolkit 3.0.38
protobuf 3.20.2
protobuf3-to-dict 0.1.5
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
py7zr 0.20.5
pyarrow 12.0.0
pyasn1 0.4.8
pyasn1-modules 0.3.0
pybcj 1.0.1
pybind11 2.9.2
pybind11-global 2.9.2
pycodestyle 2.8.0
pycparser 2.21
pycryptodomex 3.18.0
pydantic 1.10.8
pyflakes 2.4.0
pyfunctional 1.4.3
pygame 2.1.3
Pygments 2.15.1
pynvml 11.5.0
pyOpenSSL 23.1.1
pyparsing 3.0.9
pyppmd 1.0.0
PyQt5 5.15.7
PyQt5-sip 12.11.0
pyre-extensions 0.0.29
pyrsistent 0.19.3
PySocks 1.7.1
pytesseract 0.3.10
pytest 7.4.2
pytest-cov 4.1.0
pytest-mock 3.8.2
python-dateutil 2.8.2
pytorch-triton 2.1.0+6e4932cda8
pytz 2023.3
PyYAML 6.0
pyzmq 25.0.2
pyzstd 0.15.9
regex 2023.5.5
requests 2.31.0
requests-oauthlib 1.3.1
responses 0.18.0
rich 12.6.0
rouge 1.0.1
rouge-score 0.1.2
rsa 4.7.2
ruamel.yaml 0.17.28
ruamel.yaml.clib 0.2.7
s3fs 0.4.2
s3transfer 0.6.1
sacremoses 0.0.53
safetensors 0.3.1
sagemaker 2.159.0
schema 0.7.5
scikit-learn 1.0
scipy 1.11.3
seaborn 0.12.2
Send2Trash 1.8.2
sentencepiece 0.1.99
sentry-sdk 1.26.0
setproctitle 1.3.2
setuptools 67.7.2
shap 0.40.0
shellingham 1.5.1
sip 6.7.9
six 1.16.0
slicer 0.0.7
smart-open 5.2.1
smclarify 0.5
smdebug-rulesconfig 1.0.1
smmap 5.0.0
sniffio 1.3.0
soupsieve 2.3.2.post1
spacy 3.5.3
spacy-legacy 3.0.12
spacy-loggers 1.0.4
srsly 2.4.6
stack-data 0.6.2
statsmodels 0.14.0
stdlibs 2022.10.9
stringcase 1.2.0
structlog 21.5.0
sympy 1.11.1
tabulate 0.9.0
tblib 1.7.0
tensorboard 2.13.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
terminado 0.17.1
texttable 1.6.7
thinc 8.1.10
threadpoolctl 3.1.0
tiktoken 0.4.0
timm 0.9.2
tinycss2 1.2.1
tokenizers 0.14.1
toml 0.10.2
tomli 2.0.1
tomlkit 0.12.1
torch 2.1.0
torch-model-archiver 0.5.3b20220226
torch-workflow-archiver 0.2.8b20230512
torchaudio 2.1.0
torchmultimodal 0.1.0b0 /home/ubuntu/PyTorch_MultiModal
torchserve 0.6.0b20220513
torchtext 0.14.1
torchvision 0.16.0
tornado 6.3.2
tqdm 4.63.2
trailrunner 1.4.0
traitlets 5.9.0
transformer_nuggets 0.0.1 /home/ubuntu/transformer_nuggets
transformers 4.35.0
triton 2.1.0
triton-nightly 2.1.0.dev20231012235740
typer 0.7.0
typing_extensions 4.6.2
typing-inspect 0.9.0
ufmt 1.3.0
unicodedata2 15.0.0
urllib3 1.26.15
usort 1.0.2
virtualenv 20.24.3
vit-pytorch 1.2.2
wandb 0.15.4
wasabi 1.1.1
wcwidth 0.2.6
webencodings 0.5.1
websocket-client 1.5.2
Werkzeug 2.3.4
wget 3.2
wheel 0.40.0
widgetsnbextension 4.0.7
xformers 0.0.20
xxhash 3.2.0
yarl 1.9.2
zipp 3.15.0
@rizhao-msft Spun up a multi-GPU (2x) Azure VM and was unable to reproduce the hang - we are continuing to investigate.
Hmm, this is odd - I've been unable to reproduce the issue either, using CUDA 12.1 and torch 2.1.0 on a 4-GPU A100 Azure VM.
I tried configuring accelerate and installing triton/pytorch-triton - no issues running the tests or the examples. I've also tried setting the path using sys.path.append and PYTHONPATH to test if there was an issue there; both work for me.
One thing I noticed that is a bit odd - you have some CUDA 11 packages and some CUDA 12 packages:
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
For comparison, my environment has only CUDA 12 packages:
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 11.525.112
nvidia-ml-py3 7.352.0
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu12 12.1.105
Is it possible to try an environment with a single set of CUDA packages? Or even a brand-new virtualenv, and then install the requirements using pip install -r requirements.txt?
Not sure why multi-GPU would be affected by that vs. a single GPU, though. It's really interesting that CUDA_VISIBLE_DEVICES does not help - to my knowledge, that basically emulates a single-GPU system for any command.
Thanks for the updates @rizhao-msft and @mgolub2-ms.
I'll pull a brand new EC2 instance this afternoon to help isolate this.
I believe the mixed CUDA packages (11 and 12) come from the starting AWS AMI (i.e., they bundle them in), not something I overtly installed.
But at least starting with a brand-new image will help further reduce the search space.
Hi @rizhao-msft and @mgolub2-ms
Some good news here.
A clean new machine works great (with PyTorch 2.1).
I'll upgrade to nightlies next, but at least for now the hang issue is localized to something in my configs.
Here's the pip freeze from the new AWS instance + pip install -r requirements.txt, so there's a snapshot to diff against:
import sys
sys.path.append('..')
import mx.custom_extensions
exit()
(pytorch) ubuntu@ip-172-31-27-89:~/microxcaling/mx$ pip list format=freeze
Package Version
aniso8601 9.0.1
annotated-types 0.6.0
ansi2html 1.8.0
anyio 4.0.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
attrs 23.1.0
awscli 1.29.76
Babel 2.13.1
backports.functools-lru-cache 1.6.5
beautifulsoup4 4.12.2
bleach 6.1.0
blinker 1.7.0
bokeh 3.3.0
boto3 1.28.76
botocore 1.31.76
Brotli 1.1.0
cached-property 1.5.2
captum 0.6.0
certifi 2023.7.22
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 2.2.1
colorama 0.4.4
comm 0.1.4
contextlib2 21.6.0
contourpy 1.1.1
cryptography 41.0.5
cycler 0.12.1
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
docutils 0.15.2
dparse 0.6.3
entrypoints 0.4
exceptiongroup 1.1.3
executing 2.0.1
fastjsonschema 2.18.1
filelock 3.13.1
Flask 3.0.0
Flask-RESTful 0.3.10
fonttools 4.43.1
fqdn 1.5.1
fsspec 2023.10.0
gmpy2 2.1.2
google-pasta 0.2.0
gym 0.26.2
gym-notices 0.0.8
idna 3.4
imageio 2.31.5
importlib-metadata 6.8.0
importlib-resources 6.1.0
iniconfig 2.0.0
ipykernel 6.26.0
ipython 8.17.2
ipython-genutils 0.2.0
ipywidgets 8.1.1
isoduration 20.11.0
itsdangerous 2.1.2
jedi 0.19.1
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
json5 0.9.14
jsonpointer 2.4
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 8.5.0
jupyter_core 5.5.0
jupyter-events 0.8.0
jupyter-lsp 2.2.0
jupyter_server 2.9.1
jupyter_server_terminals 0.4.4
jupyterlab 4.0.8
jupyterlab-pygments 0.2.2
jupyterlab_server 2.25.0
jupyterlab-widgets 3.0.9
kiwisolver 1.4.5
llvmlite 0.41.1
MarkupSafe 2.1.3
matplotlib 3.8.1
matplotlib-inline 0.1.6
mistune 3.0.2
mpmath 1.3.0
multiprocess 0.70.15
munkres 1.1.4
nbclassic 1.0.0
nbclient 0.8.0
nbconvert 7.10.0
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 3.2.1
ninja 1.11.1.1
notebook 6.5.4
notebook_shim 0.2.3
numba 0.58.1
numpy 1.26.0
nvgpu 0.10.0
overrides 7.4.0
packaging 21.3
pandas 2.1.2
pandocfilters 1.5.0
parso 0.8.3
pathos 0.3.1
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.11.0
pluggy 1.3.0
pox 0.3.3
ppft 1.7.6.7
prometheus-client 0.18.0
prompt-toolkit 3.0.39
protobuf 4.25.0
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.0
pyasn1 0.5.0
pybind11 2.11.1
pybind11-global 2.11.1
pycparser 2.21
pydantic 2.4.2
pydantic_core 2.10.1
pyfunctional 1.4.3
pygame 2.5.2
Pygments 2.16.1
pynvml 11.5.0
pyparsing 3.1.1
PySocks 1.7.1
pytest 7.4.3
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
referencing 0.30.2
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.10.6
rsa 4.7.2
ruamel.yaml 0.18.4
ruamel.yaml.clib 0.2.8
s3fs 0.4.2
s3transfer 0.7.0
sagemaker 2.196.0
schema 0.7.5
scikit-learn 1.3.2
scipy 1.11.3
seaborn 0.13.0
Send2Trash 1.8.2
setuptools 68.2.2
shap 0.43.0
six 1.16.0
slicer 0.0.7
smclarify 0.5
smdebug-rulesconfig 1.0.1
sniffio 1.3.0
soupsieve 2.5
stack-data 0.6.2
statsmodels 0.14.0
sympy 1.12
tabulate 0.9.0
tblib 1.7.0
termcolor 2.3.0
terminado 0.17.1
threadpoolctl 3.2.0
tinycss2 1.2.1
tomli 2.0.1
torch 2.1.0
torch-model-archiver 0.7.1b20230208
torch-workflow-archiver 0.2.11b20231012
torchaudio 2.1.0
torchdata 0.7.0
torchserve 0.9.0b20231012
torchtext 0.16.0
torchvision 0.16.0
tornado 6.3.3
tqdm 4.66.1
traitlets 5.9.0
triton 2.1.0
types-python-dateutil 2.8.19.14
typing_extensions 4.8.0
typing-utils 0.1.0
tzdata 2023.3
unicodedata2 15.1.0
uri-template 1.3.0
urllib3 1.26.18
wcwidth 0.2.9
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.4
Werkzeug 3.0.1
wheel 0.41.3
widgetsnbextension 4.0.9
xyzservices 2023.10.1
zipp 3.17.0
OK, working great with the latest PT nightly (1108) as well. I also don't find any of the stray CUDA libs on this new machine, so it seems that having those may have been the issue (they may have come from building PyTorch from source). Anyway, at least for EC2, this issue is resolved. I will set up a new env for on-prem and see if that also resolves it.
Good! We also have an alternative way to build the CUDA extensions, on the branch dev/rizhao/prebuilt_extensions. You can switch to that branch, then go to mx/cpp, and run python setup.py install. This will install the CUDA extension as a Python package, and it will avoid running the load() function that was hanging.
This could help if you run into the problem again.
That sounds great re: prebuilt - I'll switch to that branch. I'll start converting a toy model tomorrow to start actively running with MX FP6, now that I'm past the hang. I'll go ahead and close this one, since it's clear it's not a generic issue like I initially thought based on it showing up on 2 different servers. Thanks for the help getting past the hanging issue!
I ran the unit test suites but this resulted in hangs... Debugging a bit, I isolated it to test_activations.py, and specifically the gelu and tanh unit tests.
I thought perhaps it was due to running on A10 GPUs, so I changed to an A100 but hit the exact same hangs.
I'm running on the latest PT nightlies; not sure if that has an impact here or not. From manually running the other tests relevant for LLMs, these all passed with no issues.