microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Got Segmentation fault when running amc_search.py #4000

Closed: twmht closed this issue 2 years ago

twmht commented 3 years ago

Hi,

I get a segmentation fault when running amc_search.py. After debugging with faulthandler, it seems that the error is caused by scipy:

Traceback (most recent call last):
  File "tools/amc_search.py", line 187, in <module>
    pruner.compress()
  File "/home/shared/nfs/acer-share/bushido/third_party/nni/nni/algorithms/compression/pytorch/pruning/amc/amc_pruner.py", line 210, in compress
    self.train(self.ddpg_args.train_episode, self.agent, self.env, self.output_dir)
  File "/home/shared/nfs/acer-share/bushido/third_party/nni/nni/algorithms/compression/pytorch/pruning/amc/amc_pruner.py", line 229, in train
    action = agent.select_action(observation, episode = episode)
  File "/home/shared/nfs/acer-share/bushido/third_party/nni/nni/algorithms/compression/pytorch/pruning/amc/lib/agent.py", line 186, in select_action
    action = self.sample_from_truncated_normal_distribution(lower = self.lbound, upper = self.rbound, mu = action, sigma = delta)
  File "/home/shared/nfs/acer-share/bushido/third_party/nni/nni/algorithms/compression/pytorch/pruning/amc/lib/agent.py", line 230, in sample_from_truncated_normal_distribution
    return stats.truncnorm.rvs((lower-mu)/sigma, (upper-mu)/sigma, loc = mu, scale = sigma, size = size)
  File "/home/acer/.pyenv/versions/pytorch/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 966, in rvs
    raise ValueError("Domain error in arguments.")
ValueError: Domain error in arguments.

My scipy version is 1.4.1 and my nni version is v2.3.

Any idea?

linbinskn commented 3 years ago

Have you modified amc_search.py? With the same setup, I cannot reproduce the bug here.

twmht commented 3 years ago

@linbinskn

No modifications, except that I used my own dataset.

twmht commented 3 years ago

@linbinskn

The error comes from stats.truncnorm.rvs. Which version of scipy are you using?

linbinskn commented 3 years ago

I have tried 1.4.1, and everything seems fine.

twmht commented 3 years ago

@linbinskn

I found that the action produced by the actor may be NaN, so mu would be NaN as well (https://github.com/microsoft/nni/blob/master/nni/algorithms/compression/pytorch/pruning/amc/lib/agent.py#L228).

Is it possible for the actor to produce NaN values (https://github.com/microsoft/nni/blob/master/nni/algorithms/compression/pytorch/pruning/amc/lib/agent.py#L32)?
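
A minimal sketch of how a NaN mu could be caught before it reaches the sampler. A NaN mu makes both standardized bounds, (lower-mu)/sigma and (upper-mu)/sigma, NaN as well, which is consistent with the "Domain error in arguments" raised in the traceback above. The helper names `ensure_finite` and `truncnorm_bounds` are hypothetical and are not part of NNI or scipy; the sketch only computes the bounds and leaves the actual sampling call out.

```python
import math


def ensure_finite(name, value):
    """Return value as a float, or raise a descriptive error if it is NaN or infinite."""
    value = float(value)
    if not math.isfinite(value):
        raise ValueError(f"{name} is not finite: {value!r}")
    return value


def truncnorm_bounds(lower, upper, mu, sigma):
    """Compute the standardized bounds in the form used in the traceback,
    (lower - mu) / sigma and (upper - mu) / sigma, after validating mu and sigma.

    Hypothetical helper for illustration only.
    """
    mu = ensure_finite("mu", mu)
    sigma = ensure_finite("sigma", sigma)
    if sigma <= 0:
        raise ValueError(f"sigma must be positive, got {sigma!r}")
    return (lower - mu) / sigma, (upper - mu) / sigma


if __name__ == "__main__":
    print(truncnorm_bounds(0.0, 1.0, 0.5, 0.25))
    try:
        truncnorm_bounds(0.0, 1.0, float("nan"), 0.25)
    except ValueError as err:
        # The message names the offending argument instead of a generic domain error.
        print("caught:", err)
```

A check like this does not fix the underlying NaN, but it points at the actor output rather than at scipy when the failure happens.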

linbinskn commented 3 years ago

Exploding gradients can lead to NaN values.
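
To illustrate the mechanism with a toy example (not the AMC actor itself): plain gradient descent on f(w) = w**2 with a step size that is too large makes the weight grow until it overflows to infinity, and the next update computes inf - inf, which is NaN. The function `simulate_descent` and its `max_grad` clipping parameter are assumptions made for this sketch.

```python
import math


def simulate_descent(lr, steps, max_grad=None):
    """Run gradient descent on f(w) = w**2 starting from w = 1.0.

    If max_grad is given, the gradient is clamped to [-max_grad, max_grad]
    before each update (a simple form of gradient clipping).
    """
    w = 1.0
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        if max_grad is not None:
            grad = max(-max_grad, min(max_grad, grad))
        w = w - lr * grad
    return w


if __name__ == "__main__":
    # Step size too large: |w| doubles every step, overflows to inf, then becomes NaN.
    print(math.isnan(simulate_descent(lr=1.5, steps=2000)))
    # Small step size: w shrinks toward 0 and stays finite.
    print(math.isfinite(simulate_descent(lr=0.1, steps=2000)))
    # Same large step size with clipping: the update stays bounded.
    print(math.isfinite(simulate_descent(lr=1.5, steps=2000, max_grad=1.0)))
```

In a real training loop, the usual mitigations are a lower learning rate or gradient clipping; whether either applies to this particular report is not established in the thread.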

twmht commented 3 years ago

@linbinskn

I switched from torch 1.8.0 to torch 1.7.0 and the error is gone. There might be an incompatibility between torch 1.8.0 and nni v2.3.