kakaobrain / fast-autoaugment

Official Implementation of 'Fast AutoAugment' in PyTorch.
MIT License

Crashes in torch #27

Open vvigilante opened 4 years ago

vvigilante commented 4 years ago

Hello, when running the code I encounter the following message from the torch implementation:

what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr:

I used the suggested versions. Do you have any advice? Thank you.

ZergWang commented 4 years ago

Using the same versions as the author, I also encountered this error... I would be grateful if someone could give me some suggestions.

ildoonet commented 4 years ago

@vvigilante @ZergWang Could you post your command and the full stack trace of the error messages?

vvigilante commented 4 years ago

Here is the stack trace:

terminate called after throwing an instance of 'c10::Error'
  what():  owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /pytorch/c10/util/intrusive_ptr.h:348)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb1c459f813 in /home/s4179447/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1838c8f (0x7fb15fd1ec8f in /home/s4179447/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: THStorage_free + 0x17 (0x7fb160447b37 in /home/s4179447/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x71b567 (0x7fb1bfb72567 in /home/s4179447/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xf5 (0x7fb1cea80505 in /lib64/libc.so.6)
frame #22: python3() [0x400c3f]

*** Aborted at 1575239767 (unix time) try "date -d @1575239767" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x209897700006e50) received by PID 28240 (TID 0x7fb1cfc9c740) from PID 28240; stack trace: ***
    @     0x7fb1cf03e5f0 (unknown)
    @     0x7fb1cea94337 __GI_raise
    @     0x7fb1cea95a28 __GI_abort
    @     0x7fb1c4cd4e55 __gnu_cxx::__verbose_terminate_handler()
    @     0x7fb1c4cd2c46 __cxxabiv1::__terminate()
    @     0x7fb1c4cd2c91 std::terminate()
    @     0x7fb1c4cd2ed3 __cxa_throw
    @     0x7fb15fd1ecd9 c10::intrusive_ptr<>::reclaim()
    @     0x7fb160447b37 THStorage_free
    @     0x7fb1bfb72567 THCPFloatStorage_dealloc()
    @     0x7fb1cf83292d subtype_dealloc
    @     0x7fb1cf8127f7 free_keys_object
    @     0x7fb1cf813a00 dict_dealloc
    @     0x7fb1cf7ecb94 cell_dealloc
    @     0x7fb1cf801009 frame_dealloc
    @     0x7fb1cf8ce057 tb_dealloc
    @     0x7fb1cf8ce067 tb_dealloc
    @     0x7fb1cf8ce067 tb_dealloc
    @     0x7fb1cf89ea06 _PyEval_EvalFrameDefault
    @     0x7fb1cf8992f2 _PyEval_EvalCodeWithName
    @     0x7fb1cf8998ce PyEval_EvalCodeEx
    @     0x7fb1cf8998fb PyEval_EvalCode
    @     0x7fb1cf8c2c34 run_mod
    @     0x7fb1cf8c5105 PyRun_FileExFlags
    @     0x7fb1cf8c5265 PyRun_SimpleFileExFlags
    @     0x7fb1cf8db73d Py_Main
    @           0x400b87 main
    @     0x7fb1cea80505 __libc_start_main
    @           0x400c3f (unknown)

The command is invoked via SLURM and looks like this:

#SBATCH -o "logs/log.%x.%j.log"
#SBATCH -t 70:00:00
#SBATCH --nodes=5 --ntasks=5
#SBATCH --partition=gpu
#SBATCH --gres=gpu:v100:1

module load Python/3.6.4-foss-2018a
module load CMake
module load CUDA/9.0.176
module load cuDNN/7.1.4.18-CUDA-9.0.176
module load FFmpeg/3.0.2-foss-2016a

nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
suffix=':6379'
ip_head=$ip_prefix$suffix
export ip_head

export REDIS_PASS="jgepeajkquefanfzmcpq3528v"

echo "Nodes are : "
echo $nodes
echo "--- head: $ip_head"

echo "Running master worker on node $node1..."
srun --ntasks=1 --nodes=1 --kill-on-bad-exit=1 -o "logs/log.%x.%j.%N.log" -w $node1 ray start --block --head --redis-port=6379 --redis-password=$REDIS_PASS &
echo "done"
sleep 5

worker_num=$(( $SLURM_NTASKS-1 ))

for ((  i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  echo "Running slave worker on node $node2"
  srun --ntasks=1 --nodes=1 --kill-on-bad-exit=1 -o "logs/log.%x.%j.%N.log" -w $node2 ray start --block --address=$ip_head --redis-password=$REDIS_PASS &
  echo "done"
  sleep 5
done

echo "Running search script on node $node1"
export PYTHONPATH=.
python3 ./FastAutoAugment/search.py -c confs/senet50.yaml --redis $ip_head
echo "done"

ZergWang commented 4 years ago

My error occurred when running "Train without Augmentations". The code runs a test every five epochs, and I found that the error occurs at the end of a test, perhaps while the code is saving the model. The strange thing is that this error seems to occur randomly: it may happen at the 5th-epoch test, or at the 50th-epoch test... I guess there may be a problem with memory management?

The command I used was to reproduce your results on CIFAR-10:

python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot /home/zergwang/Desktop/faa/dataset/cifar10

The error messages:

terminate called after throwing an instance of 'c10::Error'
  what():  owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1565272279342/work/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1565272279342/work/c10/util/intrusive_ptr.h:348)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7effc586ce37 in /home/zergwang/softwares/anaconda3/envs/faa/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x142c7ce (0x7effc89397ce in /home/zergwang/softwares/anaconda3/envs/faa/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: THStorage_free + 0x17 (0x7effc90bab07 in /home/zergwang/softwares/anaconda3/envs/faa/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x598377 (0x7efff6d35377 in /home/zergwang/softwares/anaconda3/envs/faa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x19aa5e (0x555a0e550a5e in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #5: <unknown function> + 0xf2198 (0x555a0e4a8198 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #6: <unknown function> + 0xe7e58 (0x555a0e49de58 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #7: <unknown function> + 0xf1b77 (0x555a0e4a7b77 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #8: <unknown function> + 0xf1a07 (0x555a0e4a7a07 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #9: <unknown function> + 0xf1a1d (0x555a0e4a7a1d in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #10: <unknown function> + 0xf1a1d (0x555a0e4a7a1d in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #11: _PyEval_EvalFrameDefault + 0x240b (0x555a0e57485b in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #12: PyEval_EvalCodeEx + 0x329 (0x555a0e54a9b9 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #13: PyEval_EvalCode + 0x1c (0x555a0e54b75c in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #14: <unknown function> + 0x215744 (0x555a0e5cb744 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #15: PyRun_FileExFlags + 0xa1 (0x555a0e5cbb41 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #16: PyRun_SimpleFileExFlags + 0x1c3 (0x555a0e5cbd43 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #17: Py_Main + 0x613 (0x555a0e5cf833 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #18: main + 0xee (0x555a0e49988e in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)
frame #19: __libc_start_main + 0xf0 (0x7f0005c0f830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x1c3160 (0x555a0e579160 in /home/zergwang/softwares/anaconda3/envs/faa/bin/python3.6)

*** Aborted at 1575444111 (unix time) try "date -d @1575444111" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e900007f5c) received by PID 32604 (TID 0x7f00063de700) from PID 32604; stack trace: ***
    @     0x7f0005fca390 (unknown)
    @     0x7f0005c24428 gsignal
    @     0x7f0005c2602a abort
    @     0x7efff66d684a __gnu_cxx::__verbose_terminate_handler()
    @     0x7efff66d4f47 __cxxabiv1::__terminate()
    @     0x7efff66d4f7d std::terminate()
    @     0x7efff66d515a __cxa_throw
    @     0x7effc893980f c10::intrusive_ptr<>::reclaim()
    @     0x7effc90bab07 THStorage_free
    @     0x7efff6d35377 THCPFloatStorage_dealloc()
    @     0x555a0e550a5e subtype_dealloc
    @     0x555a0e4a8198 dict_dealloc
    @     0x555a0e49de58 cell_dealloc
    @     0x555a0e4a7b77 frame_dealloc
    @     0x555a0e4a7a07 tb_dealloc
    @     0x555a0e4a7a1d tb_dealloc
    @     0x555a0e4a7a1d tb_dealloc
    @     0x555a0e57485b _PyEval_EvalFrameDefault
    @     0x555a0e54a9b9 PyEval_EvalCodeEx
    @     0x555a0e54b75c PyEval_EvalCode
    @     0x555a0e5cb744 run_mod
    @     0x555a0e5cbb41 PyRun_FileExFlags
    @     0x555a0e5cbd43 PyRun_SimpleFileExFlags
    @     0x555a0e5cf833 Py_Main
    @     0x555a0e49988e main
    @     0x7f0005c0f830 __libc_start_main
    @     0x555a0e579160 (unknown)

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

vvigilante commented 4 years ago

I confirm that the time of occurrence is completely random: it can crash within 10 minutes, or run for hours and then crash. I suspect this is caused by some sort of race condition between the multiple training processes.

ildoonet commented 4 years ago

@vvigilante @ZergWang I have absolutely no clue about your problems. (It may be a memory resource issue; please check your available memory while running.) Let me think about this for a while and get back to you.

gasvn commented 4 years ago

Same problem encountered here.

monkeyDemon commented 4 years ago

I also encountered this error, and I can also confirm that it occurred at the end of the test and occurred randomly.

I guess the most likely cause is a difference in how ray is used? README.md says: "Please read ray's documentation to construct a proper ray cluster: https://github.com/ray-project/ray, and run search.py with the master's redis address."

Can you give a more detailed tutorial on how to "construct a proper ray cluster"?

What I did was change this line in search.py; maybe doing it this way causes the problem?

ray.init(redis_address=args.redis)

was replaced with

ray.init(num_cpus=8, num_gpus=4)

Any suggestions would be appreciated, thanks!
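
For reference, a minimal sketch of the two ways the initialization can go (the address, password, and resource counts are placeholders, and the shell commands in the comments mirror the SLURM script posted above):

import ray

# Option A (what the original line in search.py does): attach to a cluster
# that was started by hand, roughly:
#   head node:    ray start --head --redis-port=6379 --redis-password=<pass>
#   worker nodes: ray start --address=<head_ip>:6379 --redis-password=<pass>
# Depending on your ray version, ray.init may also need the matching
# redis_password argument when the cluster was started with one.
redis_address = '192.168.0.10:6379'  # placeholder for the head node's ip:port
ray.init(redis_address=redis_address)

# Option B: no external cluster; let ray build a local, single-node one.
# The resource counts are only an example; adjust them to your machine.
# ray.init(num_cpus=8, num_gpus=4)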

gogo03 commented 4 years ago

I found that the following code in search.py is the key:

for epoch in tqdm_epoch:
    while True:
        epochs = OrderedDict()
        for exp_idx in range(num_experiments):
            try:
                if os.path.exists(default_path[exp_idx]):
                    latest_ckpt = torch.load(default_path[exp_idx])
                    epochs['default_exp%d' % (exp_idx + 1)] = latest_ckpt['epoch']
            except:
                pass
            try:
                if os.path.exists(augment_path[exp_idx]):
                    latest_ckpt = torch.load(augment_path[exp_idx])
                    epochs['augment_exp%d' % (exp_idx + 1)] = latest_ckpt['epoch']
            except:
                pass

        tqdm_epoch.set_postfix(epochs)
        if len(epochs) == num_experiments*2 and min(epochs.values()) >= C.get()['epoch']:
            is_done = True
        if len(epochs) == num_experiments*2 and min(epochs.values()) >= epoch:
            break
        time.sleep(10)#------------------important!!!!!
    if is_done:
        break

If I comment out "time.sleep(10)", the error appears. Maybe we can adjust the sleep duration to keep the processes synchronized, or something along those lines.

vvigilante commented 4 years ago

My guess is the following: for the sake of progress visualization, search.py tries to read the checkpoint. If this happens while train.py is writing the checkpoint, the file is in an inconsistent state and torch.load fails ungracefully (a torch bug).

A solution may be to NOT use torch to keep track of the last epoch run (e.g. use a plain text file instead), or to use some synchronization mechanism; see the sketch below.
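
A minimal sketch of the text-file idea, assuming we are free to touch both train.py (writer) and search.py (reader); the helper names are hypothetical and not part of the repo:

import os

def write_epoch_marker(path, epoch):
    # Writer side (train.py): write to a temporary file first, then atomically
    # rename it, so a concurrent reader never sees a half-written file.
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w') as f:
        f.write(str(epoch))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic on POSIX filesystems

def read_epoch_marker(path):
    # Reader side (the progress loop in search.py): no torch.load is needed
    # just to display progress; returns None if the marker does not exist yet.
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

The same rename trick could also be applied to the checkpoint itself (torch.save to a temporary path, then os.replace), which would close the window in which torch.load can observe a partially written file.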

I'd ask the authors to share the OS they are using, and possibly more information about their setup, to help us solve this issue.

NightQing commented 4 years ago

Same question here. Any progress?

TOM-tym commented 4 years ago

@gogo03 Hi, I've got the same error here. Do you have any solutions? I've tried commenting out "time.sleep(10)", but the error still occurs.

dongprojectteam commented 11 months ago

I encountered this error when my batch size was too big. When I changed the batch size to 2, it worked.