horsepurve / DeepRTplus

Deep (Transfer) Learning for Peptide Retention Time Prediction
MIT License

Training failing #12

Closed · stsour closed this issue 3 years ago

stsour commented 3 years ago

Hi there,

I was quite successful using DeepRTplus last year, but I am now having trouble working with a new dataset. I have attached my testing and training files. The training runs without error, but the resulting correlation with the test set is very low, 0.2-0.25.

I tried running the software again on the previous dataset I had used, with similar results. I also tried re-cloning the repository and starting from unmodified files, with no luck. I then tried with your datasets using

python data_split.py data/mod.txt 9 1 2
python capsule_network_emb.py

and the following error was thrown:

Traceback (most recent call last):
  File "capsule_network_emb_cpu.py", line 743, in <module>
    engine.train(processor, get_rt_iterator(True), maxepoch=NUM_EPOCHS, optimizer=optimizer)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torchnet/engine/engine.py", line 63, in train
    state['optimizer'].step(closure)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torch/optim/adam.py", line 62, in step
    loss = closure()
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torchnet/engine/engine.py", line 56, in closure
    self.hook('on_forward', state)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torchnet/engine/engine.py", line 31, in hook
    self.hooks[name](state)
  File "capsule_network_emb_cpu.py", line 575, in on_forward
    meter_loss.add(state['loss'].data[0])
IndexError: invalid index of a 0-dim tensor. Use tensor.item() in Python or tensor.item<T>() in C++ to convert a 0-dim tensor to a number

Any idea what the issue might be?

Thanks!

horsepurve commented 3 years ago

This is caused by the out-of-date PyTorch version (<=0.4) this repo was written for; see here for reference. For a more recent PyTorch (>=0.5), please modify the line as follows:

For the CPU version (line 575):

meter_loss.add(state['loss'].data)

For the GPU version (line 575):

meter_loss.add(state['loss'].cpu().data)
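As an alternative, the error message itself suggests .item(), which returns a plain Python float and works for both CPU and GPU tensors. A minimal sketch of what the on_forward hook around line 575 might then look like (the surrounding hook structure here is inferred from the traceback, not copied from the repo):

    # sketch only: hook and meter names follow the traceback above
    def on_forward(state):
        # on PyTorch >= 0.5, state['loss'] is a 0-dim tensor;
        # .item() converts it to a Python number without needing a separate .cpu() call
        meter_loss.add(state['loss'].item())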
stsour commented 3 years ago

Thanks, that solved the error I was seeing. However, now I am getting this error when testing with your data:

RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 1] because the unspecified dimension size -1 can be any value and is ambiguous

I tried changing the batch size in capsule_network_emb_cpu.py, but this did not help.

horsepurve commented 3 years ago

Which specific script are you running when you get this error? I tried capsule_network_emb_cpu.py and prediction_emb_cpu.py and didn't see it. Could you show the complete error message?

stsour commented 3 years ago

I was getting that error when running capsule_network_emb_cpu.py. I tried again and it worked on your dataset, but not mine. The losses do seem to be lower than before though.

Here is the complete error message I am getting. I also attached my training and testing data.

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [40:22<00:00, 10.32it/s]

[Epoch 1] Training Loss: 0.0463 (MSE: 7.0000)
[Epoch 1] Testing Loss: 0.0414 (MSE: 7.0000)
Traceback (most recent call last):
  File "capsule_network_emb_cpu.py", line 743, in <module>
    engine.train(processor, get_rt_iterator(True), maxepoch=NUM_EPOCHS, optimizer=optimizer)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torchnet/engine/engine.py", line 67, in train
    self.hook('on_end_epoch', state)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torchnet/engine/engine.py", line 31, in hook
    self.hooks[name](state)
  File "capsule_network_emb_cpu.py", line 671, in on_end_epoch
    pred_batch = model(test_batch)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "capsule_network_emb_cpu.py", line 275, in forward
    x = self.primary_capsules(x)
  File "/home/tsour.s/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "capsule_network_emb_cpu.py", line 149, in forward
    outputs = [capsule(x).view(x.size(0), -1, 1) for capsule in self.capsules]
  File "capsule_network_emb_cpu.py", line 149, in <listcomp>
    outputs = [capsule(x).view(x.size(0), -1, 1) for capsule in self.capsules]
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 1] because the unspecified dimension size -1 can be any value and is ambiguous

ST_DeepRTplus_test.txt ST_DeepRTplus_train.txt

horsepurve commented 3 years ago

Thanks for sharing the data, now I am able to reproduce the error. It turns out the cause is similar to before: the number of testing samples (100000) is exactly divisible by the prediction batch size (16), which leaves an empty final batch and triggers the 0-element reshape. Please change line 632 to, e.g.,

PRED_BATCH = 166 # any number that does not evenly divide the number of test samples
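If it helps, here is a quick sanity check you could run before training (a sketch only: the file name and the PRED_BATCH constant are taken from this thread, and it assumes one sample per line in the test file, which may not match your data exactly):

    # sketch: confirm the prediction batch size leaves a non-empty last batch
    with open('ST_DeepRTplus_test.txt') as f:
        n_test = sum(1 for _ in f)  # number of test samples, assuming one per line
    PRED_BATCH = 166
    assert n_test % PRED_BATCH != 0, "pick a PRED_BATCH that does not divide the test set size"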

It seems the correlation is now much higher (~0.974 after the 1st epoch).

stsour commented 3 years ago

Ahhh, now I feel silly. I had tested another batch size, but I guess it also happened to divide evenly. My mistake, I should have checked that more thoroughly. Just tried it again, and it worked like a charm 👍

Thanks so much for your help, I really appreciate your responsiveness!