chensongzh commented 4 years ago

Dear developers,

Model training keeps failing on one of my data sets. There is no error message other than "Killed" at the end. Here is the complete command and output of the training run:

topaz train -n 100 --num-workers=8 --train-images processed/micrographs/ --train-targets particles.txt --save-prefix=save_model/model -o save_model/model_training.txt

Loading model: resnet8

Model parameters: units=32, dropout=0.0, bn=on

Loading pretrained model: resnet8_u32

Receptive field: 71

Using device=0 with cuda=True

Loaded 6038 training micrographs with 1000 labeled particles

source split p_observed num_positive_regions total_regions

0 train 2.17e-05 29000 1339089526

Specified expected number of particle per micrograph = 100.0

With radius = 3

Setting pi = 0.01307619816301961

minibatch_size=256, epoch_size=1000, num_epochs=10

Killed

Any suggestions?

Thanks in advance.

tbepler commented 4 years ago

This sounds like an issue with your system rather than topaz. Are you running topaz on a cluster?

chensongzh commented 4 years ago

Yes, I am running on a cluster. However, I don't have any issues running the tutorial and another data set on the cluster.

tbepler commented 4 years ago

My first guess would be that your dataset is larger than the tutorial dataset, so you are exceeding the RAM and/or time allocation of your job on the cluster. Unfortunately, I can only be of limited help in debugging cluster-related issues. You'll probably get better help by contacting your cluster administrator.
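
A rough way to sanity-check the RAM guess is to estimate how much memory the training images need once loaded, using only the numbers printed in the log. The sketch below is not part of topaz. It assumes that each "region" in the log corresponds to one pixel and that pixels are held in memory as 32-bit floats; both are assumptions, and the function name is made up for illustration.

```python
# Numbers copied from the training log in this thread.
TOTAL_REGIONS = 1_339_089_526
NUM_MICROGRAPHS = 6038

# Assumption (not confirmed by the log): one region is one pixel,
# stored in memory as a 32-bit float.
BYTES_PER_REGION = 4


def estimate_image_memory_gib(total_regions, bytes_per_region=BYTES_PER_REGION):
    """Return the approximate size in GiB of all training images in memory."""
    return total_regions * bytes_per_region / 2**30


# Average number of regions per micrograph, for reference.
regions_per_micrograph, remainder = divmod(TOTAL_REGIONS, NUM_MICROGRAPHS)

image_gib = estimate_image_memory_gib(TOTAL_REGIONS)

print(f"regions per micrograph: {regions_per_micrograph}")
print(f"estimated image memory: {image_gib:.2f} GiB")
```

Under these assumptions the images alone come to roughly 5 GiB, before counting the model, the training targets, or any extra copies that worker processes may hold. If the memory requested for the cluster job is close to or below that figure, a bare "Killed" message is consistent with the process being terminated for exceeding its memory limit.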

tbepler commented 4 years ago

I'm going to close this issue, but feel free to re-open it if it turns out this problem was topaz related.

tbepler / topaz

Model Training Failed #48
