Closed chensongzh closed 4 years ago
This sounds like an issue with your system rather than topaz. Are you running topaz on a cluster?
Yes, I am running on a cluster. However, I don't have any issues running the tutorial and another data set on the cluster.
My first guess is that this dataset is larger than the tutorial dataset, so the job is exceeding its RAM and/or time allocation on the cluster. Unfortunately, I can only be of limited help in debugging cluster-related issues. You'll probably get better help by contacting your cluster administrator.
I'm going to close this issue, but feel free to re-open it if the problem turns out to be topaz-related.
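As a rough sanity check on the memory guess, the figures in the training log from the original post can be turned into a back-of-the-envelope estimate. This is only a sketch under stated assumptions: that `total_regions` is the total pixel count across all micrographs, that every micrograph is held in memory at once, and that each pixel is stored as a 32-bit float. None of this is confirmed topaz behavior, and the function name is made up for illustration.

```python
# Figures copied from the training log in the original post.
NUM_MICROGRAPHS = 6038
TOTAL_REGIONS = 1339089526  # "total_regions" column; assumed to be total pixels

# Assumption: pixels are stored as 32-bit floats (4 bytes each).
BYTES_PER_PIXEL = 4


def estimate_image_memory_gib(total_regions, bytes_per_pixel=BYTES_PER_PIXEL):
    """Hypothetical helper: GiB needed to hold every pixel in memory at once."""
    return total_regions * bytes_per_pixel / 2**30


image_gib = estimate_image_memory_gib(TOTAL_REGIONS)
pixels_per_micrograph = TOTAL_REGIONS / NUM_MICROGRAPHS

print(f"pixels per micrograph: {pixels_per_micrograph:.0f}")
print(f"raw image data: {image_gib:.2f} GiB")
```

Under these assumptions the raw image data alone comes to roughly 5 GiB, before counting the model, Python overhead, or any copies held by the 8 worker processes. If the job's memory request is in that range, being killed without a Python traceback would be consistent with hitting the limit. The scheduler's accounting for the job is the place to confirm this.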
Dear developers,
Model training keeps failing on one of my data sets. There is no error message other than "Killed" at the end. Here is the complete command and output from training:
topaz train -n 100 --num-workers=8 --train-images processed/micrographs/ --train-targets particles.txt --save-prefix=save_model/model -o save_model/model_training.txt
Loading model: resnet8
Model parameters: units=32, dropout=0.0, bn=on
Loading pretrained model: resnet8_u32
Receptive field: 71
Using device=0 with cuda=True
Loaded 6038 training micrographs with 1000 labeled particles
source split p_observed num_positive_regions total_regions
0 train 2.17e-05 29000 1339089526
Specified expected number of particle per micrograph = 100.0
With radius = 3
Setting pi = 0.01307619816301961
minibatch_size=256, epoch_size=1000, num_epochs=10
Killed
Any suggestions?
Thanks in advance.