Custom model training fails, need to downgrade torch (and setuptools)

Hi,

I am using the deepquestai/deepstack:gpu-2022.01.1 container for custom training. It ships with torch built for CUDA 11.3, but train.py fails right after initialization (see the error below). Downgrading to torch built for CUDA 11.0, as in the Colab notebook, resolves this:

pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

docker run --gpus all -it --rm -v /home/eouser/deepstack:/deepstack/code -w /deepstack/code/deepstack-trainer deepquestai/deepstack_updated:gpu python3 train.py --dataset-path /deepstack/code/data

Traceback (most recent call last):
  File "train.py", line 530, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 90, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device)  # create
  File "/deepstack/code/deepstack-trainer/models/yolo.py", line 96, in __init__
    self._initialize_biases()  # only run once
  File "/deepstack/code/deepstack-trainer/models/yolo.py", line 151, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
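For anyone who would rather not downgrade torch: the error comes from an in-place update on a view of a parameter that requires grad, which newer torch versions reject. Below is a minimal sketch of one possible workaround, doing the update under torch.no_grad(). It is an illustration only and has not been tested against deepstack-trainer; the helper name init_objectness_bias and the layer sizes (3 anchors, 85 outputs per anchor) are assumptions made for the example, not code from models/yolo.py.

```python
import math

import torch
import torch.nn as nn


def init_objectness_bias(conv, num_anchors, stride, img_size=640):
    """Add the objectness prior to a conv bias without autograd tracking.

    Hypothetical helper that mirrors the failing line in the traceback:
        b[:, 4] += math.log(8 / (640 / s) ** 2)
    """
    with torch.no_grad():
        # The view shares storage with conv.bias, so the in-place add
        # updates the parameter itself.
        b = conv.bias.view(num_anchors, -1)
        b[:, 4] += math.log(8 / (img_size / stride) ** 2)
    return conv


# Example layer: 3 anchors x 85 outputs (sizes chosen for illustration).
conv = nn.Conv2d(16, 3 * 85, kernel_size=1)
before = conv.bias.detach().clone()
init_objectness_bias(conv, num_anchors=3, stride=8)
delta = (conv.bias.detach() - before).view(3, -1)
```

After the call, only column 4 of each anchor's bias has changed, and conv.bias is still a leaf parameter with requires_grad=True, so training can proceed as usual.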

Note that I first had to downgrade setuptools inside the container; otherwise train.py throws:

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'setuptools._distutils' has no attribute 'version'

(resolved with: pip install setuptools==59.5.0)
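In case it helps others, here is a small sketch of a pre-flight check that could run before train.py to flag a setuptools version newer than the one that worked for me. The threshold 59.5.0 is just the version from this report, not a confirmed cut-off, and the function names are made up for the example.

```python
KNOWN_GOOD_SETUPTOOLS = "59.5.0"  # version reported as working in this issue


def parse_version(text):
    """Turn '59.5.0' into (59, 5, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def setuptools_needs_downgrade(installed, known_good=KNOWN_GOOD_SETUPTOOLS):
    """Return True if the installed version is newer than the known-good one."""
    return parse_version(installed) > parse_version(known_good)


if __name__ == "__main__":
    try:
        import setuptools
    except ImportError:
        print("setuptools is not installed")
    else:
        if setuptools_needs_downgrade(setuptools.__version__):
            print("setuptools %s is newer than %s; consider: "
                  "pip install setuptools==%s"
                  % (setuptools.__version__, KNOWN_GOOD_SETUPTOOLS,
                     KNOWN_GOOD_SETUPTOOLS))
        else:
            print("setuptools %s should be fine" % setuptools.__version__)
```

The comparison is kept as a pure function on version strings so it also works on the container's Python 3.7 without extra packages.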

I am now happily training with the revised setup, so nothing too urgent, but maybe worth checking out.

Thx for this wonderful framework!

Guido
