tensorflow / swift-models

Models and examples built with Swift for TensorFlow
Apache License 2.0

wip: resnet50 checkpointing #732

Closed brettkoonce closed 3 years ago

brettkoonce commented 3 years ago

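@BradLarson basically, the training loop code requires the model to be declared with var (not let) so that it can actually be modified during training. As a result, when the save callback is called it reads the model while the training loop's modification is still in progress, and we get a memory access conflict, e.g.: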

Simultaneous accesses to 0x5581af6921e0, but modification requires exclusive access.
Previous access (a modification) started at ResNet50-ImageNet`<unavailable> + 14202123 (0x5581aebaa50b).
Current access (a read) started at:
0    libswiftCore.so                    0x00007efe35e9a980 swift_beginAccess + 479
1    ResNet50-ImageNet                  0x00005581aebab525 <unavailable> + 14206245
2    ResNet50-ImageNet                  0x00005581aebac738 <unavailable> + 14210872
3    ResNet50-ImageNet                  0x00005581aebac5e2 <unavailable> + 14210530
4    ResNet50-ImageNet                  0x00005581aebac784 <unavailable> + 14210948
5    ResNet50-ImageNet                  0x00005581aee04a47 <unavailable> + 16669255
6    ResNet50-ImageNet                  0x00005581aee0f598 <unavailable> + 16713112
7    ResNet50-ImageNet                  0x00005581aee04240 <unavailable> + 16667200
8    ResNet50-ImageNet                  0x00005581aee06b56 <unavailable> + 16677718
9    ResNet50-ImageNet                  0x00005581aebaa656 <unavailable> + 14202454
10   libc.so.6                          0x00007efe17c77ab0 __libc_start_main + 231
11   ResNet50-ImageNet                  0x00005581adecfaba <unavailable> + 723642
Fatal access conflict detected.
Aborted (core dumped)

Is there a lazy trick (e.g. a shadow copy or a mutex of some form) to deal with this, or am I using the API incorrectly, or is there a better place to handle it?
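For illustration, the "shadow copy" idea could look roughly like the sketch below. It assumes the model is a value type (a struct), so assigning it to a new constant yields an independent value. `ToyModel`, `ToyTrainer`, and `onEpochEnd` are hypothetical stand-ins and not the swift-models training loop API.

```swift
// Hypothetical stand-in for a model; assumed to be a value type (struct).
struct ToyModel {
    var weights: [Float]
}

// Hypothetical trainer that mutates the model in place and then
// hands a copy of it to an optional callback.
struct ToyTrainer {
    var onEpochEnd: ((ToyModel) -> Void)? = nil

    func fit(_ model: inout ToyModel, epochs: Int) {
        for _ in 0..<epochs {
            // Stand-in for a real training step.
            for i in model.weights.indices {
                model.weights[i] += 1
            }
            // Shadow copy: the callback receives its own value and never
            // touches the variable that is under inout (modifying) access.
            let snapshot = model
            onEpochEnd?(snapshot)
        }
    }
}

var model = ToyModel(weights: [0, 0])
var saved: [[Float]] = []

var trainer = ToyTrainer()
// The closure captures `saved`, not `model`, so it does not conflict
// with the exclusive access held by `fit(&model, ...)`.
trainer.onEpochEnd = { snapshot in saved.append(snapshot.weights) }
trainer.fit(&model, epochs: 2)

print(saved)  // [[1.0, 1.0], [2.0, 2.0]]
```

The conflict in the trace comes from a callback reading the same variable that the training call is modifying; passing the snapshot as an argument avoids that overlap in this toy setup.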

brettkoonce commented 3 years ago

dispatch queues ftw
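The comment does not say how dispatch queues were used. One plausible reading, sketched here under that assumption, is to hand a snapshot of the model to a serial `DispatchQueue` so the save runs outside the training loop's modifying access. `ToyModel`, `CheckpointStore`, and `train` are hypothetical names, and the sketch assumes the Swift 5 language mode.

```swift
import Dispatch

// Hypothetical stand-in for a model; assumed to be a value type (struct).
struct ToyModel {
    var weights: [Float]
}

// Hypothetical checkpoint sink. All access to `saved` goes through one
// serial queue, so saves are ordered and never overlap each other.
final class CheckpointStore {
    private let queue = DispatchQueue(label: "checkpoint.save")
    private var saved: [[Float]] = []

    // Enqueue a save; returns immediately to the caller.
    func save(_ snapshot: ToyModel) {
        queue.async {
            self.saved.append(snapshot.weights)
        }
    }

    // Wait for all previously enqueued saves, then return what was stored.
    func all() -> [[Float]] {
        return queue.sync { saved }
    }
}

// Stand-in for the training loop: mutates the model in place.
func train(_ model: inout ToyModel, epochs: Int, store: CheckpointStore) {
    for _ in 0..<epochs {
        for i in model.weights.indices {
            model.weights[i] += 1
        }
        // Copy first, then enqueue: the queued block only sees the copy.
        let snapshot = model
        store.save(snapshot)
    }
}

let store = CheckpointStore()
var model = ToyModel(weights: [0, 0])
train(&model, epochs: 3, store: store)

print(store.all())  // [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```

Because the queue is serial, `all()` only returns after every earlier `save` has run, which makes the stored checkpoints deterministic in this sketch.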