`example-dvc-experiments`: reproducibility of results

When trying example-dvc-experiments, I'm getting much worse results than the baseline or the other experiments shown in the docs:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Experiment                         ┃ Created      ┃    loss ┃    acc ┃ train.epochs ┃ model.conv_units ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace                          │ -            │  4.1566 │ 0.7103 │ 10           │ 512              │
│ baseline-experiment                │ Sep 06, 2021 │ 0.23657 │ 0.9127 │ 10           │ 16               │
│ ├── 750c7fd [exp-5ac8e]            │ 08:29 PM     │  4.1566 │ 0.7103 │ 10           │ 512              │
│ ├── b7360be [model.conv_units=48]  │ 08:26 PM     │  3.6616 │ 0.7113 │ 10           │ 256              │
│ ├── 2e29365 [model.conv_units=512] │ 08:26 PM     │  3.6616 │ 0.7113 │ 10           │ 256              │
│ ├── f915fc6 [exp-44136]            │ Oct 26, 2021 │  3.6616 │ 0.7113 │ 10           │ 256              │
│ ├── 4c1c209 [exp-622a9]            │ Oct 26, 2021 │  3.6209 │ 0.7069 │ 10           │ 24               │
│ ├── 32b517e [exp-97bf9]            │ Oct 26, 2021 │  3.8937 │ 0.7085 │ 10           │ 128              │
│ ├── 40022a2 [exp-68b28]            │ Oct 26, 2021 │  3.9879 │ 0.7094 │ 10           │ 32               │
│ ├── d6b4c37 [exp-b53ed]            │ Oct 26, 2021 │  3.8086 │ 0.7074 │ 10           │ 64               │
│ ├── 7c1c0be [exp-90211]            │ Oct 26, 2021 │  3.6616 │ 0.7113 │ 10           │ 256              │
│ └── 8a1a4c2 [exp-96581]            │ Oct 26, 2021 │ 0.23657 │ 0.9127 │ 10           │ 16               │
└────────────────────────────────────┴──────────────┴─────────┴────────┴──────────────┴──────────────────┘

@iesahin Any idea what I'm doing wrong? I am on an M1 Mac, but I doubt it could make that big a difference.

Is this with the default data set?

Increasing the number of conv units beyond 96 or so decreases the performance a bit, but this is something different.

Yes, this is with the default data set. All I did was run dvc pull followed by dvc exp run commands.

Here's a much simpler reproducible example (skipping some irrelevant output from the extract stage):

$ git clone git@github.com:iterative/example-dvc-experiments.git
$ cd example-dvc-experiments
$ dvc pull
$ dvc exp run -f
...
Running stage 'train':
> python3 src/train.py
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
reshape (Reshape)            (None, 28, 28, 1)         0
_________________________________________________________________
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
_________________________________________________________________
dropout (Dropout)            (None, 13, 13, 32)        0
_________________________________________________________________
flatten (Flatten)            (None, 5408)              0
_________________________________________________________________
dense (Dense)                (None, 128)               692352
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290
=================================================================
Total params: 693,962
Trainable params: 693,962
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
469/469 [==============================] - 6s 12ms/step - loss: 0.6642 - acc: 0.7757 - val_loss: 2.2720 - val_acc: 0.6829
Epoch 2/10
469/469 [==============================] - 5s 12ms/step - loss: 0.3515 - acc: 0.8727 - val_loss: 2.8037 - val_acc: 0.6881
Epoch 3/10
469/469 [==============================] - 5s 12ms/step - loss: 0.3108 - acc: 0.8866 - val_loss: 2.6615 - val_acc: 0.6938
Epoch 4/10
469/469 [==============================] - 5s 12ms/step - loss: 0.2834 - acc: 0.8959 - val_loss: 3.0042 - val_acc: 0.6975
Epoch 5/10
469/469 [==============================] - 5s 12ms/step - loss: 0.2603 - acc: 0.9024 - val_loss: 2.7750 - val_acc: 0.7030
Epoch 6/10
469/469 [==============================] - 5s 12ms/step - loss: 0.2486 - acc: 0.9080 - val_loss: 3.4210 - val_acc: 0.7045
Epoch 7/10
469/469 [==============================] - 6s 12ms/step - loss: 0.2354 - acc: 0.9126 - val_loss: 3.2173 - val_acc: 0.7043
Epoch 8/10
469/469 [==============================] - 6s 13ms/step - loss: 0.2227 - acc: 0.9156 - val_loss: 3.6057 - val_acc: 0.7061
Epoch 9/10
469/469 [==============================] - 6s 12ms/step - loss: 0.2112 - acc: 0.9204 - val_loss: 3.5412 - val_acc: 0.7105
Epoch 10/10
469/469 [==============================] - 6s 12ms/step - loss: 0.1951 - acc: 0.9273 - val_loss: 3.6238 - val_acc: 0.7069
79/79 [==============================] - 0s 4ms/step - loss: 3.6238 - acc: 0.7069
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add data/images.tar.gz.dvc metrics.json dvc.lock logs.csv dvc.yaml src/train.py params.yaml data/images

Reproduced experiment(s): exp-54e7e
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

        dvc exp branch <exp> <branch>

$ dvc exp show
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Experiment              ┃ Created      ┃    loss ┃    acc ┃ train.epochs ┃ model.conv_units ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace               │ -            │  3.6238 │ 0.7069 │ 10           │ 16               │
│ baseline-experiment     │ Sep 06, 2021 │ 0.23657 │ 0.9127 │ 10           │ 16               │
│ └── 86616cb [exp-54e7e] │ 10:53 AM     │  3.6238 │ 0.7069 │ 10           │ 16               │
└─────────────────────────┴──────────────┴─────────┴────────┴──────────────┴──────────────────┘

I couldn't reproduce the problem, but there is another, major issue: model.conv_units from params.yaml is not passed to the get_model function, so it always returns the default model with 32 conv units. This wasn't the case when I was testing the repo. I remember because when I ran dvc exp run -S model.conv_units=256, the experiment was noticeably slower.

Looking at the history, train.py seems not to have changed since I created it in July, and this is a bit weird too. I'll try to understand the reason and revise the files.

I'm fixing this and will take a look at it on M1 Mac when I test this. Thank you.
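For anyone following along, here is a rough sketch of what passing the parameter through could look like. It is not the actual train.py: the helper conv_units_from_params, the dropout rate, and the activations are assumptions made for illustration, and the layer list is only inferred from the printed model summary. params.yaml is represented by an already-parsed dict so the snippet needs no YAML library.

```python
DEFAULT_CONV_UNITS = 32  # the default the thread reports the model falling back to


def conv_units_from_params(params, default=DEFAULT_CONV_UNITS):
    """Read model.conv_units from an already-parsed params.yaml mapping (hypothetical helper)."""
    return int(params.get("model", {}).get("conv_units", default))


def get_model(conv_units=DEFAULT_CONV_UNITS):
    """Build a model whose layers mirror the summary printed in this thread."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    return tf.keras.models.Sequential([
        tf.keras.layers.Reshape((28, 28, 1), input_shape=(28, 28)),
        tf.keras.layers.Conv2D(conv_units, (3, 3), activation="relu"),  # activation assumed
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.5),  # rate assumed; the summary does not show it
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),  # activation assumed
        tf.keras.layers.Dense(10, activation="softmax"),  # activation assumed
    ])


# Stand-in for the parsed params.yaml after `dvc exp run -S model.conv_units=256`.
params = {"train": {"epochs": 10}, "model": {"conv_units": 256}}
units = conv_units_from_params(params)
print(units)
# model = get_model(conv_units=units)  # the argument that was reportedly not being passed
```

The point is only that the value read from the params file has to reach the Conv2D layer explicitly; otherwise every experiment trains the same default model regardless of the -S override.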

My recent results are like this:

The model is:

(Yours has 32 as the number of conv units.)
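The number of conv units a run actually used can be read back from the "Total params" line of the printed summary. A small sketch of the arithmetic, assuming the layer shapes shown in the summaries in this thread (3x3 kernel on a 28x28x1 input, 2x2 pooling, Dense(128), Dense(10)); the function name is made up for illustration:

```python
def param_count(conv_units, image_size=28, kernel=3, dense_units=128, classes=10):
    """Total trainable parameters for the architecture printed in the summaries."""
    conv = (kernel * kernel * 1 + 1) * conv_units   # weights + bias per filter
    conv_out = image_size - kernel + 1              # 28 -> 26 (no padding)
    pooled = conv_out // 2                          # 26 -> 13 after 2x2 max pooling
    flat = pooled * pooled * conv_units             # size of the flattened vector
    dense = (flat + 1) * dense_units                # Dense(128) weights + biases
    out = (dense_units + 1) * classes               # Dense(10) weights + biases
    return conv + dense + out


for units in (16, 32, 256):
    print(units, param_count(units))
# 16 347690
# 32 693962
# 256 5541770
```

A total of 693,962 therefore corresponds to 32 conv units and 5,541,770 to 256, whatever model.conv_units says in dvc exp show.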

Could the problem be about this: https://github.com/apple/tensorflow_macos/issues/55

and the following lines (90-93) in train.py?

    training_labels = tf.keras.utils.to_categorical(
        training_labels, num_classes=10, dtype="float32")
    testing_labels = tf.keras.utils.to_categorical(
        testing_labels, num_classes=10, dtype="float32")

Could you test with dtype="float16" for the time being? I can remove this categorical conversion and use another set of layers.

I rebuilt the repository with updated train.py. While pushing the experiments, I'm getting the following error:

Will take a look into this tomorrow.

I worked on this a bit and installed TF and TF-Metal, but I got a set of exceptions:

Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
reshape (Reshape)            (None, 28, 28, 1)         0
_________________________________________________________________
conv2d (Conv2D)              (None, 26, 26, 256)       2560
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 256)       0
_________________________________________________________________
dropout (Dropout)            (None, 13, 13, 256)       0
_________________________________________________________________
flatten (Flatten)            (None, 43264)             0
_________________________________________________________________
dense (Dense)                (None, 128)               5537920
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290
=================================================================
Total params: 5,541,770
Trainable params: 5,541,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
2021-11-01 19:18:51.582 python3[3747:33381] -[MPSGraph adamUpdateWithLearningRateTensor:beta1Tensor:beta2Tensor:epsilonTensor:beta1PowerTensor:beta2PowerTensor:valuesTensor:momentumTensor:velocityTensor:maximumVelocityTensor:gradientTensor:name:]: unrecognized selector sent to instance 0x16bf4a520
2021-11-01 19:18:51.603 python3[3747:33381] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[MPSGraph adamUpdateWithLearningRateTensor:beta1Tensor:beta2Tensor:epsilonTensor:beta1PowerTensor:beta2PowerTensor:valuesTensor:momentumTensor:velocityTensor:maximumVelocityTensor:gradientTensor:name:]: unrecognized selector sent to instance 0x16bf4a520'
*** First throw call stack:
(
    0   CoreFoundation                      0x000000018e66b838 __exceptionPreprocess + 240
    1   libobjc.A.dylib                     0x000000018e3950a8 objc_exception_throw + 60
    2   CoreFoundation                      0x000000018e6fc694 -[NSObject(NSObject) __retain_OA] + 0
    3   CoreFoundation                      0x000000018e5cccd4 ___forwarding___ + 1444
    4   CoreFoundation                      0x000000018e5cc670 _CF_forwarding_prep_0 + 96
    5   libmetal_plugin.dylib               0x0000000155f5a290 _ZN12metal_plugin14MPSApplyAdamOpIfEC2EPNS_20OpKernelConstructionE + 656
    6   libmetal_plugin.dylib               0x0000000155f59ebc _ZN12metal_pluginL14CreateOpKernelINS_14MPSApplyAdamOpIfEEEEPvP23TF_OpKernelConstruction + 52
    7   libtensorflow_framework.2.dylib     0x00000001300f85d4 _ZN10tensorflow12_GLOBAL__N_120KernelBuilderFactory6CreateEPNS_20OpKernelConstructionE + 88
    8   libtensorflow_framework.2.dylib     0x000000013017a158 _ZN10tensorflow14CreateOpKernelENS_10DeviceTypeEPNS_10DeviceBaseEPNS_9AllocatorEPNS_22FunctionLibraryRuntimeEPNS_11ResourceMgrERKNSt3__110shared_ptrIKNS_14NodePropertiesEEEiPPNS_8OpKernelE + 784
    9   libtensorflow_framework.2.dylib     0x00000001303552b8 _ZN10tensorflow21CreateNonCachedKernelEPNS_6DeviceEPNS_22FunctionLibraryRuntimeERKNSt3__110shared_ptrIKNS_14NodePropertiesEEEiPPNS_8OpKernelE + 272
    10  libtensorflow_framework.2.dylib     0x00000001302ffc20 _ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNSt3__110shared_ptrIKNS_14NodePropertiesEEEPNS_22FunctionLibraryRuntimeEPPNS_8OpKernelE + 600
    11  libtensorflow_framework.2.dylib     0x000000013036a430 _ZN10tensorflow22ImmutableExecutorState10InitializeERKNS_5GraphE + 1192
    12  libtensorflow_framework.2.dylib     0x0000000130355064 _ZN10tensorflow16NewLocalExecutorERKNS_19LocalExecutorParamsERKNS_5GraphEPPNS_8ExecutorE + 304
    13  libtensorflow_framework.2.dylib     0x0000000130362e6c _ZN10tensorflow12_GLOBAL__N_124DefaultExecutorRegistrar7Factory11NewExecutorERKNS_19LocalExecutorParamsERKNS_5GraphEPNSt3__110unique_ptrINS_8ExecutorENS9_14default_deleteISB_EEEE + 48
    14  libtensorflow_framework.2.dylib     0x00000001303637e8 _ZN10tensorflow11NewExecutorERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEERKNS_19LocalExecutorParamsERKNS_5GraphEPNS0_10unique_ptrINS_8ExecutorENS0_14default_deleteISG_EEEE + 92
    15  libtensorflow_framework.2.dylib     0x0000000130302278 _ZN10tensorflow26FunctionLibraryRuntimeImpl10CreateItemEPPNS0_4ItemE + 2676
    16  libtensorflow_framework.2.dylib     0x000000013030306c _ZN10tensorflow26FunctionLibraryRuntimeImpl3RunERKNS_22FunctionLibraryRuntime7OptionsEyN4absl12lts_202103244SpanIKNS_6TensorEEEPNSt3__16vectorIS8_NSB_9allocatorIS8_EEEENSB_8functionIFvRKNS_6StatusEEEE + 676
    17  libtensorflow_framework.2.dylib     0x00000001303110c0 _ZNK10tensorflow29ProcessFunctionLibraryRuntime14RunMultiDeviceERKNS_22FunctionLibraryRuntime7OptionsEyPNSt3__16vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEENS5_9allocatorISC_EEEEPNS6_INS5_10unique_ptrINS0_11CleanUpItemENS5_14default_deleteISI_EEEENSD_ISL_EEEENS5_8functionIFvRKNS_6StatusEEEENSP_IFSQ_RKNS0_21ComponentFunctionDataEPNS0_12InternalArgsEEEE + 2640
    18  libtensorflow_framework.2.dylib     0x0000000130314098 _ZNK10tensorflow29ProcessFunctionLibraryRuntime3RunERKNS_22FunctionLibraryRuntime7OptionsEyN4absl12lts_202103244SpanIKNS_6TensorEEEPNSt3__16vectorIS8_NSB_9allocatorIS8_EEEENSB_8functionIFvRKNS_6StatusEEEE + 2012
    19  libtensorflow_framework.2.dylib     0x0000000130314868 _ZNK10tensorflow29ProcessFunctionLibraryRuntime7RunSyncERKNS_22FunctionLibraryRuntime7OptionsEyN4absl12lts_202103244SpanIKNS_6TensorEEEPNSt3__16vectorIS8_NSB_9allocatorIS8_EEEE + 160
    20  _pywrap_tensorflow_internal.so      0x000000011b71d554 _ZN10tensorflow19KernelAndDeviceFunc3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPNSt3__16vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEENS6_9allocatorISD_EEEEPNS_19CancellationManagerERKNS9_8optionalINS_25EagerRemoteFunctionParamsEEERKNSK_INS_17ManagedStackTraceEEE + 516
    21  _pywrap_tensorflow_internal.so      0x000000011b6e7d60 _ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ENSt3__19allocatorIS6_EEEERKNS3_8optionalINS_25EagerRemoteFunctionParamsEEERKNS7_10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSD_INS_17ManagedStackTraceEEE + 372
    22  _pywrap_tensorflow_internal.so      0x000000011b6ee3c4 _ZN10tensorflow11ExecuteNode3RunEv + 396
    23  _pywrap_tensorflow_internal.so      0x000000011ba29764 _ZN10tensorflow13EagerExecutor11SyncExecuteEPNS_9EagerNodeE + 172
    24  _pywrap_tensorflow_internal.so      0x000000011b6e789c _ZN10tensorflow12_GLOBAL__N_117EagerLocalExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi + 1976
    25  _pywrap_tensorflow_internal.so      0x000000011b6e5a44 _ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi + 296
    26  _pywrap_tensorflow_internal.so      0x000000011b34aba4 _ZN10tensorflow14EagerOperation7ExecuteEN4absl12lts_202103244SpanIPNS_20AbstractTensorHandleEEEPi + 192
    27  _pywrap_tensorflow_internal.so      0x000000011b72392c _ZN10tensorflow21CustomDeviceOpHandler7ExecuteEPNS_27ImmediateExecutionOperationEPPNS_30ImmediateExecutionTensorHandleEPi + 468
    28  _pywrap_tensorflow_internal.so      0x0000000117f6ff38 TFE_Execute + 80
    29  _pywrap_tensorflow_internal.so      0x0000000117eecac0 _Z24TFE_Py_ExecuteCancelableP11TFE_ContextPKcS2_PN4absl12lts_2021032413InlinedVectorIP16TFE_TensorHandleLm4ENSt3__19allocatorIS7_EEEEP7_objectP23TFE_CancellationManagerPNS5_IS7_Lm2ESA_EEP9TF_Status + 616
    30  _pywrap_tfe.so                      0x000000013158e41c _ZN10tensorflow32TFE_Py_ExecuteCancelable_wrapperERKN8pybind116handleEPKcS5_S3_S3_PNS_19CancellationManagerES3_ + 160
    31  _pywrap_tfe.so                      0x00000001315bf208 _ZZN8pybind1112cpp_function10initializeIZL25pybind11_init__pywrap_tfeRNS_7module_EE4$_44NS_6objectEJRKNS_6handleEPKcSA_S8_S8_S8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE_8__invokeESR_ + 184
    32  _pywrap_tfe.so                      0x00000001315a10e0 _ZN8pybind1112cpp_function10dispatcherEP7_objectS2_S2_ + 3216
    33  python3                             0x00000001028d3214 cfunction_call + 80
    34  python3                             0x000000010287fa84 _PyObject_MakeTpCall + 340
    35  python3                             0x000000010298f60c call_function + 724
    36  python3                             0x000000010298bca4 _PyEval_EvalFrameDefault + 29268
    37  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    38  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    39  python3                             0x000000010298f574 call_function + 572
    40  python3                             0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
    41  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    42  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    43  python3                             0x000000010288358c method_vectorcall + 164
    44  python3                             0x000000010298f574 call_function + 572
    45  python3                             0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
    46  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    47  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    48  python3                             0x000000010288358c method_vectorcall + 164
    49  python3                             0x000000010298f574 call_function + 572
    50  python3                             0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
    51  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    52  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    53  python3                             0x000000010287fd04 _PyObject_FastCallDictTstate + 320
    54  python3                             0x0000000102880a7c _PyObject_Call_Prepend + 164
    55  python3                             0x00000001028f71c8 slot_tp_call + 376
    56  python3                             0x00000001028804d0 _PyObject_Call + 156
    57  python3                             0x000000010298bfd8 _PyEval_EvalFrameDefault + 30088
    58  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    59  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    60  python3                             0x00000001028836ec method_vectorcall + 516
    61  python3                             0x000000010298bfd8 _PyEval_EvalFrameDefault + 30088
    62  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    63  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    64  python3                             0x000000010287fd04 _PyObject_FastCallDictTstate + 320
    65  python3                             0x0000000102880a7c _PyObject_Call_Prepend + 164
    66  python3                             0x00000001028f71c8 slot_tp_call + 376
    67  python3                             0x000000010287fa84 _PyObject_MakeTpCall + 340
    68  python3                             0x000000010298f60c call_function + 724
    69  python3                             0x000000010298bca4 _PyEval_EvalFrameDefault + 29268
    70  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    71  python3                             0x0000000102880700 _PyFunction_Vectorcall + 240
    72  python3                             0x000000010288358c method_vectorcall + 164
    73  python3                             0x000000010298f574 call_function + 572
    74  python3                             0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
    75  python3                             0x0000000102880780 function_code_fastcall + 116
    76  python3                             0x000000010298f574 call_function + 572
    77  python3                             0x000000010298bd24 _PyEval_EvalFrameDefault + 29396
    78  python3                             0x0000000102984408 _PyEval_EvalCode + 2968
    79  python3                             0x00000001029e77dc pyrun_file + 376
    80  python3                             0x00000001029e6cf0 PyRun_SimpleFileExFlags + 816
    81  python3                             0x0000000102a09eb0 Py_RunMain + 2916
    82  python3                             0x0000000102a0b044 pymain_main + 1272
    83  python3                             0x00000001028266d0 main + 56
    84  libdyld.dylib                       0x000000018e50d430 start + 4
)
libc++abi: terminating with uncaught exception of type NSException
[1]    3747 abort      python3 src/train.py

It looks like I need to upgrade my system: https://github.com/tensorflow/tensorflow/issues/50196. I'll take a look after the upgrade.

Yes, just tested out with 12.0.1 and it works as expected now! Thanks for the research.
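For reference, a quick way to check the macOS version before installing TF-Metal. This is only a sketch: the 12.0 threshold is an assumption based on the 12.0.1 result above and the linked issue, not something taken from the repo, and the function name is made up.

```python
import platform

MIN_MACOS = (12, 0)  # assumed minimum, based on this thread


def macos_is_new_enough(version_text, minimum=MIN_MACOS):
    """Return True if a dotted version string such as '12.0.1' is at least `minimum`."""
    parts = [int(p) for p in version_text.split(".") if p.isdigit()]
    while len(parts) < len(minimum):
        parts.append(0)  # treat "12" as "12.0"
    return tuple(parts) >= minimum


release = platform.mac_ver()[0]  # empty string when not running on macOS
if not release:
    print("Not running on macOS; nothing to check.")
elif macos_is_new_enough(release):
    print(f"macOS {release}: new enough under the assumed threshold.")
else:
    print(f"macOS {release}: older than the assumed threshold, consider upgrading.")
```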

Although I was finally able to upgrade my system, I still couldn't install tensorflow on macOS. I'm getting some weird errors like:

ERROR: Cannot install tensorflow-macos==2.5.0 and tensorflow-macos==2.6.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    tensorflow-macos 2.6.0 depends on h5py~=3.1.0
    tensorflow-macos 2.5.0 depends on h5py~=3.1.0

I'll close this issue if it works as expected for you, @dberenbaum.

It works for me, although I remember having a lot of trouble getting tensorflow installed on macos. Maybe there's some existing tensorflow macos installation that's causing the conflict? Are you following https://developer.apple.com/metal/tensorflow-plugin/?

Yes, but mine is an older M1 installation, and probably I made some customization along the way that now makes it difficult to install. I'm closing this. Thank you.

iterative / example-repos-dev

`example-dvc-experiments`: reproducibility of results #93