Closed dberenbaum closed 2 years ago
Is this with the default data set?
Increasing the number of conv units beyond 96 or so decreases the performance a bit, but this is something different.
Is this with the default data set?
Yes. All I did was run dvc pull followed by dvc exp run.
Here's a much simpler reproducible example (skipping some irrelevant output from the extract stage):
$ git clone git@github.com:iterative/example-dvc-experiments.git
$ cd example-dvc-experiments
$ dvc pull
$ dvc exp run -f
...
Running stage 'train':
> python3 src/train.py
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
reshape (Reshape) (None, 28, 28, 1) 0
_________________________________________________________________
conv2d (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
dropout (Dropout) (None, 13, 13, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 5408) 0
_________________________________________________________________
dense (Dense) (None, 128) 692352
_________________________________________________________________
dense_1 (Dense) (None, 10) 1290
=================================================================
Total params: 693,962
Trainable params: 693,962
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
469/469 [==============================] - 6s 12ms/step - loss: 0.6642 - acc: 0.7757 - val_loss: 2.2720 - val_acc: 0.6829
Epoch 2/10
469/469 [==============================] - 5s 12ms/step - loss: 0.3515 - acc: 0.8727 - val_loss: 2.8037 - val_acc: 0.6881
Epoch 3/10
469/469 [==============================] - 5s 12ms/step - loss: 0.3108 - acc: 0.8866 - val_loss: 2.6615 - val_acc: 0.6938
Epoch 4/10
469/469 [==============================] - 5s 12ms/step - loss: 0.2834 - acc: 0.8959 - val_loss: 3.0042 - val_acc: 0.6975
Epoch 5/10
469/469 [==============================] - 5s 12ms/step - loss: 0.2603 - acc: 0.9024 - val_loss: 2.7750 - val_acc: 0.7030
Epoch 6/10
469/469 [==============================] - 5s 12ms/step - loss: 0.2486 - acc: 0.9080 - val_loss: 3.4210 - val_acc: 0.7045
Epoch 7/10
469/469 [==============================] - 6s 12ms/step - loss: 0.2354 - acc: 0.9126 - val_loss: 3.2173 - val_acc: 0.7043
Epoch 8/10
469/469 [==============================] - 6s 13ms/step - loss: 0.2227 - acc: 0.9156 - val_loss: 3.6057 - val_acc: 0.7061
Epoch 9/10
469/469 [==============================] - 6s 12ms/step - loss: 0.2112 - acc: 0.9204 - val_loss: 3.5412 - val_acc: 0.7105
Epoch 10/10
469/469 [==============================] - 6s 12ms/step - loss: 0.1951 - acc: 0.9273 - val_loss: 3.6238 - val_acc: 0.7069
79/79 [==============================] - 0s 4ms/step - loss: 3.6238 - acc: 0.7069
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add data/images.tar.gz.dvc metrics.json dvc.lock logs.csv dvc.yaml src/train.py params.yaml data/images
Reproduced experiment(s): exp-54e7e
Experiment results have been applied to your workspace.
To promote an experiment to a Git branch run:
dvc exp branch <exp> <branch>
$ dvc exp show
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Experiment ┃ Created ┃ loss ┃ acc ┃ train.epochs ┃ model.conv_units ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace │ - │ 3.6238 │ 0.7069 │ 10 │ 16 │
│ baseline-experiment │ Sep 06, 2021 │ 0.23657 │ 0.9127 │ 10 │ 16 │
│ └── 86616cb [exp-54e7e] │ 10:53 AM │ 3.6238 │ 0.7069 │ 10 │ 16 │
└─────────────────────────┴──────────────┴─────────┴────────┴──────────────┴──────────────────┘
I couldn't reproduce the problem, but there is another, major issue: model.conv_units from params.yaml is not passed to the get_model function, meaning it always returns the default model with 32 conv units. This wasn't the case when I was testing the repo. I remember because when I ran dvc exp run -S model.conv_units=256, the experiment was noticeably slower.
Looking at the history, train.py seems not to have changed since I created it in July, which is a bit weird too. I'll try to understand the reason and revise the files.
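One way to confirm which conv_units value a run actually used is to recompute the parameter count and compare it with the printed model summary. The sketch below is an illustration, not code from the repository: the 3x3 kernel and 2x2 pooling are inferred from the output shapes in the summaries in this thread, and expected_total_params is a hypothetical helper name.

```python
# Sketch: recompute the Keras parameter counts for the architecture shown in
# the model summaries, to check which conv_units value a run really used.
# Assumptions (inferred from the printed shapes, not read from train.py):
# a 3x3 Conv2D kernel on a 28x28x1 input, 2x2 max pooling, Dense(128), Dense(10).

def expected_total_params(conv_units: int) -> int:
    conv = (3 * 3 * 1 + 1) * conv_units   # kernel weights plus one bias per filter
    flat = 13 * 13 * conv_units           # 28 -> 26 after the conv, 26 -> 13 after pooling
    dense = (flat + 1) * 128              # weights and biases of the hidden layer
    out = (128 + 1) * 10                  # weights and biases of the output layer
    return conv + dense + out


if __name__ == "__main__":
    for units in (16, 32, 256):
        print(units, expected_total_params(units))
```

Under these assumptions the function gives 693,962 for 32 units and 5,541,770 for 256 units, matching the two summaries in this thread. For 16 units it gives 347,690, so the first run, listed with model.conv_units 16 in dvc exp show, was in fact trained with 32 units.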
I'm fixing this and will take a look at it on M1 Mac when I test this. Thank you.
My recent results are like this:
The model is:
(Yours has 32 as the number of conv units.)
Could the problem be related to https://github.com/apple/tensorflow_macos/issues/55 and the following lines (90-93) in train.py?
training_labels = tf.keras.utils.to_categorical(
training_labels, num_classes=10, dtype="float32")
testing_labels = tf.keras.utils.to_categorical(
testing_labels, num_classes=10, dtype="float32")
Could you test with dtype="float16" for the time being? I can remove this categorical conversion and use another set of layers.
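For readers unfamiliar with the conversion in question, here is a minimal plain-Python illustration of what one-hot encoding does to the labels. It is a stand-in written for this thread, not the Keras implementation of to_categorical; one_hot and to_sparse are hypothetical helper names.

```python
# Illustration only: a plain-Python stand-in for the one-hot conversion,
# not the Keras implementation of to_categorical.

def one_hot(labels, num_classes=10):
    """Turn integer class labels into rows of 0.0/1.0 floats."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


def to_sparse(rows):
    """Recover the integer labels from one-hot rows (argmax of each row)."""
    return [row.index(max(row)) for row in rows]


if __name__ == "__main__":
    encoded = one_hot([3, 0, 9])
    print(encoded[0])
    print(to_sparse(encoded))
```

Since the integer labels can be recovered exactly from the one-hot rows, the conversion adds no information. That is why dropping it is an option in principle, for instance by training on integer labels with a sparse categorical cross-entropy loss, though whether that avoids the M1 problem is not established in this thread.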
I rebuilt the repository with updated train.py. While pushing the experiments, I'm getting the following error:
Will take a look into this tomorrow.
I worked on this a bit and installed TF and TF-Metal, but I got a set of exceptions:
Metal device set to: Apple M1
systemMemory: 16.00 GB
maxCacheSize: 5.33 GB
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
reshape (Reshape) (None, 28, 28, 1) 0
_________________________________________________________________
conv2d (Conv2D) (None, 26, 26, 256) 2560
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 256) 0
_________________________________________________________________
dropout (Dropout) (None, 13, 13, 256) 0
_________________________________________________________________
flatten (Flatten) (None, 43264) 0
_________________________________________________________________
dense (Dense) (None, 128) 5537920
_________________________________________________________________
dense_1 (Dense) (None, 10) 1290
=================================================================
Total params: 5,541,770
Trainable params: 5,541,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
2021-11-01 19:18:51.582 python3[3747:33381] -[MPSGraph adamUpdateWithLearningRateTensor:beta1Tensor:beta2Tensor:epsilonTensor:beta1PowerTensor:beta2PowerTensor:valuesTensor:momentumTensor:velocityTensor:maximumVelocityTensor:gradientTensor:name:]: unrecognized selector sent to instance 0x16bf4a520
2021-11-01 19:18:51.603 python3[3747:33381] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[MPSGraph adamUpdateWithLearningRateTensor:beta1Tensor:beta2Tensor:epsilonTensor:beta1PowerTensor:beta2PowerTensor:valuesTensor:momentumTensor:velocityTensor:maximumVelocityTensor:gradientTensor:name:]: unrecognized selector sent to instance 0x16bf4a520'
*** First throw call stack:
(
0 CoreFoundation 0x000000018e66b838 __exceptionPreprocess + 240
1 libobjc.A.dylib 0x000000018e3950a8 objc_exception_throw + 60
2 CoreFoundation 0x000000018e6fc694 -[NSObject(NSObject) __retain_OA] + 0
3 CoreFoundation 0x000000018e5cccd4 ___forwarding___ + 1444
4 CoreFoundation 0x000000018e5cc670 _CF_forwarding_prep_0 + 96
5 libmetal_plugin.dylib 0x0000000155f5a290 _ZN12metal_plugin14MPSApplyAdamOpIfEC2EPNS_20OpKernelConstructionE + 656
6 libmetal_plugin.dylib 0x0000000155f59ebc _ZN12metal_pluginL14CreateOpKernelINS_14MPSApplyAdamOpIfEEEEPvP23TF_OpKernelConstruction + 52
7 libtensorflow_framework.2.dylib 0x00000001300f85d4 _ZN10tensorflow12_GLOBAL__N_120KernelBuilderFactory6CreateEPNS_20OpKernelConstructionE + 88
8 libtensorflow_framework.2.dylib 0x000000013017a158 _ZN10tensorflow14CreateOpKernelENS_10DeviceTypeEPNS_10DeviceBaseEPNS_9AllocatorEPNS_22FunctionLibraryRuntimeEPNS_11ResourceMgrERKNSt3__110shared_ptrIKNS_14NodePropertiesEEEiPPNS_8OpKernelE + 784
9 libtensorflow_framework.2.dylib 0x00000001303552b8 _ZN10tensorflow21CreateNonCachedKernelEPNS_6DeviceEPNS_22FunctionLibraryRuntimeERKNSt3__110shared_ptrIKNS_14NodePropertiesEEEiPPNS_8OpKernelE + 272
10 libtensorflow_framework.2.dylib 0x00000001302ffc20 _ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNSt3__110shared_ptrIKNS_14NodePropertiesEEEPNS_22FunctionLibraryRuntimeEPPNS_8OpKernelE + 600
11 libtensorflow_framework.2.dylib 0x000000013036a430 _ZN10tensorflow22ImmutableExecutorState10InitializeERKNS_5GraphE + 1192
12 libtensorflow_framework.2.dylib 0x0000000130355064 _ZN10tensorflow16NewLocalExecutorERKNS_19LocalExecutorParamsERKNS_5GraphEPPNS_8ExecutorE + 304
13 libtensorflow_framework.2.dylib 0x0000000130362e6c _ZN10tensorflow12_GLOBAL__N_124DefaultExecutorRegistrar7Factory11NewExecutorERKNS_19LocalExecutorParamsERKNS_5GraphEPNSt3__110unique_ptrINS_8ExecutorENS9_14default_deleteISB_EEEE + 48
14 libtensorflow_framework.2.dylib 0x00000001303637e8 _ZN10tensorflow11NewExecutorERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEERKNS_19LocalExecutorParamsERKNS_5GraphEPNS0_10unique_ptrINS_8ExecutorENS0_14default_deleteISG_EEEE + 92
15 libtensorflow_framework.2.dylib 0x0000000130302278 _ZN10tensorflow26FunctionLibraryRuntimeImpl10CreateItemEPPNS0_4ItemE + 2676
16 libtensorflow_framework.2.dylib 0x000000013030306c _ZN10tensorflow26FunctionLibraryRuntimeImpl3RunERKNS_22FunctionLibraryRuntime7OptionsEyN4absl12lts_202103244SpanIKNS_6TensorEEEPNSt3__16vectorIS8_NSB_9allocatorIS8_EEEENSB_8functionIFvRKNS_6StatusEEEE + 676
17 libtensorflow_framework.2.dylib 0x00000001303110c0 _ZNK10tensorflow29ProcessFunctionLibraryRuntime14RunMultiDeviceERKNS_22FunctionLibraryRuntime7OptionsEyPNSt3__16vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEENS5_9allocatorISC_EEEEPNS6_INS5_10unique_ptrINS0_11CleanUpItemENS5_14default_deleteISI_EEEENSD_ISL_EEEENS5_8functionIFvRKNS_6StatusEEEENSP_IFSQ_RKNS0_21ComponentFunctionDataEPNS0_12InternalArgsEEEE + 2640
18 libtensorflow_framework.2.dylib 0x0000000130314098 _ZNK10tensorflow29ProcessFunctionLibraryRuntime3RunERKNS_22FunctionLibraryRuntime7OptionsEyN4absl12lts_202103244SpanIKNS_6TensorEEEPNSt3__16vectorIS8_NSB_9allocatorIS8_EEEENSB_8functionIFvRKNS_6StatusEEEE + 2012
19 libtensorflow_framework.2.dylib 0x0000000130314868 _ZNK10tensorflow29ProcessFunctionLibraryRuntime7RunSyncERKNS_22FunctionLibraryRuntime7OptionsEyN4absl12lts_202103244SpanIKNS_6TensorEEEPNSt3__16vectorIS8_NSB_9allocatorIS8_EEEE + 160
20 _pywrap_tensorflow_internal.so 0x000000011b71d554 _ZN10tensorflow19KernelAndDeviceFunc3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPNSt3__16vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEENS6_9allocatorISD_EEEEPNS_19CancellationManagerERKNS9_8optionalINS_25EagerRemoteFunctionParamsEEERKNSK_INS_17ManagedStackTraceEEE + 516
21 _pywrap_tensorflow_internal.so 0x000000011b6e7d60 _ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ENSt3__19allocatorIS6_EEEERKNS3_8optionalINS_25EagerRemoteFunctionParamsEEERKNS7_10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSD_INS_17ManagedStackTraceEEE + 372
22 _pywrap_tensorflow_internal.so 0x000000011b6ee3c4 _ZN10tensorflow11ExecuteNode3RunEv + 396
23 _pywrap_tensorflow_internal.so 0x000000011ba29764 _ZN10tensorflow13EagerExecutor11SyncExecuteEPNS_9EagerNodeE + 172
24 _pywrap_tensorflow_internal.so 0x000000011b6e789c _ZN10tensorflow12_GLOBAL__N_117EagerLocalExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi + 1976
25 _pywrap_tensorflow_internal.so 0x000000011b6e5a44 _ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi + 296
26 _pywrap_tensorflow_internal.so 0x000000011b34aba4 _ZN10tensorflow14EagerOperation7ExecuteEN4absl12lts_202103244SpanIPNS_20AbstractTensorHandleEEEPi + 192
27 _pywrap_tensorflow_internal.so 0x000000011b72392c _ZN10tensorflow21CustomDeviceOpHandler7ExecuteEPNS_27ImmediateExecutionOperationEPPNS_30ImmediateExecutionTensorHandleEPi + 468
28 _pywrap_tensorflow_internal.so 0x0000000117f6ff38 TFE_Execute + 80
29 _pywrap_tensorflow_internal.so 0x0000000117eecac0 _Z24TFE_Py_ExecuteCancelableP11TFE_ContextPKcS2_PN4absl12lts_2021032413InlinedVectorIP16TFE_TensorHandleLm4ENSt3__19allocatorIS7_EEEEP7_objectP23TFE_CancellationManagerPNS5_IS7_Lm2ESA_EEP9TF_Status + 616
30 _pywrap_tfe.so 0x000000013158e41c _ZN10tensorflow32TFE_Py_ExecuteCancelable_wrapperERKN8pybind116handleEPKcS5_S3_S3_PNS_19CancellationManagerES3_ + 160
31 _pywrap_tfe.so 0x00000001315bf208 _ZZN8pybind1112cpp_function10initializeIZL25pybind11_init__pywrap_tfeRNS_7module_EE4$_44NS_6objectEJRKNS_6handleEPKcSA_S8_S8_S8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE_8__invokeESR_ + 184
32 _pywrap_tfe.so 0x00000001315a10e0 _ZN8pybind1112cpp_function10dispatcherEP7_objectS2_S2_ + 3216
33 python3 0x00000001028d3214 cfunction_call + 80
34 python3 0x000000010287fa84 _PyObject_MakeTpCall + 340
35 python3 0x000000010298f60c call_function + 724
36 python3 0x000000010298bca4 _PyEval_EvalFrameDefault + 29268
37 python3 0x0000000102984408 _PyEval_EvalCode + 2968
38 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
39 python3 0x000000010298f574 call_function + 572
40 python3 0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
41 python3 0x0000000102984408 _PyEval_EvalCode + 2968
42 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
43 python3 0x000000010288358c method_vectorcall + 164
44 python3 0x000000010298f574 call_function + 572
45 python3 0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
46 python3 0x0000000102984408 _PyEval_EvalCode + 2968
47 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
48 python3 0x000000010288358c method_vectorcall + 164
49 python3 0x000000010298f574 call_function + 572
50 python3 0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
51 python3 0x0000000102984408 _PyEval_EvalCode + 2968
52 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
53 python3 0x000000010287fd04 _PyObject_FastCallDictTstate + 320
54 python3 0x0000000102880a7c _PyObject_Call_Prepend + 164
55 python3 0x00000001028f71c8 slot_tp_call + 376
56 python3 0x00000001028804d0 _PyObject_Call + 156
57 python3 0x000000010298bfd8 _PyEval_EvalFrameDefault + 30088
58 python3 0x0000000102984408 _PyEval_EvalCode + 2968
59 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
60 python3 0x00000001028836ec method_vectorcall + 516
61 python3 0x000000010298bfd8 _PyEval_EvalFrameDefault + 30088
62 python3 0x0000000102984408 _PyEval_EvalCode + 2968
63 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
64 python3 0x000000010287fd04 _PyObject_FastCallDictTstate + 320
65 python3 0x0000000102880a7c _PyObject_Call_Prepend + 164
66 python3 0x00000001028f71c8 slot_tp_call + 376
67 python3 0x000000010287fa84 _PyObject_MakeTpCall + 340
68 python3 0x000000010298f60c call_function + 724
69 python3 0x000000010298bca4 _PyEval_EvalFrameDefault + 29268
70 python3 0x0000000102984408 _PyEval_EvalCode + 2968
71 python3 0x0000000102880700 _PyFunction_Vectorcall + 240
72 python3 0x000000010288358c method_vectorcall + 164
73 python3 0x000000010298f574 call_function + 572
74 python3 0x000000010298bda0 _PyEval_EvalFrameDefault + 29520
75 python3 0x0000000102880780 function_code_fastcall + 116
76 python3 0x000000010298f574 call_function + 572
77 python3 0x000000010298bd24 _PyEval_EvalFrameDefault + 29396
78 python3 0x0000000102984408 _PyEval_EvalCode + 2968
79 python3 0x00000001029e77dc pyrun_file + 376
80 python3 0x00000001029e6cf0 PyRun_SimpleFileExFlags + 816
81 python3 0x0000000102a09eb0 Py_RunMain + 2916
82 python3 0x0000000102a0b044 pymain_main + 1272
83 python3 0x00000001028266d0 main + 56
84 libdyld.dylib 0x000000018e50d430 start + 4
)
libc++abi: terminating with uncaught exception of type NSException
[1] 3747 abort python3 src/train.py
It looks like I need to upgrade my system: https://github.com/tensorflow/tensorflow/issues/50196. I'll take a look after the upgrade.
Yes, just tested out with 12.0.1 and it works as expected now! Thanks for the research.
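A quick way to check the macOS version before training is sketched below. The threshold of 12 is an assumption taken only from the report above that 12.0.1 works; it is not an official requirement of the Metal plugin, and the helper names are hypothetical.

```python
import platform

# Assumption: 12 is used as the threshold only because 12.0.1 is the version
# reported to work in this thread; it is not an official requirement.
MIN_MACOS_MAJOR = 12


def macos_major(version: str) -> int:
    """Return the major component of a version string such as '12.0.1'."""
    head = version.split(".")[0]
    return int(head) if head.isdigit() else 0


def metal_plugin_likely_ok(version: str) -> bool:
    """True if the version is at least the one reported to work here."""
    return macos_major(version) >= MIN_MACOS_MAJOR


if __name__ == "__main__":
    current = platform.mac_ver()[0]  # empty string on non-macOS systems
    if not current:
        print("Not running on macOS")
    elif metal_plugin_likely_ok(current):
        print(f"macOS {current}: at least the version reported to work here")
    else:
        print(f"macOS {current}: older than 12, consider upgrading first")
```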
Although I was able to upgrade my system finally, I still couldn't install tensorflow on macOS, getting some weird errors like:
ERROR: Cannot install tensorflow-macos==2.5.0 and tensorflow-macos==2.6.0 because these package versions have conflicting dependencies.
The conflict is caused by:
tensorflow-macos 2.6.0 depends on h5py~=3.1.0
tensorflow-macos 2.5.0 depends on h5py~=3.1.0
I'll close this issue if it works for you as expected, @dberenbaum.
It works for me, although I remember having a lot of trouble getting tensorflow installed on macos. Maybe there's some existing tensorflow macos installation that's causing the conflict? Are you following https://developer.apple.com/metal/tensorflow-plugin/?
Yes, but mine is an older M1 installation, and probably I made some customization along the way that now makes it difficult to install. I'm closing this. Thank you.
When trying example-dvc-experiments, I'm getting much worse results than what I see for the baseline or for the other experiments in the docs:
@iesahin Any idea what I'm doing wrong? I am on an M1 Mac, but I doubt it could make that big a difference.