intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Error loading model using Net.load #48

Open emartinezs44 opened 2 years ago

emartinezs44 commented 2 years ago

I think it was reported two years ago, but this error:

java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at com.intel.analytics.bigdl.utils.serializer.ModuleLoader$.initTensorStorage(ModuleLoader.scala:122)
at com.intel.analytics.bigdl.utils.serializer.ModuleLoader$.loadFromFile(ModuleLoader.scala:59)
at com.intel.analytics.bigdl.nn.Module$.loadModule(Module.scala:61)
at com.intel.analytics.bigdl.optim.DistriOptimizerSpec$$anonfun$17.apply$mcV$sp(DistriOptimizerSpec.scala:500)
at com.intel.analytics.bigdl.optim.DistriOptimizerSpec$$anonfun$17.apply(DistriOptimizerSpec.scala:474)
at com.intel.analytics.bigdl.optim.DistriOptimizerSpec$$anonfun$17.apply(DistriOptimizerSpec.scala:474)

still happens when trying to load a model from a checkpoint. Is there any workaround for this?

I'm using

"com.intel.analytics.zoo" % "analytics-zoo-bigdl_0.12.2-spark_2.4.3" % "0.10.0"

jason-dai commented 2 years ago

@EmiCareOfCell44 Are you loading a BigDL model? Can you share an example so we can reproduce the issue?

emartinezs44 commented 2 years ago

The problem is related to how a KerasNet model is saved at every checkpoint. From what I see in the code, it still uses Java serialization, and the function in the Module class is even deprecated:

@deprecated("Java based serialization not recommended any more, please use loadModule instead", "0.3") def load[T: ClassTag](path : String) : AbstractModule[Activity, Activity, T] = { File.loadAbstractModule[Activity, Activity, T] }

The optimizer uses this function to store the model:

@deprecated("please use recommended saveModule(path, overWrite)", "0.3.0") def save(path : String, overWrite: Boolean = false) : this.type = { this.clearState() File.save(this, path, overWrite) this }

So, to resume the training you need to load the checkpoint and cast it to a KerasNet[T] instance (see the sketch below), although I have problems when the batch size is greater than one per core. The optimizer should be changed to create the checkpoint with saveModule, which uses protobuf and more advanced features.
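For reference, a minimal sketch of that cast workaround. The package paths are assumed from the analytics-zoo source layout, and the checkpoint path is a placeholder:

import com.intel.analytics.bigdl.nn.Module
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet

// Placeholder path: the optimizer writes Java-serialized snapshots
// such as <checkpointDir>/model.<iteration>.
val checkpointPath = "/tmp/checkpoints/model.1000"

// The deprecated Java-serialization loader matches the format in
// which the optimizer's checkpoints are written.
val loaded = Module.load[Float](checkpointPath)

// Cast back to KerasNet to resume Keras-style training.
val kerasModel = loaded.asInstanceOf[KerasNet[Float]]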

qiuxin2012 commented 2 years ago

@EmiCareOfCell44 We have two pairs of save/load functions. One pair is the deprecated save/load you mentioned, which uses Java serialization. Its disadvantage is that it may not work after you upgrade the BigDL or JDK version; its advantage is that it saves everything and is easy to use, and since your environment won't change during a training run, we still use it as the checkpoint method. The other pair is saveModule/loadModule, which uses protobuf, and the saved model can be used across different BigDL versions.

So if you want to resume your training from an optimizer's checkpoint, you should use the deprecated load method to load it. The error message you attached shows you are using the loadModule method:

java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at com.intel.analytics.bigdl.utils.serializer.ModuleLoader$.initTensorStorage(ModuleLoader.scala:122)
at com.intel.analytics.bigdl.utils.serializer.ModuleLoader$.loadFromFile(ModuleLoader.scala:59)
at com.intel.analytics.bigdl.nn.Module$.loadModule(Module.scala:61)
at com.intel.analytics.bigdl.optim.DistriOptimizerSpec$$anonfun$17.apply$mcV$sp(DistriOptimizerSpec.scala:500)
at com.intel.analytics.bigdl.optim.DistriOptimizerSpec$$anonfun$17.apply(DistriOptimizerSpec.scala:474)
at com.intel.analytics.bigdl.optim.DistriOptimizerSpec$$anonfun$17.apply(DistriOptimizerSpec.scala:474)
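To illustrate the distinction, a minimal sketch of the two pairs; the toy model and paths are placeholders:

import com.intel.analytics.bigdl.nn.{Linear, Module, Sequential}
import com.intel.analytics.bigdl.numeric.NumericFloat

// Placeholder model, just to have something to save.
val model = Sequential[Float]().add(Linear[Float](4, 2))

// Deprecated Java-serialization pair: the format the optimizer's
// checkpoints use; tied to the BigDL/JDK versions that wrote it.
model.save("/tmp/model.bin", overWrite = true)
val fromCheckpoint = Module.load[Float]("/tmp/model.bin")

// Protobuf pair: portable across BigDL versions, but it cannot read
// the Java-serialized file above (hence the NullPointerException).
model.saveModule("/tmp/model.pb", overWrite = true)
val portable = Module.loadModule[Float]("/tmp/model.pb")
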
emartinezs44 commented 2 years ago

There is a problem with Java serialization related to the model's size: the saved file is twice as big. Besides, if you want to resume a KerasNet model from a checkpoint, you need to use the deprecated load method and then cast the resulting Module[T] instance to KerasNet. Because Spark-based optimization processes can run on very noisy clusters, being able to resume a training process from a checkpoint is very useful. I also had reflection problems related to the MultiShape class; I will share the error. But what is the problem with switching the checkpoint generation to the protobuf-based method?

qiuxin2012 commented 2 years ago

No problem, we can change the checkpoint from save to saveModule. I will add this to our plan; you can track intel-analytics/analytics-zoo-internal#37 for details.
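A hedged sketch of what that switch could look like. The writeCheckpoint helper and the snapshot naming below are hypothetical, not the actual DistriOptimizer code:

import com.intel.analytics.bigdl.Module

// Hypothetical checkpoint helper sketching the proposed change:
// write snapshots with the protobuf-based saveModule instead of
// the deprecated Java-serialization save.
def writeCheckpoint(model: Module[Float], checkpointDir: String, iteration: Int): Unit = {
  val path = s"$checkpointDir/model.$iteration"
  // Before: model.save(path, overWrite = true)  // Java serialization
  model.saveModule(path, overWrite = true)       // protobuf, version-portable
}

With this change, a checkpoint written during training could be reloaded with loadModule even after a BigDL or JDK upgrade.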