intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

createTorchModel. Trace: java.lang.NegativeArraySizeException for Orca PyTorch bigdl backend #274

Open Elena-Qiu opened 3 years ago

Elena-Qiu commented 3 years ago

Running the Orca learn sentiment example with the bigdl backend produces the following error:

Traceback (most recent call last):
  File "/home/usr/repo/analytics-zoo/pyzoo/zoo/examples/orca/learn/pytorch/sentiment/main.py", line 203, in <module>
    backend="bigdl")
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/orca/learn/pytorch/estimator.py", line 101, in from_torch
    bigdl_type="float")
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/orca/learn/pytorch/estimator.py", line 284, in __init__
    self.model = TorchModel.from_pytorch(model)
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/pipeline/api/torch/torch_model.py", line 72, in from_pytorch
    "float", "createTorchModel", bys.getvalue(), weights)
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/common/utils.py", line 135, in callZooFunc
    raise e
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/common/utils.py", line 129, in callZooFunc
    java_result = api(*args)
  File "/home/usr/downloads/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/usr/downloads/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o41.createTorchModel. Trace:
java.lang.NegativeArraySizeException
    at py4j.Base64.decode(Base64.java:321)
    at py4j.Protocol.getBytes(Protocol.java:175)
    at py4j.Protocol.getObject(Protocol.java:296)
    at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:82)
    at py4j.commands.CallCommand.execute(CallCommand.java:77)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

The bigdl backend is implemented as follows:

train_iter = train_loader_creator({}, batch_size)
test_iter = test_loader_creator({}, batch_size)
if args.backend == "bigdl":
    net = model_creator({})
    optimizer = optim_creator(model=net,config={})
    orca_estimator = Estimator.from_torch(model=net,
                                          optimizer=optimizer,
                                          loss=criterion,
                                          metrics=[Accuracy()],
                                          backend="bigdl")
    orca_estimator.fit(data=train_iter, epochs=2, validation_data=test_iter,
                       checkpoint_trigger=EveryEpoch())
    res = orca_estimator.evaluate(data=test_iter)
    print("Accuracy of the network on the test images: %s" % res)

hkvision commented 3 years ago

The code is here: https://github.com/intel-analytics/analytics-zoo/pull/3918 @qiuxin2012 Do we support an iterator as data?

qiuxin2012 commented 3 years ago

Why not use a dataloader? An iterator is not supported.

Elena-Qiu commented 3 years ago

I changed the iterator to a dataloader, and it runs successfully with two workers using the torch_distributed backend. The bigdl backend still fails with the same error as above.

hkvision commented 3 years ago

@qiuxin2012 Take a look.

qiuxin2012 commented 3 years ago

Pass something like train_loader_creator rather than train_iter. train_iter is too big to be pickled.

Elena-Qiu commented 3 years ago

I tried train_loader_creator and test_loader_creator but still got the same error. From the traceback, the error occurs when running "self.model = TorchModel.from_pytorch(model)" and calling "o41.createTorchModel". Could it be related to how the model is created? @qiuxin2012

qiuxin2012 commented 3 years ago

It looks like your model is too big to be pickled. Please use a model creator function instead of a model instance.

qiuxin2012 commented 3 years ago

https://github.com/intel-analytics/analytics-zoo/blob/ee24ffcc17458490da1a42bc6ad6e5f881d41106/pyzoo/zoo/orca/learn/pytorch/estimator.py#L275 The estimator shouldn't create the model instance here. We should pickle a model creator function to bytes and pass it to the executor (pickle only supports 155MB of data). @Le-Zheng
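For reference, a minimal sketch for checking how large an object becomes once it is pickled and Base64-encoded, which appears to be the form in which Py4J receives byte arrays, judging by the `py4j.Base64.decode` frame in the trace. The helper names and the stand-in object are hypothetical, and the 155MB figure is simply the number quoted above, not a verified limit.

```python
import base64
import pickle

# Figure quoted in this thread; an assumption here, not a verified limit.
QUOTED_LIMIT_MB = 155


def payload_sizes(obj):
    """Return (pickled_bytes, base64_chars) for obj."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    # Base64 turns every 3 bytes into 4 characters, so the encoded
    # payload is roughly a third larger than the pickled bytes.
    encoded = base64.b64encode(raw)
    return len(raw), len(encoded)


def within_quoted_limit(obj, limit_mb=QUOTED_LIMIT_MB):
    """Check the pickled size against the limit quoted in the thread."""
    raw_len, _ = payload_sizes(obj)
    return raw_len <= limit_mb * 1024 * 1024


if __name__ == "__main__":
    # Hypothetical stand-in for model weights: a 3 MB byte buffer.
    fake_weights = bytes(3_000_000)
    raw_len, enc_len = payload_sizes(fake_weights)
    print("pickled bytes:", raw_len)
    print("base64 chars:", enc_len)
    print("within quoted limit:", within_quoted_limit(fake_weights))
```

Running the same check on the real model (for example on its serialized bytes) before calling the estimator would show whether payload size is a plausible cause.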

hkvision commented 3 years ago

Next step:

Elena-Qiu commented 3 years ago

I tried the latest estimator.py but still got the same error as before. The latest estimator.py contains:

if isinstance(model, types.FunctionType):
    def model_creator(self):
        return model(self.config)
    model = model_creator(self)
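
One possible reading of the snippet above is that the creator function is called immediately, so what is serialized afterwards is still the full model instance. The sketch below illustrates only that effect; `resolve_model` and `make_big_model` are hypothetical stand-ins, a dict of bytes replaces the PyTorch model, and it assumes the resolved instance is what later reaches `TorchModel.from_pytorch`.

```python
import pickle
import types


def resolve_model(model, config):
    """Mirror the quoted branch: a function argument is called right away."""
    if isinstance(model, types.FunctionType):
        return model(config)
    return model


def make_big_model(config):
    # Hypothetical stand-in for a PyTorch model: a dict holding a large buffer.
    return {"embedding": bytes(config["num_bytes"])}


# Passing the creator does not shrink the payload if it is invoked
# before serialization: the pickled object is the instance itself.
resolved = resolve_model(make_big_model, {"num_bytes": 2_000_000})
payload = pickle.dumps(resolved)
print(type(resolved).__name__, len(payload))
```

If this reading is right, accepting a creator function only helps when the function itself, not its result, is what gets shipped.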

hkvision commented 3 years ago

Take a look @qiuxin2012 @Le-Zheng