createTorchModel. Trace: java.lang.NegativeArraySizeException for Orca PyTorch bigdl backend

Elena-Qiu commented 3 years ago

Running the Orca learn sentiment example with the bigdl backend fails with the following error:

Traceback (most recent call last):
  File "/home/usr/repo/analytics-zoo/pyzoo/zoo/examples/orca/learn/pytorch/sentiment/main.py", line 203, in <module>
    backend="bigdl")
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/orca/learn/pytorch/estimator.py", line 101, in from_torch
    bigdl_type="float")
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/orca/learn/pytorch/estimator.py", line 284, in __init__
    self.model = TorchModel.from_pytorch(model)
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/pipeline/api/torch/torch_model.py", line 72, in from_pytorch
    "float", "createTorchModel", bys.getvalue(), weights)
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/common/utils.py", line 135, in callZooFunc
    raise e
  File "/home/usr/repo/analytics-zoo-env/pyzoo/zoo/common/utils.py", line 129, in callZooFunc
    java_result = api(*args)
  File "/home/usr/downloads/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/usr/downloads/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o41.createTorchModel. Trace:
java.lang.NegativeArraySizeException
    at py4j.Base64.decode(Base64.java:321)
    at py4j.Protocol.getBytes(Protocol.java:175)
    at py4j.Protocol.getObject(Protocol.java:296)
    at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:82)
    at py4j.commands.CallCommand.execute(CallCommand.java:77)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

The bigdl backend is implemented as follows:

train_iter = train_loader_creator({}, batch_size)
test_iter = test_loader_creator({}, batch_size)
if args.backend == "bigdl":
    net = model_creator({})
    optimizer = optim_creator(model=net,config={})
    orca_estimator = Estimator.from_torch(model=net,
                                          optimizer=optimizer,
                                          loss=criterion,
                                          metrics=[Accuracy()],
                                          backend="bigdl")
    orca_estimator.fit(data=train_iter, epochs=2, validation_data=test_iter,
                       checkpoint_trigger=EveryEpoch())
    res = orca_estimator.evaluate(data=test_iter)
    print("Accuracy of the network on the test images: %s" % res)

hkvision commented 3 years ago

The code is here: https://github.com/intel-analytics/analytics-zoo/pull/3918 @qiuxin2012 Do we support an iterator as data?

qiuxin2012 commented 3 years ago

Why not use a DataLoader? An iterator is not supported.

Elena-Qiu commented 3 years ago

I have changed the iterator to a DataLoader, and it runs successfully with two workers using the torch_distributed backend. But the bigdl backend still gets the same error as above.

hkvision commented 3 years ago

@qiuxin2012 Take a look.

qiuxin2012 commented 3 years ago

Pass something like train_loader_creator rather than train_iter. train_iter is too big to be pickled.

Elena-Qiu commented 3 years ago

I tried train_loader_creator and test_loader_creator but still got the same error. From the traceback, the error seems to occur when running "self.model = TorchModel.from_pytorch(model)" and calling "o41.createTorchModel". Maybe it has something to do with how the model is created? @qiuxin2012

qiuxin2012 commented 3 years ago

It looks like your model is too big to be pickled. Please pass a model creator function instead of a model instance.

qiuxin2012 commented 3 years ago

https://github.com/intel-analytics/analytics-zoo/blob/ee24ffcc17458490da1a42bc6ad6e5f881d41106/pyzoo/zoo/orca/learn/pytorch/estimator.py#L275 The estimator shouldn't create the model instance here. Instead, we should pickle a model creator function to bytes and pass it to the executor. (pickle only supports 155MB of data.) @Le-Zheng
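
A rough sketch of the size difference, using only the standard library. The dict of zeroed byte buffers is a hypothetical stand-in for a PyTorch model, and the config keys (vocab_size, embed_dim) are made up for illustration. The proposal above is to serialize the creator function itself; shipping the config here is just a simple way to show the same effect.

```python
import pickle


def model_creator(config):
    """Build the (stand-in) model from a small config; weights are created here."""
    n_weights = config["vocab_size"] * config["embed_dim"]
    # 4 bytes per float32 weight; a real model would hold tensors instead.
    return {"embedding": bytes(4 * n_weights)}


config = {"vocab_size": 50_000, "embed_dim": 100}

# Option A: build the model on the driver and serialize the instance.
instance_payload = pickle.dumps(model_creator(config))

# Option B: serialize only what is needed to rebuild it on the other side.
recipe_payload = pickle.dumps(config)

print(f"instance: {len(instance_payload):,} bytes")
print(f"recipe:   {len(recipe_payload):,} bytes")

# The receiver rebuilds an equivalent model from the small payload.
rebuilt = model_creator(pickle.loads(recipe_payload))
```

The instance payload grows with the number of weights, while the recipe payload stays roughly constant no matter how large the model is.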

hkvision commented 3 years ago

Next step:

Fix the model-instance creation bug mentioned above. After the fix, test with this example to see whether the config is too large as well.
Add some checks and raise a clearer error message if possible.
Is there any improvement that would support larger sizes?
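
One possible shape for the size check, as a minimal sketch rather than the actual fix. The helper name pickle_with_size_check is hypothetical, and the default limit simply reuses the 155MB figure quoted earlier in this thread, which has not been independently verified.

```python
import pickle

# Figure quoted in this thread; not independently verified.
ASSUMED_LIMIT_BYTES = 155 * 1024 * 1024


def pickle_with_size_check(obj, name, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Pickle obj, raising a descriptive error if the payload exceeds limit_bytes."""
    payload = pickle.dumps(obj)
    if len(payload) > limit_bytes:
        raise ValueError(
            f"{name} pickles to {len(payload):,} bytes, which exceeds the "
            f"{limit_bytes:,}-byte limit; pass a creator function instead of "
            f"an instance."
        )
    return payload


# Usage with a deliberately tiny limit so the check trips quickly.
small = pickle_with_size_check({"lr": 0.01}, "config", limit_bytes=1024)
try:
    pickle_with_size_check(bytes(4096), "model", limit_bytes=1024)
except ValueError as err:
    message = str(err)
    print(message)
```

Failing on the Python side before the payload crosses the py4j gateway would replace the opaque NegativeArraySizeException with a message that names the oversized argument.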

Elena-Qiu commented 3 years ago

I tried the latest estimator.py but still got the same error as before. The latest estimator.py contains:

if isinstance(model, types.FunctionType):
    def model_creator(self):
        return model(self.config)
    model = model_creator(self)

hkvision commented 3 years ago

Take a look @qiuxin2012 @Le-Zheng

createTorchModel. Trace: java.lang.NegativeArraySizeException for Orca PyTorch bigdl backend #274