segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Porting to Android #26

Closed UncleLiaoNN closed 3 years ago

UncleLiaoNN commented 3 years ago

Hi, I am trying to run the ONNX model on Android and have started with the steps described here: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

import onnx
import caffe2.python.onnx.backend
from onnx import helper

# Load the ONNX GraphProto object. Graph is a standard Python protobuf object
model = onnx.load("model.onnx")

Unfortunately I receive an error:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-8-0e15f43f99e0> in <module>()
      1 # Load the ONNX GraphProto object. Graph is a standard Python protobuf object
----> 2 model = onnx.load("model.onnx")
      3 

2 frames
/usr/local/lib/python3.6/dist-packages/onnx/__init__.py in _deserialize(s, proto)
     95                          '\ntype is {}'.format(type(proto)))
     96 
---> 97     decoded = cast(Optional[int], proto.ParseFromString(s))
     98     if decoded is not None and decoded != len(s):
     99         raise google.protobuf.message.DecodeError(

DecodeError: Error parsing message

Could you please advise what the issue could be? I use the EN model and Google Colab.

bminixhofer commented 3 years ago

Which version of the onnx package are you using? I just tried this locally:

(py38) bminixhofer@pop-os:~/Documents/Projects/nnsplit/models/en$ python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnx; onnx.__version__
'1.8.1'
>>> model = onnx.load("model.onnx")
>>> 

You could also try redownloading the model, maybe it is corrupted somehow:

(py38) bminixhofer@pop-os:~/Documents/Projects/nnsplit/models/en$ shasum model.onnx 
c77caf94f64ff91702aa374b8a0a83befef56f95  model.onnx
UncleLiaoNN commented 3 years ago

Thanks for the support, yes, it looks like something went wrong with the model. I was able to load it, but the next steps still seem to be tricky:

# Prepare the caffe2 backend for executing the model. This converts the ONNX graph into a
# Caffe2 NetDef that can execute it. Other ONNX backends, like one for CNTK, will be
# available soon.
prepared_backend = caffe2.python.onnx.backend.prepare(model)

gives an error: IndexError: Input embedding.weight is undefined!

And I have tried: onnx-tf convert -i "model.onnx" -o "model.pb" and it fails with the error: ValueError: Input size (depth of inputs) must be accessible via shape inference, but saw value None.

bminixhofer commented 3 years ago

The problem here seems to be that onnx-tf does not support dynamic dimensions. If you visualize the ONNX model (https://netron.app/ is excellent!) you'll see that the input has dimensions batch x length, so they do not have fixed values.
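
You can also see this without a visualizer by inspecting the model's inputs with the onnx package; a quick sketch:

import onnx

model = onnx.load("model.onnx")
for inp in model.graph.input:
    # Dynamic dimensions show up as named dim_param entries instead of fixed dim_value numbers
    dims = [d.dim_param if d.dim_param else d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)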

Someone else had the same issue: https://github.com/onnx/onnx-tensorflow/issues/593

Converting ONNX to another format is problematic in general. Are you sure you cannot execute the ONNX model directly?

UncleLiaoNN commented 3 years ago

Well, I just followed the "official tutorial" from ONNX: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

I thought it should be the right way, since the ONNX maintainers recommend it.

bminixhofer commented 3 years ago

Right, in that case converting to caffe2 might work. Converting with onnx-tf will not work.

Unfortunately I can't help that much with this problem. The ONNX model is well-formed (e. g. works in both tract and onnxruntime) so it is not a problem with NNSplit itself. It is exported here:

https://github.com/bminixhofer/nnsplit/blob/27705186bcdc340796b5a44d382e7d787aab69de/train/model.py#L143-L175

A quick Google search for "IndexError: Input embedding.weight is undefined!" comes up with a couple of related issues; maybe they can help solve your problem.

UncleLiaoNN commented 3 years ago

I have added a couple of workarounds from https://github.com/onnx/onnx/issues/2902

this one: https://github.com/onnx/onnx/issues/2902#issuecomment-662634157 and this one: https://github.com/onnx/onnx/issues/2660#issuecomment-605874784

Now I was able to call onnx_graph_to_caffe2_net successfully and get two .pb files. The next step is to check whether they are correct.
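
For reference, that conversion step from the tutorial looks roughly like the sketch below. Treat it as an outline rather than exact code: depending on the caffe2 version, onnx_graph_to_caffe2_net takes either the model or model.graph, and the output file names are just examples.

import onnx
from caffe2.python.onnx.backend import Caffe2Backend

model = onnx.load("model.onnx")
# Convert the ONNX graph into two Caffe2 NetDefs: init_net fills in the weights,
# predict_net runs the actual computation
init_net, predict_net = Caffe2Backend.onnx_graph_to_caffe2_net(model)

with open("init_net.pb", "wb") as f:
    f.write(init_net.SerializeToString())
with open("predict_net.pb", "wb") as f:
    f.write(predict_net.SerializeToString())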

bminixhofer commented 3 years ago

One thing you could try: if you can run JavaScript on Android (not sure how easy this is; I have no experience with Android apps), you could use the JS bindings for nnsplit: https://www.npmjs.com/package/nnsplit

bminixhofer commented 3 years ago

Now I was able to call onnx_graph_to_caffe2_net successfully

Great! Recall that the model takes UTF-8 encoded bytes as input and returns two numbers for each byte: first, whether there is a sentence boundary at this position, and second, whether there is a token boundary. Good luck.

UncleLiaoNN commented 3 years ago

@bminixhofer I was able to build https://github.com/microsoft/onnxruntime for Android and now I am able to load nnsplit onnx model. But for our case there is a showstopper issue: microsoft/onnxruntime#6261

uint8 is not supported as an input type at the Java level. Do you have an idea how to change the model to work around the issue?

For now I get an error: Error code - ORT_INVALID_ARGUMENT - message: Unexpected input data type. Actual: (tensor(int8)) , expected: (tensor(uint8))

bminixhofer commented 3 years ago

Modifying the ONNX model to change the input type is apparently not easy (https://github.com/onnx/onnx/issues/2738). You can either do that (somehow) or re-export the model. In that case you have to change this line to a supported dtype:

https://github.com/bminixhofer/nnsplit/blob/fef0a94807d9a3f8195673f0cbe1c247d89d9e42/train/model.py#L157

To re-export the model you'll have to either train a new one from scratch following the notebook, or load the weights from ONNX (https://github.com/onnx/onnx/issues/1425) and do some reshapes / transposes to get them into PyTorch format.
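
If you go the second route, the weight tensors can be read out of the ONNX file with onnx.numpy_helper; a minimal sketch (the initializer names and any reshapes/transposes you need depend on how the model was exported):

import onnx
from onnx import numpy_helper
import torch

onnx_model = onnx.load("model.onnx")
# graph.initializer holds every stored weight tensor by name
weights = {
    init.name: torch.from_numpy(numpy_helper.to_array(init).copy())
    for init in onnx_model.graph.initializer
}
print({name: tuple(w.shape) for name, w in weights.items()})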

UncleLiaoNN commented 3 years ago

@bminixhofer thanks for the advice! Is it possible for you to share slices.pkl and texts.txt for the supported languages? I think it could be useful for anyone who wants to check / repeat the training process.

bminixhofer commented 3 years ago

The train notebook contains a guide on how to obtain those. From the command line you can do e.g.

python text_data.py --dump_path ../data/ruwiki-20181001-corpus.xml --text_path ../data/ru.txt --slice_path ../data/ru.pkl

where the .xml file is a dump from linguatools.

UncleLiaoNN commented 3 years ago

@bminixhofer that's true. I just thought you had all of them on Google Drive.

BTW, I was able to try changing from torch.uint8 to torch.int8; it helped, but now I have an issue with some sort of ONNX reshape:

Non-zero status code returned while running Reshape node. Name:'Reshape_46' Status Message: reshape_helper.h:38 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape &, std::vector<int64_t> &) size != 0 && (input_shape.Size() % size) == 0 was false. The input tensor cannot be reshaped to the requested shape. Input shape:{1,8,4}, requested shape:{-1,17,2}

when I set the input as:

Map<String, OnnxTensor> container = new HashMap<>();
NodeInfo inputMeta = session.getInputInfo().values().iterator().next();
long[] inputShape = new long[2];
inputShape[0] = 1;    // batch size
inputShape[1] = 17;   // example size of the input string
// b holds the encoded input bytes (defined elsewhere); reshape to (1, 17) and wrap in a tensor
Object tensorData = OrtUtil.reshape(b, inputShape);
OnnxTensor tensor = OnnxTensor.createTensor(env, tensorData);
container.put(inputMeta.getName(), tensor);

bminixhofer commented 3 years ago

I just thought you had all of them on Google Drive

I don't have them, and I would not have enough space to upload all of them. Anyway, it's not important IMO because you can generate them quite easily.

[...] The input tensor cannot be reshaped to the requested shape. Input shape:{1,8,4}, requested shape:{-1,17,2}

Can you try an even input shape? The model downsamples the input by a factor of 2, runs it through the LSTM, then upsamples it again to make it faster. In NNSplit the input length is always padded with zeros to be even. There is also some fixed padding with zeros at the start and at the end (by default 5) so the network recognizes the start/end of the sequence: https://github.com/bminixhofer/nnsplit/blob/3edc0248f401dbaccc817a1ec5c62310addfc873/nnsplit/src/lib.rs#L309-L318
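
Putting that together, the input preparation could look roughly like this minimal sketch (function name is made up; it assumes the default padding of 5 and a batch of one, mirroring the Rust code linked above rather than being taken from it):

import numpy as np

def prepare_input(text, padding=5):
    # UTF-8 encode, add the fixed zero padding at the start and end,
    # then append one more zero if needed so the total length is even
    ids = [0] * padding + list(text.encode("utf-8")) + [0] * padding
    if len(ids) % 2 != 0:
        ids.append(0)
    # Use np.int8 instead if you patch the model's input type as discussed later in this thread
    return np.array([ids], dtype=np.uint8)  # shape: (batch=1, length)

print(prepare_input("Das ist ein Test. Das ist noch ein Test.").shape)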

UncleLiaoNN commented 3 years ago

@bminixhofer an even shape works. So I need to encode the string to UTF-8, add padding to the start and end, and make the length even?

bminixhofer commented 3 years ago

Great! Yes, exactly.

UncleLiaoNN commented 3 years ago

@bminixhofer could you please advise on the return values? How should I interpret the floats?

bminixhofer commented 3 years ago

It returns 2 values for each element in the input sequence. From above:

returns two numbers for each byte: first, whether there is a sentence boundary at this position, and second, whether there is a token boundary.

So you'll need some threshold (the default in nnsplit is 0.8) and consider values above it as a boundary. Then, e.g., these values:

a s d f g
0 0 1 0 1

would result in these splits: ["asd", "fg"]. And as I said, the first value indicates sentence boundaries and the second value token boundaries.
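
In code, the thresholding could look something like this sketch (operating on one of the two per-byte probability values; all names are made up for illustration):

def split_by_boundaries(text_bytes, probs, threshold=0.8):
    # probs[i] is the boundary probability for byte i
    pieces, start = [], 0
    for i, p in enumerate(probs):
        if p > threshold:
            pieces.append(text_bytes[start:i + 1].decode("utf-8", errors="ignore"))
            start = i + 1
    if start < len(text_bytes):
        pieces.append(text_bytes[start:].decode("utf-8", errors="ignore"))
    return pieces

print(split_by_boundaries(b"asdfg", [0, 0, 1, 0, 1]))  # -> ['asd', 'fg']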

UncleLiaoNN commented 3 years ago

For now most of the values are negative, between -10 and 0; do you think this is correct output in general?

(screenshot of the raw output values)

bminixhofer commented 3 years ago

Oh, you need a sigmoid on top of it.
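
For completeness, applying the sigmoid is a one-liner; a small sketch with made-up example values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([-9.2, -7.5, 2.1, -8.0, 1.4])  # hypothetical raw model outputs
probs = sigmoid(logits)  # now in (0, 1) and comparable to the 0.8 threshold
print(probs.round(3))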

UncleLiaoNN commented 3 years ago

@bminixhofer ok, done. thank you!

Now I am able to run it and get roughly correct sentence boundaries on Android. I will perform more testing later with a model trained for more than 1/2 of an epoch :)

Thanks for the great support!

bminixhofer commented 3 years ago

Great! You're welcome.

poor1017 commented 3 years ago

@UncleLiaoNN Hi, about the error

ValueError: Input size (depth of inputs) must be accessible via shape inference, but saw value None.

have you solved it?

I have the same problem, and my model consists of vgg and rnn.

UncleLiaoNN commented 3 years ago

@poor1017 no, I was not able to address this issue when running the "onnx-tf convert" command. But I was able to convert the model with the workarounds from this comment: https://github.com/bminixhofer/nnsplit/issues/26#issuecomment-772552820

poor1017 commented 3 years ago

@UncleLiaoNN Thank you for replying. Hmm... it looks like I have to split the model into vgg.pb and rnn.pb.

UncleLiaoNN commented 3 years ago

@poor1017 please try to follow the tutorial from the initial post: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

or try to use onnxruntime to run it without splitting :)

UncleLiaoNN commented 3 years ago

@bminixhofer Since the required change (uint8 -> int8) is very small, a few lines of code are enough to open the model and save it with another input type:

import onnx

onnx_model = onnx.load("en.onnx")
inputs = onnx_model.graph.input
for graph_input in inputs:
    # 3 = TensorProto.INT8 (the original model uses 2 = TensorProto.UINT8)
    graph_input.type.tensor_type.elem_type = 3
onnx.save(onnx_model, 'en_int8.onnx')

(actually it changes only one byte, from 0x02 to 0x03, at the end of the .onnx file)

It works like a charm! :)

3 is INT8 and 2 is UINT8, from: https://deeplearning4j.org/api/latest/onnx/Onnx.TensorProto.DataType.html
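
(The same values can also be checked directly from the onnx package instead of using magic numbers:)

from onnx import TensorProto
print(TensorProto.UINT8, TensorProto.INT8)  # prints: 2 3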

bminixhofer commented 3 years ago

Cool, thanks. That's good to know! I'm a bit surprised that it casts correctly (such that 0 <=> -127, 127 <=> 0, 255 <=> 128 etc.) for the embedding lookup, but if it works it works :)

UncleLiaoNN commented 3 years ago

@bminixhofer As far as I understand, the model itself is not bound to uint8? It only matters for the storage step?

bminixhofer commented 3 years ago

Well the uint8 input is fed directly into the embedding layer:

https://github.com/bminixhofer/nnsplit/blob/22ccdbe6e0a5a7befa8d8f232f6740a1002872b7/train/model.py#L50-L53

I would've expected an ONNX model with int8 input to error for e. g. x = -10 because there is no entry for that in the embedding, but it is plausible that it's shifted appropriately internally.