oliverguhr / german-sentiment-lib

An easy-to-use Python package for deep-learning-based German sentiment classification.
https://pypi.org/project/germansentiment/
MIT License

Converting into an ONNX model #13

Open · sehHeiden opened 1 year ago

sehHeiden commented 1 year ago

Could I add an ONNX export version?

My current attempt is:

import json

import torch
import germansentiment

# Initialize the model (wraps the Hugging Face model and tokenizer)
model = germansentiment.SentimentModel()

# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)

# Export to ONNX (traced with a fixed (1, 512) input shape)
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")

# Export the vocab (the tokenizer's token -> id mapping)
with open("vocab.json", "w") as f:
    json.dump(model.tokenizer.vocab, f)
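
Before moving to Elixir, the exported file can be sanity-checked from Python with onnxruntime; a minimal sketch, assuming onnxruntime is installed and that the export has a single input (the input ids, as traced above):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("german_sentiment_model.onnx")
input_name = session.get_inputs()[0].name

# The model was traced with a fixed (1, 512) int64 input.
dummy = np.random.randint(0, 30_000, size=(1, 512), dtype=np.int64)

logits = session.run(None, {input_name: dummy})[0]
print(logits.shape)  # one row of raw logits per input sequence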

Then I used the model in Elixir:

{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")

{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)

# Tokenize by splitting on spaces and looking each word up in the vocab.
# Words that are missing from the vocab map to nil here.
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)

# Pad with zeros up to the fixed sequence length of 512.
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])

{init_fn, predict_fn} = Axon.build(model)

predict_fn.(params, token_tensor)
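
To sanity-check the Elixir output, the same text can be run through the original Python package as a reference; a minimal sketch:

from germansentiment import SentimentModel

model = SentimentModel()

# The package handles tokenization and softmax internally
# and returns one label per input text.
print(model.predict_sentiment(["Ein schlechter Film"]))
# should print something like ['negative']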

But I still have some problems/questions:

1. Is this correct?
2. Why do some keys in the vocab.json start with ##?
3. Why are some keys named [unused{x}]?
4. Why are the predictions signed floats instead of values scaled from 0 to 1?
5. Why do some strings not work in my version? The string "Ein scheiß Film" works on Hugging Face but not in the export.
6. Why are some keys in capital letters, while the text is always converted to lowercase?

About 4): I currently scale the prediction as follows:

prediction = predict_fn.(params, token_tensor)

# Base-2 softmax over the logits, then a signed score in [-5, 5]
# from the difference of the first two class probabilities.
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
political_score = 5 * (Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1]))
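
On question 4: the raw outputs are logits, which is why they are signed and unbounded. The usual way to map them to probabilities is a base-e softmax rather than the base-2 variant above; a small numpy sketch with made-up logit values:

import numpy as np

logits = np.array([[2.1, -1.3, 0.4]])  # hypothetical raw model output

# Subtract the row max for numerical stability, then normalize.
exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exps / exps.sum(axis=-1, keepdims=True)
print(probs)  # each row sums to 1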

About 5): In my version above, keys that are not matched return nil. I changed that to 0, but that changes the meaning of the sentence.
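
On the nil issue: in standard BERT vocabularies id 0 is the [PAD] token, so mapping unmatched words to 0 silently pads them away instead of marking them as unknown. Assuming the usual special tokens are present in the exported vocab.json, the [UNK] id can be used as the fallback instead:

import json

with open("vocab.json") as f:
    vocab = json.load(f)

# [PAD] usually sits at id 0; unknown words should map to [UNK] instead.
print(vocab.get("[PAD]"), vocab.get("[UNK]"))

unk_id = vocab["[UNK]"]
token_id = vocab.get("scheiß", unk_id)  # falls back to [UNK] for misses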

I opened a question in the Elixir Forum about it here.

oliverguhr commented 1 year ago

Hi @sehHeiden, this is an interesting question. The problem is the tokenization. The process is a bit more complex than splitting on whitespace: longer and compound words get split up into individual tokens; it works a bit like a simple compression algorithm. The Hugging Face team has a library for all the different tokenizers. To make it work, you would need to implement the BertTokenizer in Elixir or build a wrapper for the compiled Rust tokenizers from this lib.
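
For comparison, this is what the tokenizer does on the Python side; a short sketch using the transformers BertTokenizer (the exact word-piece splits shown in the comments are illustrative):

from transformers import BertTokenizer

# The germansentiment package loads this model from the Hugging Face hub.
tokenizer = BertTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")

print(tokenizer.tokenize("Ein scheiß Film"))
# words missing from the vocab are split into pieces,
# e.g. something like ['ein', 'sch', '##eiß', 'film']

print(tokenizer("Ein scheiß Film")["input_ids"])
# adds the [CLS] and [SEP] special token ids around the piece ids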

Or you could use a tool to run the original Python code from Elixir, something like this.