nlpodyssey / cybertron

Cybertron: the home planet of the Transformers in Go

Not able to run within a go func #10

Closed: jonathan-wondereur closed this 9 months ago

jonathan-wondereur commented 1 year ago

When I try to do text classification within parallel goroutines over around 600 rows of data, I run into a crash that I think is caused by excessive memory use.
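Roughly what I'm doing, reduced to a minimal sketch (the model name and the concurrency cap are placeholders, and the Load/Classify calls follow the README):

```go
package main

import (
	"context"
	"fmt"

	"github.com/nlpodyssey/cybertron/pkg/tasks"
	"github.com/nlpodyssey/cybertron/pkg/tasks/textclassification"
	"golang.org/x/sync/errgroup"
)

func classifyAll(rows []string) error {
	// Load the model once, outside the goroutines.
	m, err := tasks.Load[textclassification.Interface](&tasks.Config{
		ModelsDir: "models",
		ModelName: "distilbert-base-uncased-finetuned-sst-2-english", // placeholder
	})
	if err != nil {
		return err
	}

	g, ctx := errgroup.WithContext(context.Background())
	g.SetLimit(8) // cap concurrent classifications to bound memory use

	for _, row := range rows {
		row := row // capture loop variable (pre-Go 1.22)
		g.Go(func() error {
			result, err := m.Classify(ctx, row)
			if err != nil {
				return err
			}
			fmt.Println(result)
			return nil
		})
	}
	return g.Wait()
}
```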

jonathan-wondereur commented 1 year ago

Not sure if this is related or the same issue, but after processing hundreds of thousands of lines, I run into a crash.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xa1c680]

goroutine 3235221165 [running]:
github.com/nlpodyssey/spago/ag.(*Operator).forward(0xc03e301b80)
        /home/jonathan/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/operator.go:194 +0x80
created by github.com/nlpodyssey/spago/ag.NewOperator
        /home/jonathan/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/operator.go:58 +0xee
JackKCWong commented 1 year ago

I have the same error when running textencoding.Encode a few hundred times.

matteo-grella commented 1 year ago

Thank you @jonathan-wondereur and @JackKCWong.

That error could be caused by an input text that produces more tokens than the neural model supports.

Do you have an example of text that systematically produces that error?

For textencoding.Encode, please use Cybertron version v0.1.3-0.20230219111654-ef2ca134a6d3 in your project.
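To pin it, run go get github.com/nlpodyssey/cybertron@v0.1.3-0.20230219111654-ef2ca134a6d3. A minimal Encode sketch for reference (the model name is a placeholder; the Encode call and pooling constant follow the README example):

```go
package main

import (
	"context"
	"fmt"

	"github.com/nlpodyssey/cybertron/pkg/models/bert"
	"github.com/nlpodyssey/cybertron/pkg/tasks"
	"github.com/nlpodyssey/cybertron/pkg/tasks/textencoding"
)

func encodeOne(text string) error {
	m, err := tasks.Load[textencoding.Interface](&tasks.Config{
		ModelsDir: "models",
		ModelName: "sentence-transformers/all-MiniLM-L6-v2", // placeholder
	})
	if err != nil {
		return err
	}
	result, err := m.Encode(context.Background(), text, int(bert.MeanPooling))
	if err != nil {
		return err
	}
	fmt.Println(result.Vector.Size()) // embedding length
	return nil
}
```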

jonathan-wondereur commented 1 year ago

That is what I thought too, but I re-ran the lines that produced the error and they all worked fine as far as I can tell. I will dig more and see if I can find an example.

How can I avoid a crash if the text is too long or produces more tokens than the neural model supports? I do not always have control of the input text. How should I break up or filter the input so it does not crash the whole program?

It would be much better if Cybertron returned an error in these cases rather than crashing my whole program.
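In the meantime, this is the rough guard I have in mind (a heuristic sketch; the word cap is a guess on my part, not a documented limit):

```go
package main

import "strings"

// truncateWords caps the input at a conservative number of
// whitespace-separated words, well below the model's token limit,
// since wordpiece tokenization can emit several tokens per word.
func truncateWords(text string, maxWords int) string {
	words := strings.Fields(text)
	if len(words) <= maxWords {
		return text
	}
	return strings.Join(words[:maxWords], " ")
}
```

e.g. passing truncateWords(input, 300) to Classify instead of the raw text. An error return from the library would still be the better fix.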

matteo-grella commented 1 year ago

Alright, I have implemented error handling for the scenario where the tokenizer output exceeds the neural model's capacity. Could you please test the latest changes at HEAD and give me your feedback?

matteo-grella commented 1 year ago

@jonathan-wondereur

bkono commented 1 year ago

@matteo-grella Not the OP, but I can confirm HEAD fixes my previous segfault.

ERR | onDirEvent: failed to encode text: input sequence too long: 641 > 512

☝️ -> that was previously a segfault.

Definitely solves my issue. Hopefully that covers what OP saw as well.
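For anyone else hitting this before a tagged release: with HEAD the failure surfaces as an ordinary error, so oversized documents can be skipped instead of killing the process. A sketch of what I'm doing (the substring match is my own workaround; I don't know whether a typed error is exported for this case):

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/nlpodyssey/cybertron/pkg/models/bert"
	"github.com/nlpodyssey/cybertron/pkg/tasks/textencoding"
)

// encodeAll skips documents the model cannot handle instead of
// aborting the whole run.
func encodeAll(ctx context.Context, m textencoding.Interface, texts []string) error {
	for _, t := range texts {
		if _, err := m.Encode(ctx, t, int(bert.MeanPooling)); err != nil {
			if strings.Contains(err.Error(), "input sequence too long") {
				log.Printf("skipping document: %v", err) // e.g. "641 > 512"
				continue
			}
			return err
		}
	}
	return nil
}
```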

matteo-grella commented 1 year ago

Thanks! BTW @bkono, what do you mean by OP?

bkono commented 1 year ago

OP == original poster, @jonathan-wondereur in this case.

jonathan-wondereur commented 1 year ago

Ya, sorry, I have not yet had time to test this...

matteo-grella commented 9 months ago

Closing as this is solved in v0.2.0.