terminalstatic / go-xsd-validate

Xsd validation for go based on libxml2
MIT License

SIGSEGV or SIGBUS when validating #15

Open AlwxSin opened 1 year ago

AlwxSin commented 1 year ago

I've got a strange issue when trying to validate a large number (more than 100) of large XML files (20-130 MB). It looks like this:

unexpected fault address 0xc0068c3000
fatal error: fault
[signal SIGBUS: bus error code=0x4 addr=0xc0068c3000 pc=0x55dffd]

or

SIGSEGV: segmentation violation
PC=0x7f19813d22222e3 m=43 sigcode=1
signal arrive during cgo execution

The stack trace is always the same:

runtime.cgocall(0x868de0, 0xc000739698)
  /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000739670 sp=0xc000739638 pc=0x40a68b
github.com/terminalstatic/go-xsd-validate._C2func_cParseDoc(0x7fe986193010, 0x5e6c74f, 0x1)
  _cgo_gotypes.go:254 +0x57 fp=0xc000739698 sp=0xc000739670 pc=0x538d57
github.com/terminalstatic/go-xsd-validate.parseXmlMem.func3(0x7fe986193010?, {0xc020aa0000?, 0x5e6c74f, 0x757e000?}, 0x1?)
  /builds/app/.go/pkg/mod/github.com/terminalstatic/go-xsd-validate@v0.1.5/libxml2.go:433 +0x5a fp=0xc0007396e8 sp=0xc000739698 pc=0x5397fa
github.com/terminalstatic/go-xsd-validate.parseXmlMem({0xc020aa0000, 0x5e6c74f, 0x757e000}, 0xfe?)
  /builds/app/.go/pkg/mod/github.com/terminalstatic/go-xsd-validate@v0.1.5/libxml2.go:433 +0xa5 fp=0xc000739780 sp=0xc0007396e8 pc=0x5395e5
github.com/terminalstatic/go-xsd-validate.NewXmlHandlerMem({0xc020aa0000?, 0xc00f542010?, 0x0?}, 0x1d?)
  /builds/app/.go/pkg/mod/github.com/terminalstatic/go-xsd-validate@v0.1.5/validate_xsd.go:94 +0x29 fp=0xc0007397d8 sp=0xc000739780 pc=0x53a3e9

The problem is that I can't reproduce it on my machine with the same files, and I don't have access to the server where the error occurs.

The error can happen at any time, on any file. I can't reproduce it on one exact file or set of files, only on a bunch of XML files. It can happen on the second file or on the 29th; there is no pattern.

How can I debug or reproduce the error? Maybe there is a bug in the C code?

terminalstatic commented 1 year ago

I've been away from Go and C for quite some time, so YMMV. If I understand it right, it only happens on a particular machine? Does the architecture of that machine differ from your dev machine? What OS are it and your dev machine running? Do you cross compile? Which version of Go are you using? Did you try a different (possibly older) one?

AlwxSin commented 1 year ago

I think it depends on the data rather than the hardware, because we have tested several setups (go1.21 everywhere):

We do not cross compile; the binary is built on the same OS where it runs. And we can't try older versions, because we need the updates from #12.

AlwxSin commented 1 year ago

Ah, I forgot to mention that we never run production data on local or dev machines.

terminalstatic commented 1 year ago

Understandable, but really hard to debug then. Maybe far-fetched, but when you handle heaps of data concurrently (do you?), did you play around with Go's memory settings?

AlwxSin commented 1 year ago

Yeah, we process XML files concurrently in workers. Could that affect it?

play around with Go's memory settings?

No, default everywhere.

terminalstatic commented 1 year ago

Just to make sure: you are freeing correctly and are calling Init and Cleanup only once?

AlwxSin commented 1 year ago

I should. Init on startup and then in each worker I handle files, validation and cleanup.

terminalstatic commented 1 year ago

Cleanup or Free? Cleanup should only be called when program or part of program exits ...

AlwxSin commented 1 year ago

Cleanup on program exit and Free on workers' job done.
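For readers following along, the lifecycle described here (Init once at startup, Free after each job inside the worker, Cleanup once when the program exits) can be sketched roughly as below. This is only an illustration of the call ordering: fakeLib, fakeHandler and runJobs are hypothetical stand-ins that count calls, not the go-xsd-validate API.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeLib is a hypothetical stand-in for a libxml2 binding. It is NOT the
// go-xsd-validate API; it only counts calls so the ordering can be checked.
type fakeLib struct {
	mu       sync.Mutex
	inits    int
	cleanups int
	live     int // handlers created but not yet freed
}

func (l *fakeLib) Init() {
	l.mu.Lock()
	l.inits++
	l.mu.Unlock()
}

func (l *fakeLib) Cleanup() {
	l.mu.Lock()
	l.cleanups++
	l.mu.Unlock()
}

// fakeHandler stands in for a per-document handler that owns C memory.
type fakeHandler struct{ lib *fakeLib }

func (l *fakeLib) NewHandler(doc []byte) *fakeHandler {
	l.mu.Lock()
	l.live++
	l.mu.Unlock()
	return &fakeHandler{lib: l}
}

func (h *fakeHandler) Free() {
	h.lib.mu.Lock()
	h.lib.live--
	h.lib.mu.Unlock()
}

// runJobs calls Init once, frees every handler inside the worker that
// created it, and calls Cleanup once after all workers have finished.
func runJobs(lib *fakeLib, docs [][]byte, workers int) {
	lib.Init()
	defer lib.Cleanup() // runs only after wg.Wait() has returned

	jobs := make(chan []byte)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for doc := range jobs {
				func() {
					h := lib.NewHandler(doc)
					defer h.Free() // per job, never Cleanup here
					// validation of doc would happen here
				}()
			}
		}()
	}
	for _, d := range docs {
		jobs <- d
	}
	close(jobs)
	wg.Wait()
}

func main() {
	lib := &fakeLib{}
	docs := [][]byte{[]byte("<a/>"), []byte("<b/>"), []byte("<c/>")}
	runJobs(lib, docs, 2)
	fmt.Println(lib.inits, lib.cleanups, lib.live) // prints "1 1 0"
}
```

The inner anonymous function is there so that the deferred Free runs at the end of each job rather than when the worker goroutine exits.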

terminalstatic commented 1 year ago

Can't really help immediately then I guess. Only thing I could think of is to fake huge xml requests and concurrency using go 1.21 but unfortunately I don't really have spare time currently to set this up.
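A rough idea of how the "fake huge XML" part could be set up, as a sketch only: the helper below streams a well-formed document of at least a requested size to any io.Writer. The names writeSyntheticXML and countingWriter and the records/record elements are made up for illustration; a real reproduction would need a document that matches the XSD in question.

```go
package main

import (
	"fmt"
	"io"
)

// countingWriter discards its input and records how many bytes it received.
// It is a hypothetical helper so the sketch can run without touching disk.
type countingWriter struct{ n int }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += len(p)
	return len(p), nil
}

// writeSyntheticXML writes a well-formed XML document of at least minBytes
// bytes to w and returns the number of bytes written. The element names are
// invented; they do not follow any particular schema.
func writeSyntheticXML(w io.Writer, minBytes int) (int, error) {
	written := 0
	n, err := io.WriteString(w, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<records>\n")
	written += n
	if err != nil {
		return written, err
	}
	// Keep appending records until the requested size is reached.
	for i := 0; written < minBytes; i++ {
		n, err = fmt.Fprintf(w, "  <record id=\"%d\">payload</record>\n", i)
		written += n
		if err != nil {
			return written, err
		}
	}
	n, err = io.WriteString(w, "</records>\n")
	written += n
	return written, err
}

func main() {
	var cw countingWriter
	// 1 MiB keeps the demo fast; raise this toward 100<<20 to approximate
	// the file sizes mentioned in this thread.
	n, err := writeSyntheticXML(&cw, 1<<20)
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("bytes written:", n)
}
```

Passing an *os.File or a bytes.Buffer instead of countingWriter would produce the actual test input.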

terminalstatic commented 1 year ago

Just a short note: I tested this a little and I still suspect that the cause could be Go's memory management. You could try to play around with the time parameter of the InitWithGc function and the GOMEMLIMIT env variable. You could also check your system's ulimit settings. You could probably also try to delay worker execution and/or turn the concurrency down.
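As a minimal sketch of the last two suggestions, assuming the validation call can be wrapped in a plain function: the code below caps the number of validations in flight with a buffered channel and sets a soft memory limit from code with debug.SetMemoryLimit (Go 1.19+), which has the same effect as the GOMEMLIMIT variable. validateAll, the 2 GiB value and the limit of 20 are illustrative choices, not recommendations from the library. Note that this limit only covers memory managed by the Go runtime, so memory allocated on the C side is not counted by it.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
	"sync/atomic"
)

// validateAll runs validate on every document with at most limit calls in
// flight. It returns the highest concurrency it observed and one error slot
// per document. validate is a placeholder for the real validation call.
func validateAll(docs [][]byte, limit int, validate func([]byte) error) (int64, []error) {
	sem := make(chan struct{}, limit) // counting semaphore
	errs := make([]error, len(docs))
	var wg sync.WaitGroup
	var active, peak int64

	for i, doc := range docs {
		sem <- struct{}{} // blocks while limit validations are running
		wg.Add(1)
		go func(i int, doc []byte) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done

			// Track the peak number of concurrent validations.
			cur := atomic.AddInt64(&active, 1)
			for {
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}

			errs[i] = validate(doc)
			atomic.AddInt64(&active, -1)
		}(i, doc)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak), errs
}

func main() {
	// Same effect as running with GOMEMLIMIT=2GiB; the value is arbitrary.
	debug.SetMemoryLimit(2 << 30)

	docs := make([][]byte, 100) // placeholders for real XML payloads
	peak, errs := validateAll(docs, 20, func([]byte) error { return nil })
	fmt.Println(peak <= 20, len(errs)) // prints "true 100"
}
```

Acquiring the semaphore before starting the goroutine also keeps the number of goroutines bounded, which might matter if each one is about to hold a 100 MB buffer.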

Funny thing is, my Mac with go 1.21 fails pretty soon when testing with a concurrency of 100 and a 100 MB XML file ... my Linux machine with go1.17 just chugs along, with the restriction that with a concurrency of 100 it sometimes just stalls because of CPU load. I'd personally give it a lower concurrency setting; at least on my hardware, around 20 performs rather well.

I will give this a spin a little later on my Linux machine with go 1.21 and check whether it makes a difference.

AlwxSin commented 1 year ago

Does it fail with the same stack trace?

terminalstatic commented 1 year ago

On the Mac it just fails with a killed signal and no trace; on Linux it never fails.