strawlab / strand-braid

Live, low-latency 2D and 3D tracking from single or multiple high-speed cameras
https://strawlab.org/braid/

corrupted size vs. prev_size & double free or corruption (!prev) error #21

Closed RemcoPietersWUR closed 7 months ago

RemcoPietersWUR commented 1 year ago

I'm testing the 0.12.0-alpha.2 release and getting the corrupted size vs. prev_size error (Ubuntu 22.04) or the double free or corruption (!prev) error (Ubuntu 20.04). It's a bit erratic: sometimes the error pops up directly after synchronization, other times it happens hours after tracking starts.

astraw commented 1 year ago

Is it the strand-cam or the braid program that is crashing? I guess strand-cam. I guess it is due to removing jemallocator (in 7893896a356fe5a91e9bd6547df56e5d4d07dcf0). I do not understand why this seems to affect you but not us. I started a new build in which I re-enabled jemallocator and will make a new alpha preview release which you can test. I will update this ticket when that is ready.
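For readers unfamiliar with how a Rust binary swaps allocators, the sketch below shows the mechanism that "re-enabling jemallocator" relies on: a single static marked `#[global_allocator]`. It is an illustration only, not code from strand-braid. `CountingAlloc`, `ALLOCATIONS`, and `allocations_so_far` are hypothetical names, and the wrapper simply forwards to the system allocator so that the example builds with the standard library alone. With the jemallocator crate, the static would presumably be declared with the type `jemallocator::Jemalloc` instead.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper (not from strand-braid): forwards every request
/// to the system allocator and counts how many allocations were made.
pub struct CountingAlloc;

pub static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This attribute selects the allocator for the whole binary.
// Removing it falls back to the platform default (glibc malloc on Linux).
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocation calls observed so far.
pub fn allocations_so_far() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}

fn main() {
    let before = allocations_so_far();
    // black_box keeps the optimizer from eliding the allocation.
    let buf = std::hint::black_box(vec![0u8; 4096]);
    println!(
        "allocations while creating the buffer: {}",
        allocations_so_far() - before
    );
    drop(buf);
}
```

Because the allocator is chosen per binary, a heap-corruption bug in the program or in a linked native library can surface differently, or not at all, depending on which allocator is in place.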

RemcoPietersWUR commented 1 year ago

I guess Strand-cam crashes. From the tracking data we see that the cameras stop working, one by one. With Ctrl+C, Braid closes cleanly. I will try to reproduce the error with only Strand-cam running.


astraw commented 1 year ago

I re-enabled jemallocator in Strand Cam and made a new alpha preview release (0.12.0-alpha.3). Can you check if this solves the issue for you?

If this causes the issue to go away for you, great, but I would still guess there is a bug somewhere which is only bypassed (not fixed) when jemallocator is used. For the record, can you confirm this happens with Basler cameras? Have you also tried the preview release with Allied Vision cameras and do you see any issues there? This would help narrow down the underlying issue.

RemcoPietersWUR commented 1 year ago

The alpha 3 release didn't solve the issue. On the 22.04 machine we get the error immediately after syncing the cameras. Some log info:

Oct 14 09:46:07.952 INFO flydra2::connected_camera_manager: All expected cameras synchronized.
Oct 14 09:46:08.908 INFO braid_run::mainbrain: All cameras done synchronizing.
corrupted size vs. prev_size
corrupted size vs. prev_size
corrupted size vs. prev_size
corrupted size vs. prev_size
Oct 14 09:46:14.212 ERROR braid_run::multicam_http_session_handler: HttpSessionHandler::post() got error hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }))
Oct 14 09:46:14.212 ERROR braid_run::mainbrain: error sending clock model: error trying to connect: tcp connect error: Connection refused (os error 111)
Oct 14 09:46:15.715 ERROR braid_run::multicam_http_session_handler: HttpSessionHandler::post() got error hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }))
Oct 14 09:46:15.715 ERROR braid_run::mainbrain: error sending clock model: error trying to connect: tcp connect error: Connection refused (os error 111)

Is the log info saved somewhere? For now I copied it from the terminal window.

On the 20.04 machine everything is still working, I will keep it on during the night.

We are only using Basler cameras, so yes. I will ask if someone else at the university has Allied Vision cameras.
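The "Connection refused (os error 111)" lines in the log above are what a TCP client sees when nothing is listening on the target port any more, which would be consistent with the camera process having died after the corruption messages. As a rough sketch, the check below distinguishes a live listener from a refused connection. `Probe` and `probe` are hypothetical names for this illustration, and the address used is a throwaway local port, not one that Braid or Strand Cam is known to use.

```rust
use std::io::ErrorKind;
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::Duration;

/// Outcome of a single connection attempt (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub enum Probe {
    Listening,
    Refused,
    Other(ErrorKind),
}

/// Try to open a TCP connection and classify the result.
pub fn probe(addr: SocketAddr, timeout: Duration) -> Probe {
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_) => Probe::Listening,
        // On Linux this corresponds to errno 111 (ECONNREFUSED).
        Err(e) if e.kind() == ErrorKind::ConnectionRefused => Probe::Refused,
        Err(e) => Probe::Other(e.kind()),
    }
}

fn main() -> std::io::Result<()> {
    // Stand-in for a server process: bind to any free local port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let timeout = Duration::from_secs(1);

    println!("while the listener is alive: {:?}", probe(addr, timeout));

    // Dropping the listener mimics the server process exiting.
    drop(listener);
    println!("after the listener is gone: {:?}", probe(addr, timeout));
    Ok(())
}
```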

astraw commented 1 year ago

No, the log is not saved anywhere.
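Since the programs do not write a log file themselves, one workaround is to duplicate the terminal output into a file while it is being printed. On Linux the usual way is to pipe the combined output through `tee` (`2>&1 | tee some-file.log`). The sketch below does the same thing in Rust, flushing after every line so that the last messages before a crash are kept. `tee_lines` and the file name `braid-session.log` are assumptions made for this example, not part of Braid.

```rust
use std::fs::OpenOptions;
use std::io::{self, BufRead, Write};

/// Copy every line from `input` to both `terminal` and `log`.
/// Returns the number of lines copied.
pub fn tee_lines<R: BufRead, T: Write, L: Write>(
    input: R,
    mut terminal: T,
    mut log: L,
) -> io::Result<usize> {
    let mut count = 0;
    for line in input.lines() {
        let line = line?;
        writeln!(terminal, "{line}")?;
        writeln!(log, "{line}")?;
        // Flush per line so the tail of the log survives an abrupt exit.
        log.flush()?;
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Hypothetical log file name; append so earlier sessions are kept.
    let log = OpenOptions::new()
        .create(true)
        .append(true)
        .open("braid-session.log")?;
    let copied = tee_lines(io::stdin().lock(), io::stdout().lock(), log)?;
    eprintln!("saved {copied} lines");
    Ok(())
}
```

The program reads from standard input, so the output of the tracked process has to be piped into it with standard error redirected to standard output first.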

The corrupted size vs. prev_size indeed seems to point to a deeper issue somewhere which I would love to fix... I will look into it, but likely not for at least a week, I am afraid. I hope you can move forward with Ubuntu 20.04 for now.

Perhaps Ubuntu 22.04 has some changes that make memory corruption easier to find by crashing immediately. This may be helpful in tracking down the cause and ultimately fixing it.

RemcoPietersWUR commented 1 year ago

For your information, on the Ubuntu 20.04 system we get the same error: strand-cam-pylon assert failure: corrupted size vs. prev_size. However, it takes longer to appear (8+ hours). I could not find any Allied Vision camera users on campus.

astraw commented 1 year ago

We installed Ubuntu 22.04 on one of our machines and cannot replicate the issue, even with Basler cameras. (At least so far -- without letting the cameras run for some hours. I will now do this overnight.)

In the meantime, a few more questions: does it happen even if only a single camera is in use? Does it happen with strand-cam-pylon alone, or does it require braid and multiple cameras? What if you set your trigger rate much lower: does it still happen when the cameras are running slowly?

If we cannot replicate the problem here, it is going to be hard for us to debug.

RemcoPietersWUR commented 1 year ago

Strand-cam-pylon alone works fine; we tested it for several hours. When I run Braid with a single camera, it crashes with the same corrupted size vs. prev_size error. With a frame rate of 5 fps, I first get a warning from flydra_feature_detector: Basler_22903869 acquisition duration statistics: mode 69 msec, max 99+ msec (longest: 47).

Regarding the debugging, would it be useful if I sent a PC from our lab to you? We have an identical machine that I could set up, and even a few cameras that are currently not in use.

astraw commented 1 year ago

I just sent you an email about enabling remote access to the machine, which should be much easier than shipping machines around.

astraw commented 7 months ago

As an update, Remco reported that reverting to Ubuntu 20.04 allowed them to work around this issue. In the meantime, I made a couple of changes, such as b4ed75cd26757a2449da48b319e64374e107bbd3, which will hopefully help the situation. @RemcoPietersWUR I am closing the issue for now, but if it arises again, please re-open it.