pgpointcloud / pointcloud

A PostgreSQL extension for storing point cloud (LIDAR) data.
https://pgpointcloud.github.io/pointcloud/
Other
397 stars 108 forks source link

Crash on several architectures #313

Closed df7cb closed 2 years ago

df7cb commented 2 years ago

Seen on Debian on ppc64el with pointcloud 1.2.2:

16:03:12   SELECT PC_AsText(PC_Explode(PC_Patch(pt))) FROM pt_test;
16:03:12 ! server closed the connection unexpectedly
16:03:12 !  This probably means the server terminated abnormally
16:03:12 !  before or while processing the request.
16:03:12 ! connection to server was lost

Build log: https://pgdgbuild.dus.dg-i.net/job/pgpointcloud-binaries/27/architecture=ppc64el,distribution=sid/console Build status: https://pgdgbuild.dus.dg-i.net/job/pgpointcloud-binaries/

df7cb commented 2 years ago

It's also hitting amd64 with PG12 on Ubuntu Jammy: https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/pgpointcloud-binaries-snapshot/lastFailedBuild/architecture=amd64,distribution=jammy/console

pblottiere commented 2 years ago

Hello @df7cb,

First thank you for reporting this issue.

By taking a look with Valgrind on a pc_explode request, I noticed that fishy message:

==154307== 4 errors in context 3 of 4:
==154307== Conditional jump or move depends on uninitialised value(s)
==154307==    at 0x123D6708: pc_schema_from_pcid (pc_pgsql.c:389)
==154307==    by 0x123D4C63: pcpatch_unnest (pc_access.c:584)
==154307==    by 0x3C84E7: ExecMakeFunctionResultSet (execSRF.c:614)
==154307==    by 0x3EA717: ExecProjectSRF (nodeProjectSet.c:175)
==154307==    by 0x3EA804: ExecProjectSet (nodeProjectSet.c:105)
==154307==    by 0x3E42D8: UnknownInlinedFun (executor.h:248)
==154307==    by 0x3E42D8: ExecLimit (nodeLimit.c:96)
==154307==    by 0x3BF8D3: ExecProcNode (executor.h:248)
==154307==    by 0x3BF8D3: ExecutePlan (execMain.c:1632)
==154307==    by 0x3BF8D3: standard_ExecutorRun (execMain.c:350)
==154307==    by 0x5271BA: PortalRunSelect (pquery.c:921)
==154307==    by 0x5285CF: PortalRun (pquery.c:765)
==154307==    by 0x524146: exec_simple_query (postgres.c:1238)
==154307==    by 0x525DE9: PostgresMain (postgres.c:4347)
==154307==    by 0x234E5E: main (main.c:206)
==154307==  Uninitialised value was created by a heap allocation
==154307==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==154307==    by 0x65A78A: AllocSetContextCreateInternal (aset.c:468)
==154307==    by 0x640BE6: init_MultiFuncCall (funcapi.c:85)
==154307==    by 0x123D4C34: pcpatch_unnest (pc_access.c:568)
==154307==    by 0x3C84E7: ExecMakeFunctionResultSet (execSRF.c:614)
==154307==    by 0x3EA717: ExecProjectSRF (nodeProjectSet.c:175)
==154307==    by 0x3EA804: ExecProjectSet (nodeProjectSet.c:105)
==154307==    by 0x3E42D8: UnknownInlinedFun (executor.h:248)
==154307==    by 0x3E42D8: ExecLimit (nodeLimit.c:96)
==154307==    by 0x3BF8D3: ExecProcNode (executor.h:248)
==154307==    by 0x3BF8D3: ExecutePlan (execMain.c:1632)
==154307==    by 0x3BF8D3: standard_ExecutorRun (execMain.c:350)
==154307==    by 0x5271BA: PortalRunSelect (pquery.c:921)
==154307==    by 0x5285CF: PortalRun (pquery.c:765)
==154307==    by 0x524146: exec_simple_query (postgres.c:1238)

Then, when I run that specific request a lot, this leads to a crash sometimes.

Actually, it's due to the fact that we check if fcinfo->flinfo->fn_extra is NULL on the first call (after SRF_IS_FIRSTCALL()) for intializing the internal cache of schemas. However, it's not true due to SRF_FIRSTCALL_INIT() so it leads to some unstable behavior because the cache is not always initialised (thus the Conditional jump or move depends on uninitialised value(s) message returned by Valgrind).

I'm currently working on a PR for fixing that.

df7cb commented 2 years ago

Glad to see progress here, thanks!

pblottiere commented 2 years ago

Hello @df7cb,

I merged some bugfixes and made a new 1.2.3 release. Let me know if you still have issues.

df7cb commented 2 years ago

Thanks Paul!

The architectures are fine now, modulo some test differences on big-endian machines that I muted in the Debian package, but that isn't a problem.

pblottiere commented 2 years ago

The architectures are fine now

Good news :tada:

modulo some test differences on big-endian machines

Yes I didn't forget your previous ticket https://github.com/pgpointcloud/pointcloud/issues/278. I'm going to take a look asap.

Thanks a lot for your work @df7cb :+1: