johnkerl opened 1 month ago
Full CI manual run: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407459107
Pin-fail on commit 1: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407459107/job/25913248219
Trying commit 2: https://github.com/single-cell-data/TileDB-SOMA/pull/2692/commits https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407580430
Commit 3, with lint fix: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407659739
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 90.19%. Comparing base (`6d6c90e`) to head (`3b2f36a`).
Here's the problem:
See also
This PR's full CI is red across the board for all Python versions on MacOS: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407659739
And as of today this still reproduces on my Mac:
$ pip list | grep arrow
pyarrow 12.0.1
$ pip install pyarrow==14.0
Collecting pyarrow==14.0
...
Successfully installed pyarrow-14.0.0
johnkerl@exade[prod][HEAD][tiledb]$ python
Python 3.9.13 (v3.9.13:6de2ca5339, May 17 2022, 11:23:25)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import tiledb
>>> tiledb.open('s3://anything/at/all')
Fatal error condition occurred in /Users/runner/work/crossbow/crossbow/vcpkg/buildtrees/aws-c-common/src/v0.8.9-fed0b55d0f.clean/source/allocator.c:121: allocator != ((void*)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
1 libarrow.1400.dylib 0x000000010faf4fe6 aws_fatal_assert + 70
2 libarrow.1400.dylib 0x000000010faf3e57 aws_mem_acquire + 55
3 libarrow.1400.dylib 0x000000010fb07392 aws_string_new_from_cursor + 50
4 libarrow.1400.dylib 0x000000010fb01328 aws_json_value_get_from_object + 40
...
Same problem on `pip install pyarrow==13`. We need to go down to `pyarrow==12` to avoid this nasty crash. And yet Python 3.12 doesn't have pyarrow 12 or below.
cc @ihnorton
Yeah, I am seeing this now too. It looks like it is specific to when pyarrow is imported before tiledb, which is why CI and tests ran fine.
Needs discussion on whether we:
None of these feels ideal.
Personally, I think the problems caused by upper-bounding the allowed versions of arrow are more of an issue than python 3.12 support specifically. And the macOS failure on python 3.12 is more of a symptom.
Is there an issue tracking this for tiledb? And is it python specific? It looks like tiledb and arrow are both trying to do something with AWS auth at the C level that's causing the issue.
I guess in tiledbsoma you have the option of making sure `tiledb` is imported before `pyarrow` on `import tiledbsoma`, but you gotta hope people didn't import `pyarrow` first. I guess you could probably check for that, though.
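A minimal sketch of such a check, using only the standard library (the function name and warning text here are made up for illustration, not from any tiledbsoma API):

```python
import sys
import warnings


def warn_if_pyarrow_first() -> bool:
    """Hypothetical guard: return True if pyarrow is already loaded while
    tiledb is not -- the ordering known to abort on MacOS with
    pyarrow >= 13 once an s3:// URI is opened."""
    bad_order = "pyarrow" in sys.modules and "tiledb" not in sys.modules
    if bad_order:
        warnings.warn(
            "pyarrow was imported before tiledb; on MacOS with pyarrow >= 13 "
            "this ordering can crash in the embedded AWS SDK when opening "
            "s3:// URIs",
            RuntimeWarning,
            stacklevel=2,
        )
    return bad_order
```

Running something like this at the top of `import tiledbsoma` couldn't fix an already-bad ordering, but it would at least surface a warning instead of a hard abort later.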
Also, if this behavior is good enough for `tiledb-py`, surely it is good enough for `tiledbsoma`?
Which ends up meaning, I'd go for `tiledbsoma` doing:
- Say we do Python 3.12, but with a heavy caveat: "watch your import ordering due to this issue"
- And have `tiledb` / `tiledb-py` take ownership of the issue and fix it.
> And have `tiledb` / `tiledb-py` take ownership of the issue and fix it.
We had extensive internal discussions a few months ago and determined it wasn't our bug.
I'll resuscitate those discussions freshly. cc @ihnorton .
If we were mistaken and it is a bug of ours, that will be a learning; if we're not, there may still be mitigation options we can take to work around the pyarrow issue.
Please see also:
and in particular:
cc @ivirshup @ihnorton
> We had extensive internal discussions a few months ago and determined it wasn't our bug.
Ah, sorry, I wasn't making a value judgement about whose fault this is. It's more that `import pyarrow, tiledb; tiledb.open("s3://...")` is really an issue/bug on the `tiledb` side and not the responsibility of `tiledbsoma`.
My suggested short-term mitigation would be: import `tiledb` first, or downgrade pyarrow.

Has an issue been opened over on the arrow side? It seems like the underlying cause could be an issue with how they do some AWS configuration on import, so maybe this could even be patched on their side?
Good news -- I found "one more import" (#2734) which helps ...
Some status:
#!/usr/bin/env python
import pyarrow
import tiledb
tiledb.open("s3://anything/at/all")
fails with
Fatal error condition occurred in /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/vcpkg/buildtrees/aws-c-common/src/v0.8.9-fed0b55d0f.clean/source/allocator.c:121: allocator != ((void*)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
1 libarrow.1400.dylib 0x0000000106f90150 aws_fatal_assert + 80
2 libarrow.1400.dylib 0x0000000106f8f528 aws_mem_acquire + 64
3 libarrow.1400.dylib 0x0000000106fa1cc4 aws_string_new_from_cursor + 76
4 libarrow.1400.dylib 0x0000000106f9bb94 aws_json_value_get_from_object + 44
5 libarrow.1400.dylib 0x0000000106f8830c aws_endpoints_ruleset_new_from_string + 120
6 libarrow.1400.dylib 0x0000000106f22430 _ZN3Aws3Crt9Endpoints10RuleEngineC2ERK15aws_byte_cursorS5_P13aws_allocator + 48
7 libarrow.1400.dylib 0x00000001066cb820 _ZN3Aws8Endpoint23DefaultEndpointProviderINS_2S321S3ClientConfigurationENS2_8Endpoint19S3BuiltInParametersENS4_25S3ClientContextParametersEEC2EPKcm + 120
8 libtiledb.dylib 0x0000000150982484 _ZN3Aws2S38S3ClientC2ERKNS_6Client19ClientConfigurationENS2_15AWSAuthV4Signer20PayloadSigningPolicyEbNS0_34US_EAST_1_REGIONAL_ENDPOINT_OPTIONE + 900
9 libtiledb.dylib 0x0000000150151178 _ZN6tiledb6common11make_sharedINS_2sm14TileDBS3ClientELi66EJRKNS2_12S3ParametersERN3Aws6Client19ClientConfigurationENS8_15AWSAuthV4Signer20PayloadSigningPolicyERKbEEENSt3__110shared_ptrIT_EERAT0__KcDpOT1_ + 92
10 libtiledb.dylib 0x000000015013b7b8 _ZNK6tiledb2sm2S311init_clientEv + 3840
11 libtiledb.dylib 0x0000000150144d8c _ZNK6tiledb2sm2S313ls_with_sizesERKNS0_3URIERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEi + 76
12 libtiledb.dylib 0x0000000150144cf4 _ZNK6tiledb2sm2S313ls_with_sizesERKNS0_3URIE + 44
13 libtiledb.dylib 0x000000015016dbf4 _ZNK6tiledb2sm3VFS13ls_with_sizesERKNS0_3URIE + 264
14 libtiledb.dylib 0x000000015002ce18 _ZNK6tiledb2sm14ArrayDirectory2lsERKNS0_3URIE + 60
15 libtiledb.dylib 0x000000015002ea54 _ZN6tiledb2sm14ArrayDirectory20load_array_meta_urisEv + 144
16 libtiledb.dylib 0x00000001500312dc _ZNSt3__120__packaged_task_funcIZN6tiledb6common10ThreadPool5asyncIZNS1_2sm14ArrayDirectory4loadEvE3$_3JEEEDaOT_DpOT0_EUlvE_NS_9allocatorISE_EEFNS2_6StatusEvEEclEv + 24
17 libtiledb.dylib 0x000000015086594c _ZNSt3__113packaged_taskIFN6tiledb6common6StatusEvEEclEv + 80
18 libtiledb.dylib 0x00000001508655a8 _ZN6tiledb6common10ThreadPool6workerEv + 76
19 libtiledb.dylib 0x0000000150866adc _ZNSt3__114__thread_proxyB8ue170006INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN6tiledb6common10ThreadPoolEFvvEPS9_EEEEEPvSE_ + 72
20 libsystem_pthread.dylib 0x0000000190c46f94 _pthread_start + 136
21 libsystem_pthread.dylib 0x0000000190c41d34 thread_start + 8
Abort trap: 6
on MacOS with pyarrow >= 13. Does not fail with pyarrow < 13. 100% reproducible, and immediate. The x86 and arm64 families are equally affected.
The above script passes after `pip install pyarrow==12` and fails after `pip install pyarrow==13`. This process can be repeated, reproducing on demand the appearance of the issue in pyarrow 13 and above.
Simply try importing `tiledb` before `pyarrow`. I'd tried this a few months ago & it didn't work -- I found at the time that adding `import tiledb` at the very top of `apis/python/src/tiledbsoma/__init__.py` did not suffice. However, I found today that also adding it at the top of `apis/python/tests/__init__.py` does allow unit tests to pass on my MacOS laptop.
That's the good news.
The bad news is that this is now segfaulting in CI: https://github.com/single-cell-data/TileDB-SOMA/pull/2734#issuecomment-2168452901
Current experiment: maybe the segfault has to do with today's `main`. So I've applied this PR's patch to a few branches:
- `main`
- `release-1.12`
- `release-1.11`
- `release-1.10`
I'd love to reproduce this segfault and get a stack trace, but sadly this command from the CI run
`python -m pytest --cov=apis/python/src --cov-report=xml apis/python/tests -v --durations=20`
does not segfault on my Mac. :(
@ihnorton and I believe that this is a problem with `pyarrow`:
ARROW_AWSSDK_BUILD_VERSION=1.10.55
and we'll file a ticket there.
Issue and/or context: #1849
See also: