single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
MIT License
80 stars 21 forks source link

[python] Try 3.12 again [WIP] #2692

Open johnkerl opened 1 month ago

johnkerl commented 1 month ago

Issue and/or context: #1849

See also:

johnkerl commented 1 month ago

Full CI manual run: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407459107

johnkerl commented 1 month ago

Pin-fail on commit 1: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407459107/job/25913248219

Trying commit 2: https://github.com/single-cell-data/TileDB-SOMA/pull/2692/commits https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407580430

Commit 3, with lint fix: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407659739

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.19%. Comparing base (6d6c90e) to head (3b2f36a).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2692 +/- ## ======================================= Coverage 90.19% 90.19% ======================================= Files 37 37 Lines 4019 4019 ======================================= Hits 3625 3625 Misses 394 394 ``` | [Flag](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2692/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | Coverage Δ | | |---|---|---| | [python](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2692/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `90.19% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data#carryforward-flags-in-the-pull-request-comment) to find out more. | [Components](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2692/components?src=pr&el=components&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | Coverage Δ | | |---|---|---| | [python_api](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2692/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `90.19% <ø> (ø)` | | | [libtiledbsoma](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2692/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `∅ <ø> (∅)` | |
johnkerl commented 1 month ago

Here's the problem:

See also

This PR full CI is red across the board for all Python versions on MacOS: https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9407659739

And as of today this still reproduces on my Mac:

$ pip list | grep arrow
pyarrow                       12.0.1
$ pip install pyarrow==14.0
Collecting pyarrow==14.0
...
Successfully installed pyarrow-14.0.0
johnkerl@exade[prod][HEAD][tiledb]$ python
Python 3.9.13 (v3.9.13:6de2ca5339, May 17 2022, 11:23:25)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import tiledb
>>> tiledb.open('s3://anything/at/all')
Fatal error condition occurred in /Users/runner/work/crossbow/crossbow/vcpkg/buildtrees/aws-c-common/src/v0.8.9-fed0b55d0f.clean/source/allocator.c:121: allocator != ((void*)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
1   libarrow.1400.dylib                 0x000000010faf4fe6 aws_fatal_assert + 70
2   libarrow.1400.dylib                 0x000000010faf3e57 aws_mem_acquire + 55
3   libarrow.1400.dylib                 0x000000010fb07392 aws_string_new_from_cursor + 50
4   libarrow.1400.dylib                 0x000000010fb01328 aws_json_value_get_from_object + 40
...

Same problem on pip install pyarrow==13. We need to go down to pyarrow==12 to not have this nasty crash.

And yet Python 3.12 doesn't have pyarrow 12 or below.

cc @ihnorton

ivirshup commented 1 month ago

Yeah, I am seeing this now too. It looks like it is specific to when pyarrow is imported before tiledb, so CI and tests ran fine.

johnkerl commented 1 month ago

Needs discussion on whether we:

None of these feels ideal.

ivirshup commented 1 month ago

Personally, I think the problems caused by upper bounding the allowed versions of arrow is more of an issue that python=3.12 support specifically. And the macOS support in python 3.12 is more of a symptom.

Is there an issue tracking this for tiledb? And is it python specific? It looks like tiledb and arrow are both trying to do something with AWS auth at the C level that's causing the issue.

ivirshup commented 1 month ago

I guess in tiledbsoma you have the option of making sure tiledb is imported before pyarrow on import tiledbsoma but you gotta hope people didn't import pyarrow first. I guess you could probably check for that though.

Also, if this behavior is good enough for tiledb-py, surely it is good enough for tiledbsoma?

Which ends up meaning, I'd go for tiledbsoma doing:

Say we do Python 3.12 but with a heavy caveat "watch your import ordering due to this issue"

And have tiledb/ tiledb-py take ownership of the issue and fix it.

johnkerl commented 1 month ago

And have tiledb/ tiledb-py take ownership of the issue and fix it.

We had extensive internal discussions a few months ago and determined it wasn't our bug.

I'll resuscitate those discussions freshly. cc @ihnorton .

If we were mistaken and it is a bug of ours, that will be a learning; if we're not, there may still be mitigation options we can take to work around the pyarrow issue.

johnkerl commented 1 month ago

Please see also:

and in particular:

cc @ivirshup @ihnorton

ivirshup commented 1 month ago

We had extensive internal discussions a few months ago and determined it wasn't our bug.

Ah, sorry, I wasn't making a value judgement about whose fault this is. It's more that import pyarrow, tiledb; tiledb.open("s3://...") is really a issue/ bug on the tiledb side and not the responsibility of tiledbsoma.

My suggested short term mitigation would be:

Has an issue been opened over on the arrow side? It seems like the underlying cause could be an issue with how they do some aws configuration on import, so maybe this could even be patched on their side?

johnkerl commented 1 month ago

Good news -- I found "one more import" (#2734) which helps ...

johnkerl commented 1 month ago

Some status:

Repro

#!/usr/bin/env python

import pyarrow
import tiledb
tiledb.open("s3://anything/at/all")

fails with

Fatal error condition occurred in /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/vcpkg/buildtrees/aws-c-common/src/v0.8.9-fed0b55d0f.clean/source/allocator.c:121: allocator != ((void*)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
1   libarrow.1400.dylib                 0x0000000106f90150 aws_fatal_assert + 80
2   libarrow.1400.dylib                 0x0000000106f8f528 aws_mem_acquire + 64
3   libarrow.1400.dylib                 0x0000000106fa1cc4 aws_string_new_from_cursor + 76
4   libarrow.1400.dylib                 0x0000000106f9bb94 aws_json_value_get_from_object + 44
5   libarrow.1400.dylib                 0x0000000106f8830c aws_endpoints_ruleset_new_from_string + 120
6   libarrow.1400.dylib                 0x0000000106f22430 _ZN3Aws3Crt9Endpoints10RuleEngineC2ERK15aws_byte_cursorS5_P13aws_allocator + 48
7   libarrow.1400.dylib                 0x00000001066cb820 _ZN3Aws8Endpoint23DefaultEndpointProviderINS_2S321S3ClientConfigurationENS2_8Endpoint19S3BuiltInParametersENS4_25S3ClientContextParametersEEC2EPKcm + 120
8   libtiledb.dylib                     0x0000000150982484 _ZN3Aws2S38S3ClientC2ERKNS_6Client19ClientConfigurationENS2_15AWSAuthV4Signer20PayloadSigningPolicyEbNS0_34US_EAST_1_REGIONAL_ENDPOINT_OPTIONE + 900
9   libtiledb.dylib                     0x0000000150151178 _ZN6tiledb6common11make_sharedINS_2sm14TileDBS3ClientELi66EJRKNS2_12S3ParametersERN3Aws6Client19ClientConfigurationENS8_15AWSAuthV4Signer20PayloadSigningPolicyERKbEEENSt3__110shared_ptrIT_EERAT0__KcDpOT1_ + 92
10  libtiledb.dylib                     0x000000015013b7b8 _ZNK6tiledb2sm2S311init_clientEv + 3840
11  libtiledb.dylib                     0x0000000150144d8c _ZNK6tiledb2sm2S313ls_with_sizesERKNS0_3URIERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEi + 76
12  libtiledb.dylib                     0x0000000150144cf4 _ZNK6tiledb2sm2S313ls_with_sizesERKNS0_3URIE + 44
13  libtiledb.dylib                     0x000000015016dbf4 _ZNK6tiledb2sm3VFS13ls_with_sizesERKNS0_3URIE + 264
14  libtiledb.dylib                     0x000000015002ce18 _ZNK6tiledb2sm14ArrayDirectory2lsERKNS0_3URIE + 60
15  libtiledb.dylib                     0x000000015002ea54 _ZN6tiledb2sm14ArrayDirectory20load_array_meta_urisEv + 144
16  libtiledb.dylib                     0x00000001500312dc _ZNSt3__120__packaged_task_funcIZN6tiledb6common10ThreadPool5asyncIZNS1_2sm14ArrayDirectory4loadEvE3$_3JEEEDaOT_DpOT0_EUlvE_NS_9allocatorISE_EEFNS2_6StatusEvEEclEv + 24
17  libtiledb.dylib                     0x000000015086594c _ZNSt3__113packaged_taskIFN6tiledb6common6StatusEvEEclEv + 80
18  libtiledb.dylib                     0x00000001508655a8 _ZN6tiledb6common10ThreadPool6workerEv + 76
19  libtiledb.dylib                     0x0000000150866adc _ZNSt3__114__thread_proxyB8ue170006INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN6tiledb6common10ThreadPoolEFvvEPS9_EEEEEPvSE_ + 72
20  libsystem_pthread.dylib             0x0000000190c46f94 _pthread_start + 136
21  libsystem_pthread.dylib             0x0000000190c41d34 thread_start + 8
Abort trap: 6

on MacOS with pyarrow >= 13. Does not fail with pyarrow < 13. 100% reproducible, and immediate. The x86 and arm64 families are equally affected.

The above script passes after pip install pyarrow==12 and fails after pip install pyarrow==13. This process can be repeated, reproducing on demand the appearance of the issue in pyarrow 13 and above.

Workaround on this PR

Simply try importing tiledb before pyarrow. I'd tried this before a few months ago & it didn't work -- I found at the time adding import tiledb at the very top of apis/python/src/tiledbsoma/__init__.py did not suffice. However I found today that also adding it at the top of apis/python/tests/__init__.py does allow unit tests to pass on my MacOS laptop.

That's the good news.

The bad news is that this is now segfaulting in CI: https://github.com/single-cell-data/TileDB-SOMA/pull/2734#issuecomment-2168452901

Current experiment: maybe the segfault has to do with today's main. So I've applied this PR's patch to a few branches:

I'd love to reproduce this segfault and get a stack trace but sadly this command from the CI run

python -m pytest --cov=apis/python/src --cov-report=xml apis/python/tests -v --durations=20

does not segfault on my Mac. :(

Upstream

@ihnorton and I believe that this is a problem with pyarrow:

https://github.com/apache/arrow/blob/fe4d04f081e55ca2de7b1b67b10ad7dca96cfd9e/cpp/thirdparty/versions.txt#L54

ARROW_AWSSDK_BUILD_VERSION=1.10.55

and we'll file a ticket there.