voltrondata-labs / arrow-benchmarks-ci

Benchmarks CI for Apache Arrow project
MIT License

`dataset-serialize` benchmark is failing with segfaults #166

Closed: austin3dickey closed this issue 11 months ago

austin3dickey commented 11 months ago

More details to come.

austin3dickey commented 11 months ago

The only machine that runs this is ursa-i9-9960x. Here is a build link.

231022-13:22:15.225 INFO: Initializing adapter
231022-13:22:15.255 INFO: source nyctaxi_multi_parquet_s3: download, if required
231022-13:22:15.263 INFO: constructed Dataset object for source in 0.0066 s
231022-13:22:15.263 INFO: case ('1pc', 'parquet'): create directory
231022-13:22:15.263 INFO: directory created, path: /dev/shm/bench-cd80377a/1pc-parquet-eed70f79-5ee5-4156-b932-79bdf3b754d0
231022-13:22:15.263 INFO: read 561000 rows of dataset nyctaxi_multi_parquet_s3 into memory
231022-13:22:15.528 INFO: read source dataset into memory in 0.2645 s
231022-13:22:20.250 INFO: try to perform login
231022-13:22:20.250 INFO: try: POST to https://conbench.ursa.dev/api/login/
231022-13:22:20.536 INFO: POST request to https://conbench.ursa.dev/api/login/: took 0.2858 s, response status code: 204
231022-13:22:20.536 INFO: ConbenchClient: initialized
231022-13:22:20.536 INFO: try: POST to https://conbench.ursa.dev/api/benchmark-results/
231022-13:22:20.607 INFO: POST request to https://conbench.ursa.dev/api/benchmark-results/: took 0.0712 s, response status code: 201
231022-13:22:20.671 INFO: stdout of ['du', '-sh', '/dev/shm/bench-cd80377a/1pc-parquet-eed70f79-5ee5-4156-b932-79bdf3b754d0']: 20M
231022-13:22:20.672 INFO: removing directory: /dev/shm/bench-cd80377a/1pc-parquet-eed70f79-5ee5-4156-b932-79bdf3b754d0
231022-13:22:20.674 INFO: case ('1pc', 'arrow'): create directory
231022-13:22:20.674 INFO: directory created, path: /dev/shm/bench-cd80377a/1pc-arrow-b8bced17-2cee-4bca-83d4-534d23e2f468
231022-13:22:20.674 INFO: read 561000 rows of dataset nyctaxi_multi_parquet_s3 into memory
231022-13:22:20.814 INFO: read source dataset into memory in 0.1400 s
Fatal Python error: Segmentation fault

Interestingly, sometimes one or two cases succeed before the segfault, and sometimes none of them do.
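To compare builds quickly, a log like the one above can be summarized with a few lines of Python. This is only a rough sketch: it assumes that each POST to `/api/benchmark-results/` answered with status 201 corresponds to one successful case, and the helper name `summarize_log` is made up for illustration.

```python
import re

# Patterns taken from the log excerpt above; other builds may log differently.
CASE_RE = re.compile(r"INFO: case (\(.*?\)): create directory")
POSTED_RE = re.compile(
    r"POST request to \S+/api/benchmark-results/: .* response status code: 201"
)
SEGFAULT = "Fatal Python error: Segmentation fault"


def summarize_log(lines):
    """Return (results_posted, last_case_started, crashed) for one build log."""
    results_posted = 0
    last_case = None
    crashed = False
    for line in lines:
        match = CASE_RE.search(line)
        if match:
            last_case = match.group(1)  # remember the most recent case
        elif POSTED_RE.search(line):
            results_posted += 1  # assumption: one 201 POST per successful case
        elif SEGFAULT in line:
            crashed = True
            break
    return results_posted, last_case, crashed


# A shortened copy of the log lines shown above.
log = """\
231022-13:22:15.263 INFO: case ('1pc', 'parquet'): create directory
231022-13:22:20.607 INFO: POST request to https://conbench.ursa.dev/api/benchmark-results/: took 0.0712 s, response status code: 201
231022-13:22:20.674 INFO: case ('1pc', 'arrow'): create directory
231022-13:22:20.814 INFO: read source dataset into memory in 0.1400 s
Fatal Python error: Segmentation fault
""".splitlines()

print(summarize_log(log))
```

For this excerpt it reports one posted result, with the crash happening while the ('1pc', 'arrow') case was being set up.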

austin3dickey commented 11 months ago

Here's a breakdown of the number of successful dataset-serialize results per run this month:

       run_timestamp        |              run_id              | num_results
----------------------------+----------------------------------+-------------
 2023-10-01 20:47:05.27222  | 1653cbab792a4905950da7a357c27aab |          24
 2023-10-01 23:41:55.081723 | 1e2c0d5208784aa9a98e28fd387a8c67 |          24
 2023-10-02 02:37:15.753681 | 08be4f7cab094940b1c8c31ca9e902c4 |          24
 2023-10-02 05:34:42.301889 | 497dc6271cc541abbd2adec51695259d |          24
 2023-10-02 15:48:42.227189 | 1077a66e57a74edfbea848d528262f86 |          24
 2023-10-03 08:28:15.839898 | 6ca857816880414aa3ab96e2daf0860d |          24
 2023-10-03 14:49:27.261457 | c99fb3bbd61d429d9af8510254f6a8e1 |          24
 2023-10-03 20:20:28.55793  | a1e3d4d07c28450e88e3eba64f420707 |          24
 2023-10-03 23:11:30.643553 | d6530cb0f1cf41b8b1874677ef3ab37a |          24
 2023-10-04 11:00:46.131435 | 3564dbe69233453f8970fd6127ca222b |          24
 2023-10-04 16:00:31.168945 | 7deb05ad67484f16bd92545e79d91f90 |          24
 2023-10-05 08:23:45.134169 | aa5c53940d2942bbad57dad7eafc7e7f |          24
 2023-10-05 11:21:23.874677 | 980a34c6cdb7424189e0de6ca2924057 |          24
 2023-10-05 14:15:26.585407 | 7131e14847454a49a5bbd2cb428e3e67 |          24
 2023-10-05 17:28:01.860131 | 9fe656d750ae4a5c8d8236eed69d7f2b |          24
 2023-10-05 20:19:41.701115 | 8a68c0b0b8ce41db8631e8e236af9235 |          24
 2023-10-05 23:26:28.99192  | f4d6b6343f7a436ba3894f871a524b0c |          24
 2023-10-06 02:22:16.671595 | 8016d5c3fc3e413ca0e8ab7f7bb52e1c |          24
 2023-10-06 05:28:59.872214 | 141651da421049f3b1669d7a3f5a4d88 |          24
 2023-10-06 08:24:59.470588 | 69103244469d4b01bbe493f39194a3fb |          24
 2023-10-06 11:20:45.434535 | c462c49764b5442fa3ac1af671f88b08 |          24
 2023-10-06 14:29:45.198447 | a2af089b4fa1421cbed1d59b18694a87 |          24
 2023-10-06 17:24:26.104814 | 200947668e5c41a79243aaf31ab42380 |          24
 2023-10-06 20:07:02.596901 | 5627438887054b109842daf9287b4332 |           0
 2023-10-06 22:50:22.449426 | 0ecb0e4f83d0406593087e08c0b61fb3 |           1
 2023-10-07 02:57:22.516069 | f6ccd1c4b5fb41c396811b76eef3902d |           0
 2023-10-07 04:32:45.674441 | 36e00707836945e188083a5ebc84f3f6 |           1
 2023-10-07 23:33:45.35763  | cb02ebcfc6d54f649db4dbae0f86c442 |           2
 2023-10-08 22:09:50.399974 | 58f58379c6c545a0b8d4a82e7d894626 |           1
 2023-10-10 01:12:00.395274 | e79336051c534dd0be81d1280d462968 |           1
 2023-10-10 04:07:00.964479 | 6bb312c6084840ef97269ae527a759d1 |           0
 2023-10-10 06:09:36.389056 | 309cc226fc3d4322baaf69faff63c956 |           1
 2023-10-10 08:55:24.822424 | a91a6a2060c545f8b13ad0fe2e082475 |           0
 2023-10-10 11:04:34.424189 | b0b0939418b14986a79b32547489c6a1 |           0
 2023-10-10 13:31:12.50375  | f16770c6bc95419ba5011d10ea3a974d |           0
 2023-10-10 16:07:28.651402 | 870f5c2b5b7c427fbc8a07c8708c2434 |           1
 2023-10-10 18:33:16.425559 | 7fe710688b4d47f6a3932bfab9c599f7 |           1
 2023-10-10 23:12:45.442182 | adf798f0306b44ca902b24e95182072b |           0
 2023-10-11 01:06:54.330537 | e053efe4fccb48d8923ca49f22aa3928 |           0
 2023-10-11 02:51:32.430454 | c985261e64e441fbb2d8da74ee0a271e |           1
 2023-10-11 05:23:57.397817 | 3cedb97c4d0e4ca48c392562d7822c71 |           0
 2023-10-11 07:53:29.72254  | 517c2fb5d61645c0937a4a80e4c02ab1 |           1
 2023-10-11 10:22:12.897347 | 0a40322ac5ad4cc69b1e2da555cc9884 |           1
 2023-10-11 13:45:02.667613 | 05f2a40424984d62808b6600a7be60e8 |           0
 2023-10-11 15:23:16.871397 | 2dcd6824d73c4e238bbc36a77578eafc |           2
 2023-10-11 17:51:40.61575  | 94b8fa73017b4526bb66c179a67c97f5 |           1
 2023-10-11 20:21:02.843915 | 4a441cef677240cfbc593b858d5e6fc5 |           1
 2023-10-11 22:52:02.991529 | da5bc2af4a9e4e849115de5f872e1a43 |           2
 2023-10-12 01:23:48.596752 | 4912265bd177431e9ccab6a865979bc2 |           0
 2023-10-12 04:56:52.251901 | db4cf82bc3e1491ba1500984e727eb02 |           0
 2023-10-12 06:17:58.084406 | 2fd42faf48404478affd58f9bcfbb26e |           1
 2023-10-12 08:52:12.650021 | bb99d7aba6ce4bd79c1b9bbf0ace732f |           2
 2023-10-12 11:23:30.072888 | 8538c42b61e24b8f897eecba8ab51084 |           2
 2023-10-12 14:35:46.545025 | 461fd9038239478199c4ce554783b107 |           0
 2023-10-12 17:12:39.307677 | e09f15cd85d8455f80562dbb91f37389 |           0
 2023-10-12 18:51:16.012983 | 2f6706c923c144ae89eddf11447734bc |           0
 2023-10-12 21:24:05.360257 | 4b879813b1fd466cac3bd9a42b5f0eff |           0
 2023-10-12 23:52:24.865696 | 810555452bb7453eb3637d74cdac4f05 |           1
 2023-10-13 02:24:02.374777 | 6b2f3f65be194fa8aac8c854d4491958 |           0
 2023-10-13 04:53:12.668694 | e9c7b12f180944fd9339d1acc89f27dc |           1
 2023-10-13 07:19:54.560573 | 85c7fdd868d04e6aa61c898cf5a9f3a5 |           1
 2023-10-13 10:33:45.433656 | d8e1f73c854e4d2fa7dcc67f9b28dc53 |           0
 2023-10-13 13:22:57.123444 | e6c83373db21434f940879904d46fbfc |           0
 2023-10-13 15:01:25.231645 | ab8fb74aed7741aba3d2ea6633c060d2 |           1
 2023-10-13 18:30:03.608629 | 0122c1ead02d439f97e113243f87d337 |           0
 2023-10-13 21:46:14.986946 | 7d324417a2f14070bb6a05ce184ae250 |           0
 2023-10-14 00:40:17.52854  | 5f5c93257e234f21a40a6d0f875a9cc9 |           0
 2023-10-14 02:10:36.812122 | 8b67a831fb3947c599bc638ba6b2269a |           2
 2023-10-16 10:04:47.249041 | 5635a683e0b94470bcf0b3ce35ec1f9b |           1
 2023-10-16 12:52:22.633604 | 6b2c2f77b61a4a39a6b33f834edd6b5d |           1
 2023-10-16 17:22:58.089709 | f4fc993a105944aeabe04a56d6fc3a9e |           0
 2023-10-16 18:09:30.601348 | 75f834400473428a87ff70013ccc684f |           0
 2023-10-17 01:42:14.372681 | c5d83122840d4afc827dac61c8f7df22 |           1
 2023-10-17 05:01:10.351914 | f3a18df396fa485ba6cf49231fae70fd |           0
 2023-10-17 08:44:35.353271 | fa505501b1284dcd9ce7a347decbff54 |           0
 2023-10-17 16:31:06.679952 | 56a088bf01764184bf34d59c108ee4e0 |           1
 2023-10-17 20:40:37.481827 | fc2b2170fda04f568c7e5c9a47e7de95 |           0
 2023-10-18 09:35:43.28765  | b57227ca04d64bcaa63b3a311b6f6743 |           0
 2023-10-18 11:12:48.319844 | 7f6271767bf741bf91ea38ae207b7cea |           0
 2023-10-18 14:57:54.460282 | 0ff935e33bc04e8e9d833f99187b8a72 |           0
 2023-10-18 16:10:27.399072 | d379f727daef47d0b14f7647e3ef089b |           1
 2023-10-19 02:44:17.844343 | 03302b3b966049a6ace506ccdb307395 |           1
 2023-10-19 11:32:19.094086 | 8660f5699ed84c58a5b316432271d9a0 |           0
 2023-10-19 13:53:01.875819 | 04f8720de29146db9a50344477afe4cc |           1
 2023-10-19 17:35:08.017099 | 6b5cc6e3f4ac4b45bca66184ee8487ca |           1
 2023-10-19 20:03:41.026884 | aeca00a62f38400baa34aad1782ffd5d |           3
 2023-10-19 22:33:13.40542  | ea9f57e2e9444fc8ab74ca163a2ef4d6 |           1
 2023-10-20 09:51:30.247996 | b97ffbffbe28497d9cda75b124bfb704 |           1
 2023-10-22 18:22:19.403434 | 779a94ec29b649c29c5e4d1be968d6e0 |           1
 2023-10-23 15:02:29.371518 | 9acfb5dd28cc48bcb5bcf57d6ba0cdf7 |           0

austin3dickey commented 11 months ago

The first run without 24 results was this one: https://conbench.ursa.dev/runs/5627438887054b109842daf9287b4332/

That run was on commit https://github.com/apache/arrow/commit/d7017dd0dc567969c79d14aefc3d5a638e66270a, whose message is "GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files (#37854)". Interesting!
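For reference, the first incomplete run can be picked out of the per-run counts mechanically. The sketch below uses four rows copied from the table earlier in this thread; the function name and the hard-coded expectation of 24 results per run are assumptions for illustration.

```python
EXPECTED_RESULTS = 24  # a healthy run reports 24 dataset-serialize results

# (run_timestamp, run_id, num_results) rows copied from the table above.
rows = [
    ("2023-10-06 14:29:45.198447", "a2af089b4fa1421cbed1d59b18694a87", 24),
    ("2023-10-06 17:24:26.104814", "200947668e5c41a79243aaf31ab42380", 24),
    ("2023-10-06 20:07:02.596901", "5627438887054b109842daf9287b4332", 0),
    ("2023-10-06 22:50:22.449426", "0ecb0e4f83d0406593087e08c0b61fb3", 1),
]


def first_incomplete_run(rows, expected=EXPECTED_RESULTS):
    """Return (timestamp, run_id) of the earliest run with too few results."""
    # Timestamps in this format sort chronologically as plain strings.
    for timestamp, run_id, num_results in sorted(rows):
        if num_results < expected:
            return timestamp, run_id
    return None


print(first_incomplete_run(rows))
```

On these rows it returns the 2023-10-06 20:07 run, which matches the run linked above.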

austin3dickey commented 11 months ago

@jorisvandenbossche It looks like the dataset-serialize benchmark started segfaulting after https://github.com/apache/arrow/pull/37854 was merged. Do you think we'll need to make changes to how the benchmark is run or is there something that needs to be fixed on the Arrow side?

austin3dickey commented 11 months ago

We could probably just set pre_buffer=False. Or someone could research whether there's a way to consistently avoid the segfault (which I'm assuming is memory-related? not quite sure) even with pre_buffer=True.

I think it depends on what the Arrow community wants to actually be measuring here. For instance, it may not make sense to compare the benchmark timings measured with and without pre_buffer.
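As a rough illustration of the first option, pre-buffering can be turned off explicitly when a Parquet dataset is opened with pyarrow. This is a minimal sketch, not the benchmark's actual code: the helper name `open_parquet_dataset` is hypothetical, and the benchmark may configure its reads differently.

```python
def open_parquet_dataset(path, pre_buffer=False):
    """Open a Parquet dataset with an explicit pre_buffer setting."""
    # Imported lazily so this snippet can be loaded without pyarrow installed.
    import pyarrow.dataset as ds

    # pre_buffer is a Parquet-specific scan option; passing it explicitly
    # avoids depending on whatever the library default happens to be.
    scan_options = ds.ParquetFragmentScanOptions(pre_buffer=pre_buffer)
    parquet_format = ds.ParquetFileFormat(
        default_fragment_scan_options=scan_options
    )
    return ds.dataset(path, format=parquet_format)


# Usage (the path is a placeholder):
# table = open_parquet_dataset("/path/to/parquet/files").to_table()
```

Pinning the option in the benchmark setup would also keep the read path the same across Arrow versions, whichever default is in effect.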

jorisvandenbossche commented 11 months ago

> We could probably just set pre_buffer=False

We could do that short-term to get the benchmark working again. If I am reading it correctly, the benchmark is actually about writing, so the segfault happens during setup, and changing this option won't affect what the benchmark measures.

(although it is a bit strange that it still logs the timing info after reading)

But the change that started this (pre_buffer default change) should not cause a segfault. If that is happening, that's a critical bug, and something we should still try to reproduce outside of the benchmarks.

We did have some crashes on the main Arrow CI as well after merging that PR, but those were fixed with https://github.com/apache/arrow/pull/38073

austin3dickey commented 11 months ago

Okay, I opened https://github.com/apache/arrow/issues/38438. I'll try to see if using pre_buffer=False fixes the problem.

austin3dickey commented 11 months ago

I was able to avoid the segfault locally by setting pre_buffer=False in https://github.com/voltrondata-labs/benchmarks/pull/152. Once I merge that, this issue can be closed.

Like you said though, https://github.com/apache/arrow/issues/38438 seems like a critical bug.

austin3dickey commented 11 months ago

That band-aid worked: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/3689#018b64a3-2deb-4824-96ee-3b13c7c67261/6-24113

PASSED Python dataset-serialize 0:27:17.999837

jorisvandenbossche commented 11 months ago

Thanks for opening the issue! Will try to further look into that tomorrow.

mapleFU commented 11 months ago

I've tried to fix it here: https://github.com/apache/arrow/pull/38466

I'm not sure this really fixes the bug, but you can give it a try.