maelstrom-software / maelstrom

Maelstrom is a fast Rust, Go, and Python test runner that runs every test in its own container. Tests are either run locally or distributed to a clustered job runner.
https://maelstrom-software.com/
Apache License 2.0

Worker starts returning errors and needs to be restarted #394

Closed. nfachan closed this 1 week ago.

nfachan commented 2 weeks ago

Specific Maelstrom Program? maelstrom-worker

Bug Description Eventually the worker starts logging errors like this:

Aug 28 02:08:20.536 ERRO Got error servicing FUSE request. Returning EIO, error: FileMetadataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/file.rs:47:13

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/bottom_fs_layer/sha256/10e0d3501394f3686f9b70446bbe0825cf38ccad693673b0ceed282ebfbf53e7/attributes_table.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "dispatcher::tests::jobs_are_executed_in_lpt_order"], program: "/maelstrom_worker-0ed3e382bb7cf525", jid: JobId { cid: ClientId(37), cjid: ClientJobId(0) }
Aug 28 02:08:20.536 ERRO Got error servicing FUSE request. Returning EIO, error: FileMetadataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/file.rs:47:13

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/bottom_fs_layer/sha256/10e0d3501394f3686f9b70446bbe0825cf38ccad693673b0ceed282ebfbf53e7/attributes_table.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "dispatcher::tests::jobs_are_executed_in_priority_then_lpt_order"], program: "/maelstrom_worker-0ed3e382bb7cf525", jid: JobId { cid: ClientId(37), cjid: ClientJobId(1) }
Aug 28 02:08:30.467 ERRO Got error servicing FUSE request. Returning EIO, error: FileMetadataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/file.rs:47:13

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/bottom_fs_layer/sha256/10e0d3501394f3686f9b70446bbe0825cf38ccad693673b0ceed282ebfbf53e7/attributes_table.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "dispatcher::tests::jobs_are_executed_in_priority_then_lpt_order"], program: "/maelstrom_worker-0ed3e382bb7cf525", jid: JobId { cid: ClientId(38), cjid: ClientJobId(0) }
Aug 28 02:08:30.467 ERRO Got error servicing FUSE request. Returning EIO, error: FileMetadataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/file.rs:47:13

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/bottom_fs_layer/sha256/10e0d3501394f3686f9b70446bbe0825cf38ccad693673b0ceed282ebfbf53e7/attributes_table.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "dispatcher::tests::jobs_are_executed_in_lpt_order"], program: "/maelstrom_worker-0ed3e382bb7cf525", jid: JobId { cid: ClientId(38), cjid: ClientJobId(1) }
^CAug 28 02:09:03.003 ERRO received SIGINT
Aug 28 02:09:03.005 ERRO shutting down due to signal SIGINT
Aug 28 02:09:03.005 INFO canceling 3 running jobs

It will continue failing every job until it is restarted.
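For context, os error 24 is EMFILE: the worker process has exhausted its RLIMIT_NOFILE soft limit, so every subsequent open(2), including the ones LayerFS needs to service FUSE requests, fails until descriptors are freed or the process restarts. One way to confirm from outside is to watch `ls /proc/$(pidof maelstrom-worker)/fd | wc -l` climb toward the limit. Below is a minimal in-process check, assuming the `libc` crate; this helper is illustrative, not part of Maelstrom:

```rust
use std::fs;
use std::io;

/// Illustrative diagnostic: how many descriptors this process has open
/// versus its soft RLIMIT_NOFILE. Logging this periodically would show
/// whether the worker is leaking fds or just undersized for its load.
/// (The count includes the handle used to read the directory itself.)
fn fd_pressure() -> io::Result<(usize, u64)> {
    let open = fs::read_dir("/proc/self/fd")?.count();
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: getrlimit only writes into the struct we pass it.
    if unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((open, lim.rlim_cur as u64))
}
```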

How to Reproduce I don't know how to reproduce it quickly. I just need to use the worker long enough.
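If fd exhaustion is the trigger, it can probably be provoked much faster by starting the worker with an artificially low descriptor limit; the same effect is available from a shell with `ulimit -n 128` before launching it. A hypothetical wrapper sketch (the limit value and the approach are assumptions, untested):

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Hypothetical repro helper: drop RLIMIT_NOFILE, then exec the worker
/// so it hits EMFILE after a handful of jobs instead of hours of use.
fn main() -> io::Result<()> {
    let lim = libc::rlimit { rlim_cur: 128, rlim_max: 128 };
    // SAFETY: setrlimit only reads the struct we pass it.
    if unsafe { libc::setrlimit(libc::RLIMIT_NOFILE, &lim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    // exec only returns on failure; the worker inherits the new limit.
    Err(Command::new("maelstrom-worker").exec())
}
```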

Expected Behavior This doesn't happen! :-)

There are all sorts of things the worker could do to handle this more gracefully; one recovery option is sketched below.
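As a sketch of that recovery option: the worker could classify EMFILE/ENFILE specially and retry once after releasing descriptors, instead of returning EIO for every FUSE request from then on. The names below are assumptions; none of this exists in the codebase:

```rust
use std::io;

/// Does this error mean the process (EMFILE) or the system (ENFILE) is
/// out of file descriptors?
fn is_fd_exhaustion(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(libc::EMFILE) | Some(libc::ENFILE))
}

/// Retry an fd-hungry operation once after the caller frees descriptors
/// (e.g. by dropping cached readers; `free_fds` is a hypothetical hook).
fn with_fd_retry<T>(
    mut free_fds: impl FnMut(),
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    match op() {
        Err(ref err) if is_fd_exhaustion(err) => {
            free_fds();
            op()
        }
        other => other,
    }
}
```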

nfachan commented 2 weeks ago

Here is another example:

Aug 29 21:28:44.809 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "tests::waiting_for_artifacts"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(105) }
Aug 29 21:28:44.809 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "tests::running"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(123) }
Aug 29 21:28:44.810 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "tests::loop_three_times"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(8) }
Aug 29 21:28:44.810 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "tests::expected_count_updates_packages"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(14) }
Aug 29 21:28:44.810 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "tests::expected_count_updates_cases"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(55) }
Aug 29 21:28:44.810 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "tests::stop_after_1_with_estimate"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(53) }
Aug 29 21:28:44.810 ERRO Got error servicing FUSE request. Returning EIO, error: DirectoryDataReader::new at /home/neal/maelstrom/crates/maelstrom-layer-fs/src/dir.rs:32:20

Caused by:
    0: open("/home/neal/.cache/maelstrom/worker/artifacts/upper_fs_layer/sha256/b7a11f6ef2f83c22cb1282262fbbe7f070210ba8ea4658c292179fcd3fc23258/65.dir_data.bin")
    1: Too many open files (os error 24), args: ["--exact", "--nocapture", "test_listing::tests::save_of_listing"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(22) }
Aug 29 21:28:44.978 ERRO Failed to get pipe memory, cannot splice, error: Os { code: 24, kind: Uncategorized, message: "Too many open files" }, args: ["--exact", "--nocapture", "metadata::directive::tests::layers_after_image_with_layers"], program: "/maelstrom_test_runner-a3e2391ddb15c556", jid: JobId { cid: ClientId(44), cjid: ClientJobId(16) }
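Note that the last error is a different code path failing for the same reason: splice(2) needs a pipe, and creating a pipe costs two descriptors, so under exhaustion even the job output plumbing breaks. One possible degraded path, assuming the call site has plain Read/Write handles available, is to fall back to a buffered copy that allocates no new descriptors (essentially what `std::io::copy` does):

```rust
use std::io::{self, Read, Write};

/// Fallback copy for when pipe creation fails with EMFILE: slower than
/// splice, but needs no additional file descriptors.
fn copy_no_extra_fds<R: Read, W: Write>(mut from: R, mut to: W) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = from.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        to.write_all(&buf[..n])?;
        total += n as u64;
    }
}
```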