phac-nml / cladeomatic

SNP population structure detection
Apache License 2.0

Raylet failure #7

Open jhawkey opened 1 year ago

jhawkey commented 1 year ago

Hi, I'm attempting to run `cladeomatic create` on the small test data provided, but it fails with a raylet error.

This is the end of what is printed to stdout:


```
2023-08-03 09:43:28,391 INFO: Performing canonical SNP detection [in /scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/cladeomatic/create.py:1038]
(raylet) [2023-08-03 09:43:29,587 E 99423 99479] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
Traceback (most recent call last):
  File "/scratch/js66/jane/conda_envs/cladeomatic/bin/cladeomatic", line 10, in <module>
    sys.exit(main())
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/cladeomatic/main.py", line 46, in main
    exec('cladeomatic.' + task + '.run()')
  File "<string>", line 1, in <module>
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/cladeomatic/create.py", line 1039, in run
    cw = clade_worker(filtered_vcf, metadata , distance_matrix_file, group_data, ref_seq[ref_seq_id], mode,perform_compression=perform_compression,delim=delim,
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/cladeomatic/clades.py", line 90, in __init__
    self.workflow()
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/cladeomatic/clades.py", line 102, in workflow
    self.snp_data = snp_search_controller(self.group_data, self.vcf_file, self.num_threads)
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/cladeomatic/snps.py", line 90, in snp_search_controller
    results = ray.get(result_ids)
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/scratch/js66/jane/conda_envs/cladeomatic/lib/python3.9/site-packages/ray/_private/worker.py", line 2495, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
```

I'm running it with 16 GB of memory available, so I don't think that is the issue.

This is the output of the `pip freeze` command that the raylet message suggests for checking the `grpcio` version. I'm not sure what the version should be, but I'm including the output in case it's helpful.

```
$ pip freeze | grep grpcio
grpcio @ file:///home/conda/feedstock_root/build_artifacts/grpc-split_1675287624183/work
```
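
Because the `pip freeze` output above shows a conda build path rather than a version number, the versions could also be read directly from the package metadata. This is only a minimal sketch, assuming Python 3.8+ (the environment here is 3.9), where `importlib.metadata` is part of the standard library; `installed_version` is an illustrative helper name, not something from cladeomatic or Ray.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the two packages named in the raylet error message.
    for name in ("ray", "grpcio"):
        print(name, installed_version(name) or "not installed")
```

Run inside the same conda environment as cladeomatic, this should print one line per package, which can then be compared against the `grpcio` requirement of the installed Ray release.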

I installed via conda.

Any suggestions? Thanks!