vanheeringen-lab / ANANSE

Prediction of key transcription factors in cell fate determination using enhancer networks. See full ANANSE documentation for detailed installation instructions and usage examples.
http://anansepy.readthedocs.io
MIT License

ananse network Out Of Memory (OOM) error and killed #96

Open apposada opened 3 years ago

apposada commented 3 years ago

Hi,

I am trying to run ananse (pip virtual environment; last updated yesterday) on a 64-thread / 110 GB RAM machine. While ananse binding runs fine, ananse network progresses to ~95% of the network construction before it is killed by the system due to an out-of-memory error.

The command I am running is:

(ananse_pip_venv) [aperpos@nodos ananse]$ ananse network -n 12 -b 20210615_EB.binding/binding.tsv -e rna/02_EB_02.tsv -g genome/genome.fa -a annot/annot.bed12.bed -o 20210615_EB.network
2021-06-16 17:08:54 | INFO | Loading expression
2021-06-16 17:08:54 | INFO | creating expression dataframe
2021-06-16 17:08:55 | INFO | Aggregate binding
2021-06-16 17:08:55 | INFO | reading enhancers
2021-06-16 17:10:30 | INFO | Reading binding file...
2021-06-16 17:10:31 | INFO | Grouping by tf and target gene...
2021-06-16 17:10:31 | INFO | Done grouping...
2021-06-16 17:10:31 | INFO | Reading factor activity
2021-06-16 17:10:31 | INFO | Computing network
[######################################  ] | 96% Completed | 24min  5.2s

That progress bar stays at 96% with a fixed, unchanging time for a while, until the process is killed.

RAM usage remains at roughly half of the total available (~50-60 GB) for a while, then slowly increases until the process is killed.

Before the process is killed, a number of child processes are spawned that remain in a sleeping state, and their number increases over time. Tracing their system calls returns the following (excerpt below):

Process 28879 attached
restart_syscall(<... resuming interrupted call ...>) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f5a39f5d584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 39066281, {1623858473, 51401000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f5a39f5d540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f5a39f5d584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 39066317, {1623858473, 56565000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f5a39f5d540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f5a39f5d584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 39066379, {1623858473, 61670000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f5a39f5d584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 39066383, {1623858473, 61670000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

After the process is killed, running dmesg and grepping for 'ananse' shows the following:

(ananse_pip_venv) [aperpos@nodos ananse]$ dmesg | grep ananse
[10723591.785446] [ 5804]  1434  5804 21963955 18508977   37336        0             0 ananse
[10723591.785476] Out of memory: Kill process 5804 (ananse) score 621 or sacrifice child
[10723591.794398] Killed process 5804 (ananse) total-vm:87855820kB, anon-rss:74035908kB, file-rss:0kB, shmem-rss:0kB

If I understand correctly, ananse allocates 87855820 kB (≈83 GB) of memory, which leaves the system with less than half of its memory available. Is this expected behavior? Is it possible for the user to set a limit, or can this behavior be modified?

Really looking forward to using this tool!

All the best, Alberto

Maarten-vd-Sande commented 3 years ago

Hi Alberto,

Thanks for the issue and the interest! Unfortunately, ANANSE is extremely memory hungry :disappointed:, mainly due to our naive design, but also because of the way Python handles threads/processes. Python code is (often) parallelized by giving each worker process a complete copy of what is in memory; no memory is shared. This is exactly how it is implemented in ANANSE as well, which means that the more cores you use, the higher the memory usage.
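To illustrate the point with a generic sketch (plain Python using the spawn start method, not ANANSE's actual code): each worker re-imports the module and therefore builds its own copy of the large array, so resident memory grows roughly linearly with the number of workers.

# A minimal sketch (not ANANSE code) of why per-process parallelism multiplies memory use.
import numpy as np
from multiprocessing import get_context

big_table = np.random.rand(50_000, 1_000)  # ~400 MB; stands in for a large binding table

def column_sum(i):
    # Each worker reads from its *own* copy of big_table.
    return float(big_table[:, i].sum())

if __name__ == "__main__":
    # 4 workers -> roughly 4 extra copies of big_table resident at the same time.
    with get_context("spawn").Pool(processes=4) as pool:
        sums = pool.map(column_sum, range(big_table.shape[1]))
    print(f"computed {len(sums)} column sums")

With the default fork start method on Linux the data pages start out shared (copy-on-write), so the overhead builds up more gradually, but workers that modify or rebuild the data still end up with their own copies.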

I expect that using fewer cores (e.g. 4) will give a more acceptable memory usage than what it is currently requesting for you.

Let us know if that helps or not!

apposada commented 3 years ago

Hi Maarten,

Thanks a lot for the suggestion! I'll re-run it using 4 threads and let you know how it goes.

All the best, Alberto

simonvh commented 3 years ago

@apposada if the solution from @Maarten-vd-Sande doesn't help, there is a fix that you can try. You'll have to run the following command in your environment to install the develop version of ANANSE with this fix:

pip install git+https://github.com/vanheeringen-lab/ANANSE.git@refs/pull/97/merge

We haven't thoroughly tested this yet, but it should keep memory usage within ~12-15 GB. The -n, --ncore parameter is removed in this version, as this implementation is not CPU-bound and changing it doesn't affect the running time.

simonvh commented 3 years ago

@apposada scratch that, there is an occasional error that pops up in that version that seems to affect the results as well. Still have to look into that more deeply.

apposada commented 3 years ago

Hi Simon, Maarten,

I'm afraid Maarten's suggestion did not work either. I ran it with 4 cores, but the memory was still exhausted.

Again, when looking at dmesg:

[10795571.015983] CPU: 51 PID: 40267 Comm: ananse Not tainted 3.10.0-514.10.2.el7.x86_64 #1
[10795571.016564] [40047]  1434 40047 26446001 23052926   46085        0             0 ananse
[10795571.016584] Out of memory: Kill process 40047 (ananse) score 773 or sacrifice child
[10795571.025703] Killed process 40047 (ananse) total-vm:105784004kB, anon-rss:92211704kB, file-rss:0kB, shmem-rss:0kB
[10795571.045796] [40262]  1434 40047 26446001 23052994   46085        0             0 ananse
[10795571.045829] Out of memory: Kill process 40289 (ananse) score 773 or sacrifice child

Thanks and all the best, Alberto

simonvh commented 3 years ago

Hi @apposada, there is somewhat of a fix that you can install and try out as follows:

# activate your environment
pip install git+https://github.com/vanheeringen-lab/ANANSE.git@refs/pull/97/merge

This version has an option to control memory usage (at the expense of speed). Each additional core uses ~12 GB of memory. Running it with one core will likely take ~1 hour; with 4 cores it will use ~48 GB of memory, but run in ~15 minutes.

The strange thing is that 8 GB of that 12 GB per core is not ANANSE-related. It comes from the dask framework that we use, but it is really unclear how it arises and how it can be fixed. It seems to be some strange bug, and Google turns up nothing.
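For reference, dask's distributed scheduler does expose per-worker memory limits. The snippet below is a generic sketch with illustrative values, not how ANANSE configures its cluster internally:

# Generic dask.distributed sketch (illustrative values only; not ANANSE's internal setup).
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,           # fewer workers -> lower total memory footprint
        threads_per_worker=1,
        memory_limit="12GB",   # per-worker cap, roughly the figure quoted above
    )
    client = Client(cluster)
    print(client)              # reports workers, threads and memory limits
    client.close()
    cluster.close()

The nanny process restarts any worker that exceeds its budget, which is the mechanism behind the "Worker exceeded 95% memory budget. Restarting" warnings that show up later in this thread.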

cdsoria commented 3 years ago

Hi @simonvh, I was actually waiting for this fix, as I have the same memory issue. Running ananse network, I get to 65% and then it is killed; on the high-memory cloud instance it is killed at 25%. Thank you!!! However, with this fix it breaks earlier, with the following error (on my local computer):

ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 8 -o fibroblast.network.txt

2021-06-25 18:49:44 | INFO | Loading expression
2021-06-25 18:50:04 | INFO | creating expression dataframe
2021-06-25 18:51:20 | INFO | Aggregate binding
2021-06-25 18:51:20 | INFO | reading enhancers
2021-06-25 18:51:36 | INFO | Reading binding file...
2021-06-25 18:51:42 | INFO | Grouping by tf and target gene...
2021-06-25 18:51:42 | INFO | Done grouping...
2021-06-25 18:51:42 | INFO | Reading factor activity
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 196, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/bin/ananse", line 326, in <module>
    args.func(args)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 43, in network
    b.run_network(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 594, in run_network
    progress(result)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 439, in progress
    TextProgressBar(futures, complete=complete, **kwargs)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 122, in __init__
    loop_runner.run_sync(self.listen)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 494, in run_sync
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
    raise exc.with_traceback(tb)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 321, in f
    result[0] = yield future
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 64, in listen
    self.comm = await connect(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 325, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:52255 after 10 s
/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

apposada commented 3 years ago

Hi Simon,

I tried updating using pip as you suggested, and was then asked to install dask distributed (which I did using python -m pip install "dask[distributed]" --upgrade).

After installing and re-trying with both 1 and 4 cores, I got the following error. It seems unrelated to the error reported by @cdsoria, and potentially has nothing to do with the OOM issue, but I am reporting it here anyway. No luck so far, it seems...

Thanks

(ananse_pip_venv) [aperpos@nodos ananse]$ ananse network -n 4 -b 20210615_EB.binding/binding.tsv -e rna/02_EB_02.tsv -g genome/genome.fa -a annot/annot.bed12.bed -o 20210626_EB.network
2021-06-26 18:20:26 | INFO | Loading expression
Traceback (most recent call last):
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/bin/ananse", line 326, in <module>
    args.func(args)
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/ananse/commands/network.py", line 44, in network
    outfile=args.outfile,
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/ananse/network.py", line 556, in run_network
    fin_expression, tfs=tfs, rank=True, bindingfile=binding
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/ananse/network.py", line 495, in create_expression_network
    tf_fname = self._save_temp_expression(tmp, "tf")
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/ananse/network.py", line 421, in _save_temp_expression
    tmp[f"{name}_expression"] = minmax_scale(tmp[f"{name}_expression"].rank())
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/sklearn/preprocessing/_data.py", line 546, in minmax_scale
    dtype=FLOAT_DTYPES, force_all_finite='allow-nan')
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/ska/aperpos/Def_Pfla/outputs/ananse/ananse_pip_venv/lib64/python3.6/site-packages/sklearn/utils/validation.py", line 729, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
simonvh commented 3 years ago

Thanks so much @cdsoria and @apposada for providing this input and these bug reports. I'm sorry it's such a hassle for you, but I'm thankful for the feedback, which will hopefully allow us to fix these issues.

@cdsoria can you try with a lower value for n, say -n 2?

@apposada I suspect this has something to do with the format of the input files. Would it be possible for you to provide the output of head for all the input files? Another possibility is that support for non-human genomes is not yet as stable as we'd like it to be. In that case as well, a sample of the input would help.

cdsoria commented 3 years ago

Hello @simonvh. No problem at all, happy to help. I tried with -n 2, -n 1 and without -n, but I get the same error. It seems to be something about failing to connect. Apologies, I don't really understand the error well.

ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 1 -o fibroblast.network.txt

2021-06-28 13:22:03 | INFO | Loading expression
2021-06-28 13:22:22 | INFO | creating expression dataframe
2021-06-28 13:23:46 | INFO | Aggregate binding
2021-06-28 13:23:46 | INFO | reading enhancers
2021-06-28 13:24:03 | INFO | Reading binding file...
2021-06-28 13:24:08 | INFO | Grouping by tf and target gene...
2021-06-28 13:24:08 | INFO | Done grouping...
2021-06-28 13:24:08 | INFO | Reading factor activity
Traceback (most recent call last):
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 196, in read
frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/bin/ananse", line 326, in
args.func(args)
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 43, in network
b.run_network(
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 594, in run_network
progress(result)
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 439, in progress
TextProgressBar(futures, complete=complete, **kwargs)
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 122, in init
loop_runner.run_sync(self.listen)
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 494, in run_sync
return sync(self.loop, func, *args, **kwargs)
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
raise exc.with_traceback(tb)
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 321, in f
result[0] = yield future
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 64, in listen
self.comm = await connect(
File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 325, in connect
raise IOError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:54140 after 10 s
/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
simonvh commented 3 years ago

@cdsoria The error is completely unclear to me, but it seems like the jobs are somehow cancelled, after which this error is thrown. To check whether this is related to available resources, can you try the following? It will download a small test data set and run ananse network, just to see if that works:

git clone https://github.com/simonvh/ANANSE
cd ANANSE/
git checkout network_memory
ananse network -b tests/data/network/binding.tsv.gz -e tests/data/network/heart_expression.txt -o test_network.txt -n 2
cdsoria commented 3 years ago

Thanks @simonvh, so it gets a bit further but stops at "Computing network".

ananse network -b tests/data/network/binding.tsv.gz -e tests/data/network/heart_expression.txt -o test_network.txt -n 2

2021-06-28 16:27:05 | INFO | Loading expression
2021-06-28 16:27:05 | INFO | creating expression dataframe
2021-06-28 16:27:05 | INFO | Aggregate binding
2021-06-28 16:27:05 | INFO | reading enhancers
2021-06-28 16:27:08 | INFO | Reading binding file...
2021-06-28 16:27:08 | INFO | Grouping by tf and target gene...
2021-06-28 16:27:08 | INFO | Done grouping...
2021-06-28 16:27:08 | INFO | Reading factor activity
2021-06-28 16:27:11 | INFO | Using tf_expression, target_expression, weighted_binding, activity
2021-06-28 16:27:11 | INFO | Computing network
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/bin/ananse", line 326, in <module>
    args.func(args)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 43, in network
    b.run_network(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 620, in run_network
    os.makedirs(dirname, exist_ok=True)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''

cdsoria commented 3 years ago

@simonvh Just to say that when I go back to my previous version of ANANSE (pip install git+https://github.com/vanheeringen-lab/ANANSE.git@9de0982), your test data runs OK. I am re-running with the full data at the moment, just to make sure that the memory runs out as before. Yep, with the old version it runs out at 66% Completed.

2021-06-28 17:25:58 | INFO | Loading expression
2021-06-28 17:25:58 | INFO | creating expression dataframe
2021-06-28 17:25:58 | INFO | Aggregate binding
2021-06-28 17:25:58 | INFO | reading enhancers
2021-06-28 17:26:00 | INFO | Reading binding file...
2021-06-28 17:26:00 | INFO | Grouping by tf and target gene...
2021-06-28 17:26:00 | INFO | Done grouping...
2021-06-28 17:26:00 | INFO | Reading factor activity
2021-06-28 17:26:00 | INFO | Computing network
[########################################] | 100% Completed | 3.3s
2021-06-28 17:26:04 | INFO | Using tf_expression, target_expression, weighted_binding, activity
2021-06-28 17:26:04 | INFO | Saving file

simonvh commented 3 years ago

Okay, @cdsoria, another try. The high memory usage was related to a really obscure issue with another library. I've fixed it, and as a result the memory usage (at least on our server) has decreased significantly.

Can you run the following command again and check if it works after that?

pip install git+https://github.com/vanheeringen-lab/ANANSE.git@refs/pull/97/merge
cdsoria commented 3 years ago

@simonvh Thank you so much again. This time it gets further, to 99%, but then it hangs there as it keeps stopping and restarting workers. I tested this on my local computer (error attached) and also on an AWS EC2 r5d.24xlarge instance, with a similar error once it reaches 99%. Also tested with -n 2.

ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt

It gives this error at startup, but it does actually start:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/bin/ananse", line 19, in <module>
    from ananse import commands, __version__
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/commands/__init__.py", line 1, in <module>
    from ananse.commands.binding import binding  # noqa
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/commands/binding.py", line 7, in <module>
    from ananse.peakpredictor import predict_peaks
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/peakpredictor.py", line 10, in <module>
    from gimmemotifs.motif import read_motifs
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/gimmemotifs/__init__.py", line 61, in <module>
    from . import denovo  # noqa: F401
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/gimmemotifs/denovo.py", line 51, in <module>
    from gimmemotifs.stats import calc_stats, rank_motifs, write_stats
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/gimmemotifs/stats.py", line 10, in <module>
    from gimmemotifs.scanner import scan_to_best_match, Scanner
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/gimmemotifs/scanner.py", line 58, in <module>
    config = MotifConfig()
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/gimmemotifs/config.py", line 95, in __init__
    self._upgrade_config()
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/gimmemotifs/config.py", line 98, in _upgrade_config
    if "width" in self.config["params"]:
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/configparser.py", line 960, in __getitem__
    raise KeyError(key)
KeyError: 'params'
distributed.nanny - WARNING - Restarting worker
2021-06-29 20:47:58 | INFO | Loading expression
2021-06-29 20:47:58 | INFO | Aggregate binding
2021-06-29 20:47:58 | INFO | reading enhancers
2021-06-29 20:48:14 | INFO | Reading binding file...
2021-06-29 20:48:20 | INFO | Grouping by tf and target gene...
2021-06-29 20:48:20 | INFO | Done grouping...
2021-06-29 20:48:20 | INFO | Reading factor activity
2021-06-29 20:48:20 | INFO | Computing network
[###### ] | 17% Completed | 2min 37.0sdistributed.worker - WARNING - Worker is at 84% memory usage. Pausing worker. Process memory: 9.44 GiB -- Worker memory limit: 11.18 GiB
[###### ] | 17% Completed | 2min 39.3sdistributed.worker - WARNING - Worker is at 54% memory usage. Resuming worker. Process memory: 6.13 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 13min 16.5sdistributed.worker - WARNING - Worker is at 86% memory usage. Pausing worker. Process memory: 9.65 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 13min 28.9sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####################################### ] | 99% Completed | 13min 29.9sdistributed.nanny - WARNING - Restarting worker
[####################################### ] | 99% Completed | 13min 37.1sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 7.99 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 13min 47.3sdistributed.worker - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 9.15 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 13min 49.9sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 10.26 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 10.26 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 13min 50.3sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####################################### ] | 99% Completed | 13min 51.1sdistributed.nanny - WARNING - Restarting worker
[####################################### ] | 99% Completed | 16min 52.8sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 9.21 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 16min 53.0sdistributed.worker - WARNING - Worker is at 82% memory usage. Pausing worker. Process memory: 9.21 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 16min 59.3sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.18 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Worker is at 73% memory usage. Resuming worker. Process memory: 8.18 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 23min 29.2s^CTraceback (most recent call last):

simonvh commented 3 years ago

Huh, that worker memory is too high; it should not reach more than 4-5 GB. Can you double-check something for me, and post the output of this:

cat /Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/__init__.py
simonvh commented 3 years ago

You can also try running the command like this

OMP_NUM_THREADS=1 ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt
cdsoria commented 3 years ago

Huh, that worker memory is too high; it should not reach more than 4-5 GB. Can you double-check something for me, and post the output of this:

cat /Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/__init__.py

Hi @simonvh, this is the output:

cat /Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/__init__.py

from ._version import get_versions
import os

# This is here to prevent very high memory usage on numpy import.
# On a machine with many cores, just importing numpy can result in up to
# 8GB of (virtual) memory. This wreaks havoc on management of the dask
# workers.
os.environ["OMP_NUM_THREADS"] = "1"

__version__ = get_versions()["version"]
del get_versions
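
As an aside, these thread-count variables are read when the underlying BLAS/OpenMP runtime initializes, so they generally only take effect if they are set before numpy is imported, which is exactly what the snippet above relies on. A minimal, generic sketch of that ordering (not ANANSE code):

# The environment variables must be set before numpy (and its BLAS backend) is imported.
import os

os.environ["OMP_NUM_THREADS"] = "1"       # must happen first
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (import deliberately after the environment is set)

x = np.random.rand(2_000, 2_000)
y = x @ x  # now runs single-threaded instead of spawning one BLAS thread per core
print(y.shape)
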
cdsoria commented 3 years ago

You can also try running the command like this

OMP_NUM_THREADS=1 ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt

Hi again @simonvh. Still the same problem persists, unfortunately:

OMP_NUM_THREADS=1 ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt

2021-06-30 10:28:58 | INFO | Loading expression
2021-06-30 10:28:59 | INFO | Aggregate binding
2021-06-30 10:28:59 | INFO | reading enhancers
2021-06-30 10:29:15 | INFO | Reading binding file...
2021-06-30 10:29:21 | INFO | Grouping by tf and target gene...
2021-06-30 10:29:21 | INFO | Done grouping...
2021-06-30 10:29:21 | INFO | Reading factor activity
2021-06-30 10:29:22 | INFO | Computing network
[# ] | 4% Completed | 39.9sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[# ] | 4% Completed | 40.6sdistributed.nanny - WARNING - Restarting worker
[## ] | 7% Completed | 1min 3.8sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[## ] | 7% Completed | 1min 4.5sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:49371
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2334, in gather_dep
    response = await get_data_from_worker(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3753, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3733, in _get_data
    response = await send_recv(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.nanny - WARNING - Restarting worker
[####### ] | 18% Completed | 3min 31.3sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.92 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.92 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 15min 3.4sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####################################### ] | 99% Completed | 15min 4.1sdistributed.nanny - WARNING - Restarting worker
[####################################### ] | 99% Completed | 15min 11.0sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.08 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.08 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 15min 11.1sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.08 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 15min 11.3sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.30 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 15min 18.5sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 10.48 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Worker is at 93% memory usage. Pausing worker. Process memory: 10.48 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 10.48 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 15min 18.7sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 10.14 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 15min 19.2sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####################################### ] | 99% Completed | 15min 20.0sdistributed.nanny - WARNING - Restarting worker
[####################################### ] | 99% Completed | 17min 44.9sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 9.19 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 17min 45.3sdistributed.worker - WARNING - Worker is at 82% memory usage. Pausing worker. Process memory: 9.19 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 18min 4.9sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 6.77 GiB -- Worker memory limit: 11.18 GiB
distributed.worker - WARNING - Worker is at 60% memory usage. Resuming worker. Process memory: 6.77 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 18min 14.8sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[####################################### ] | 99% Completed | 18min 15.7sdistributed.worker - ERROR - failed during get data with tcp://127.0.0.1:49417 -> tcp://127.0.0.1:49372
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 971, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1148, in write_to_fd
    return self.socket.send(data)  # type: ignore
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1430, in get_data
    response = await comm.read(deserializers=serializers)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:49372
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2334, in gather_dep
    response = await get_data_from_worker(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3753, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3733, in _get_data
    response = await send_recv(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:49665 -> tcp://127.0.0.1:49372
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1430, in get_data
    response = await comm.read(deserializers=serializers)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 54] Connection reset by peer
[####################################### ] | 99% Completed | 18min 16.2sdistributed.nanny - WARNING - Restarting worker
[####################################### ] | 99% Completed | 18min 16.3sdistributed.worker - ERROR - failed during get data with tcp://127.0.0.1:49624 -> tcp://127.0.0.1:49372
Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 971, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1148, in write_to_fd
    return self.socket.send(data)  # type: ignore
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1430, in get_data
    response = await comm.read(deserializers=serializers)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
[####################################### ] | 99% Completed | 20min 19.4sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 7.32 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 21min 26.2sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 9.08 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 21min 26.4sdistributed.worker - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 9.08 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 21min 32.0sdistributed.worker - WARNING - Worker is at 74% memory usage. Resuming worker. Process memory: 8.33 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 21min 33.7sdistributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 8.56 GiB -- Worker memory limit: 11.18 GiB
[####################################### ] | 99% Completed | 25min 13.8s^CTraceback (most recent call last):
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/bin/ananse", line 326, in <module>
    args.func(args)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 41, in network
    b.run_network(
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 595, in run_network
    progress(result)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 439, in progress
    TextProgressBar(futures, complete=complete, **kwargs)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/diagnostics/progressbar.py", line 122, in __init__
    loop_runner.run_sync(self.listen)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 494, in run_sync
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 335, in sync
    e.wait(10)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "/Users/carmendiaz/opt/miniconda3/envs/ananse/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

simonvh commented 3 years ago

So you also get this on an Amazon EC2 instance right? That should make it possible to see if we can reproduce and thereby test this. Do you have the exact steps that you use to create your environment?

cdsoria commented 3 years ago

So you also get this on an Amazon EC2 instance right? That should make it possible to see if we can reproduce and thereby test this. Do you have the exact steps that you use to create your environment?

@simonvh good news!!! Sorry, my bad, I should have also tried it on the EC2 instance. So, yay, it worked! I am posting the output for reference. It was also very fast. Thank you so much!!!!

cat /home/ubuntu/miniconda3/envs/ananse/lib/python3.9/site-packages/ananse/__init__.py
from ._version import get_versions
import os

# This is here to prevent very high memory usage on numpy import.
# On a machine with many cores, just importing numpy can result in up to
# 8GB of (virtual) memory. This wreaks havoc on management of the dask
# workers.
os.environ["OMP_NUM_THREADS"] = "1"

__version__ = get_versions()["version"]
del get_versions

and the command:
OMP_NUM_THREADS=1 ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt
2021-06-30 11:22:34 | INFO | Loading expression
2021-06-30 11:22:34 | INFO | Aggregate binding
2021-06-30 11:22:34 | INFO | reading enhancers
2021-06-30 11:22:51 | INFO | Reading binding file...
2021-06-30 11:22:57 | INFO | Grouping by tf and target gene...
2021-06-30 11:22:57 | INFO | Done grouping...
2021-06-30 11:22:57 | INFO | Reading factor activity
2021-06-30 11:22:57 | INFO | Computing network
2021-06-30 11:28:43 | INFO | Using tf_expression, target_expression, weighted_binding, activity
2021-06-30 11:28:46 | INFO | Writing network
simonvh commented 3 years ago

Great! Just out of curiosity, what is the total memory size of the computer on which it failed?

cdsoria commented 3 years ago

Great! Just out of curiosity, what is the total memory size of the computer on which it failed?

Sure, this is my hardware overview:

Model Name: MacBook Pro
Model Identifier: MacBookPro16,1
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 64 GB

simonvh commented 3 years ago

Thanks. Mac is indeed a platform that we don't test. If you have some time, can you check if the following works on your Mac?

export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt
cdsoria commented 3 years ago

Thanks. Mac is indeed a platform that we don't test. If you have some time, can you check if the following works on your Mac?

export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
ananse network -b fibroblast.binding/binding.tsv -e ANANSE_example_data/RNAseq/fibroblast*TPM.txt -n 4 -o fibroblast.network.txt

Of course, happy to try. With those parameters it actually does worse: it gets to 8% and then starts throwing the errors.

cdsoria commented 3 years ago

@simonvh I'm not sure why, but the fix you made (pip install git+https://github.com/vanheeringen-lab/ANANSE.git@refs/pull/97/merge) is not working anymore. I tested it for a colleague. Not sure if this is expected after a while. Sorry for so many reports. Thank you.

simonvh commented 3 years ago

The changes have been merged into the develop branch for the next release. You can now install it with:

pip install git+https://github.com/vanheeringen-lab/ANANSE.git@3ec07af

(This link will be stable btw)

cdsoria commented 3 years ago

Great thank you!!!!

simonvh commented 3 years ago

Based on this issue, and on other reports of memory issues, we completely overhauled ananse binding and ananse network. For now it is still in the development branch, but if you feel adventurous you can already try it:

pip install git+https://github.com/vanheeringen-lab/ANANSE.git@develop

Note: this is incompatible with the output from the current version of ananse binding. You need to rerun ananse binding, which now generates a binding.h5 file that can be used as input for ananse network.
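
If you want to peek at what the new binding.h5 contains before passing it to ananse network, a generic h5py sketch such as the one below will list the datasets in the file; the path is only an example and the internal layout of the file is not documented here.

# Generic HDF5 inspection sketch (binding.h5's layout is not described in this thread;
# this just lists whatever groups and datasets the file happens to contain).
import h5py

with h5py.File("fibroblast.binding/binding.h5", "r") as f:  # example path
    f.visititems(lambda name, obj: print(name, obj))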

Arts-of-coding commented 3 years ago

Dear @simonvh ,

I also started having memory issues (almost all of the RAM on the server being used) with the current version of ananse. I saw this thread and installed the changes that should fix the need for so much RAM by executing:

pip install git+https://github.com/vanheeringen-lab/ANANSE.git@3ec07af

Unfortunately this did not fix it for me. I got errors regarding workers. Is there any other way to fix this?

This is the output of my __init__.py file after the pip install:

(base) julian@cn45:~$ cat /vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/ananse/__init__.py
from ._version import get_versions
import os

# This is here to prevent very high memory usage on numpy import.
# On a machine with many cores, just importing numpy can result in up to
# 8GB of (virtual) memory. This wreaks havoc on management of the dask
# workers.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

__version__ = get_versions()["version"]
del get_versions

It seems that with a low number of cores it does not even begin the network calculation, and with a high number of cores it calculates only up to a certain point (see below).

Command with two cores (which was manually interrupted, because the last line ran for over 8 minutes without updates):

(ananse) julian@cn45:/ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks$ nice -15 OMP_NUM_THREADS=1 ananse network -e /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks/CjStpm.tsv -b /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/binding.tsv -a /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.annotation.bed -o /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/full_network_includeprom.txt -g /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.fa -n 2

2021-07-15 11:22:18 | INFO | Loading expression
2021-07-15 11:22:18 | INFO | Aggregate binding
2021-07-15 11:22:18 | INFO | reading enhancers
2021-07-15 11:22:50 | INFO | Reading binding file...
2021-07-15 11:23:10 | INFO | Grouping by tf and target gene...
2021-07-15 11:23:10 | INFO | Done grouping...
2021-07-15 11:23:11 | INFO | Reading factor activity
2021-07-15 11:23:11 | INFO | Computing network
distributed.worker - WARNING - Worker is at 85% memory usage. Pausing worker.  Process memory: 9.54 GiB -- Worker memory limit: 11.18 GiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41309
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
    response = await send_recv(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:41387 -> tcp://127.0.0.1:41309
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1431, in get_data
    response = await comm.read(deserializers=serializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.nanny - WARNING - Restarting worker
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2491, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
^C

This is my console command (using 6 cores); the full Dask log follows, with a short note on the worker memory limits after it:

(ananse) julian@cn45:/ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks$ nice -15 ananse network -e /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/jupyter_notebooks/CjStpm.tsv -b /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/binding.tsv -a /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.annotation.bed -o /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/lako2021/ANANSE/outs2/CjS/full_network_includeprom.txt -g /ceph/rimlsfnwi/data/moldevbio/zhou/jarts/data/genome/hg38/hg38.fa -n 6
2021-07-15 11:32:52 | INFO | Loading expression
2021-07-15 11:32:52 | INFO | Aggregate binding
2021-07-15 11:32:52 | INFO | reading enhancers
2021-07-15 11:33:25 | INFO | Reading binding file...
2021-07-15 11:33:50 | INFO | Grouping by tf and target gene...
2021-07-15 11:33:50 | INFO | Done grouping...
2021-07-15 11:33:51 | INFO | Reading factor activity
2021-07-15 11:33:51 | INFO | Computing network
[###                                     ] | 9% Completed | 20.2sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42231
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 196, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3731, in _get_data
    comm = await rpc.connect(worker)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 1012, in connect
    comm = await connect(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/core.py", line 325, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42231 after 10 s
[######                                  ] | 16% Completed | 41.7sdistributed.worker - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 9.01 GiB -- Worker memory limit: 11.18 GiB
[######                                  ] | 17% Completed |  1min  1.2sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[#######                                 ] | 17% Completed |  1min  2.1sdistributed.nanny - WARNING - Restarting worker
[#######                                 ] | 17% Completed |  1min  6.1sdistributed.worker - WARNING - Worker is at 37% memory usage. Resuming worker. Process memory: 4.16 GiB -- Worker memory limit: 11.18 GiB
[#######                                 ] | 17% Completed |  1min  6.3sdistributed.worker - ERROR - failed during get data with tcp://127.0.0.1:41013 -> tcp://127.0.0.1:38655
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 971, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1148, in write_to_fd
    return self.socket.send(data)  # type: ignore
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1431, in get_data
    response = await comm.read(deserializers=serializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
[#######                                 ] | 17% Completed |  1min  6.4sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38655
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
    response = await send_recv(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
[#######                                 ] | 18% Completed |  1min 18.8sdistributed.worker - ERROR - failed during get data with tcp://127.0.0.1:34621 -> tcp://127.0.0.1:38655
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 1431, in get_data
    response = await comm.read(deserializers=serializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38655
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
    response = await send_recv(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
[#######                                 ] | 18% Completed |  1min 21.1sdistributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
[#######                                 ] | 19% Completed |  1min 37.9sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[#######                                 ] | 19% Completed |  1min 39.1sdistributed.nanny - WARNING - Restarting worker
[#######                                 ] | 19% Completed |  1min 47.4sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[#######                                 ] | 19% Completed |  1min 48.4sdistributed.nanny - WARNING - Restarting worker
[#######                                 ] | 19% Completed |  1min 56.6sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
[#######                                 ] | 19% Completed |  1min 57.0sdistributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43089
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
    response = await send_recv(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
[#######                                 ] | 19% Completed |  1min 58.2sdistributed.nanny - WARNING - Restarting worker
[#######                                 ] | 19% Completed |  2min  0.3sdistributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker% Completed |  2min  1.6s
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/bin/ananse", line 326, in <module>
    args.func(args)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/ananse/commands/network.py", line 41, in network
    b.run_network(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/ananse/network.py", line 616, in run_network
    result = result.compute()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 2705, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 2014, in gather
    return self.sync(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 855, in sync
    return sync(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
    raise exc.with_traceback(tb)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils.py", line 321, in f
    result[0] = yield future
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/client.py", line 1879, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('merge-92b1078b3a42900fd782ac0b2b147609', 62)", <WorkerState 'tcp://127.0.0.1:34569', name: 4, memory: 0, processing: 75>)
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:34569
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 2335, in gather_dep
    response = await get_data_from_worker(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3754, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/worker.py", line 3734, in _get_data
    response = await send_recv(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 647, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 498, in handle_comm
    result = await result
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5002, in add_client
    self.remove_client(client=client)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5029, in remove_client
    self.client_releases_keys(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 4769, in client_releases_keys
    self.transitions(recommendations)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 6683, in transitions
    self.send_all(client_msgs, worker_msgs)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5265, in send_all
    w = stream_comms[worker]
KeyError: None
tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection.<locals>.<lambda> at 0x1507069f4e50>, <Task finished name='Task-56' coro=<BaseTCPListener._handle_stream() done, defined at /vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py:476> exception=KeyError(None)>)
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/tornado/tcpserver.py", line 331, in <lambda>
    gen.convert_yielded(future), lambda f: f.result()
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/comm/tcp.py", line 493, in _handle_stream
    await self.comm_handler(comm)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/core.py", line 498, in handle_comm
    result = await result
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5002, in add_client
    self.remove_client(client=client)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5029, in remove_client
    self.client_releases_keys(
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 4769, in client_releases_keys
    self.transitions(recommendations)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 6683, in transitions
    self.send_all(client_msgs, worker_msgs)
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/site-packages/distributed/scheduler.py", line 5265, in send_all
    w = stream_comms[worker]
KeyError: None
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
  File "/vol/mbconda/julian/envs/ananse/lib/python3.9/threading.py", line 954, in _bootstrap_inner
simonvh commented 3 years ago

Can you try the new (just released!) version, 0.3.0?
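
Assuming ANANSE was installed from PyPI, upgrading would typically be a matter of running pip install --upgrade ananse (or the conda/bioconda equivalent, depending on how the environment was created).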