nickdeveaux commented 6 years ago

@kostyat @dayanne-castro

Calculating MI and CLR and then sending the matrices to the workers transferred a large amount of data to each worker per bootstrap. For example, for a 60k gene by 150 sample input file, the MI and CLR matrices summed to 0.6 GB, and grew to 1.6 GB once they were pickled. This was sent to 70 workers across 20 bootstraps on the cluster, leading to a massive (>10x) slowdown.
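
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes one full pickled copy of the matrices goes to every worker in every bootstrap, which is my reading of the description rather than a measured number; the variable names are illustrative.

```python
# Figures quoted in the comment above.
pickled_gb_per_transfer = 1.6  # MI + CLR matrices after pickling
n_workers = 70
n_bootstraps = 20

# Assumption: every worker receives a full copy in every bootstrap.
total_gb = pickled_gb_per_transfer * n_workers * n_bootstraps
print("approx. total transferred: %.0f GB" % total_gb)
```

Under that assumption the run moves roughly 2,240 GB through the key-value store, which is consistent with the slowdown being dominated by data transfer.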

Now each worker calculates MI and CLR independently and waits for a new special key (bootstrap %idx) before moving forward.
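
A minimal sketch of that pattern, for illustration only. `ToyKVS` is a hypothetical in-memory stand-in, not the kvsstcp API, and `compute_mi_clr` is a placeholder for the real calculation. The sketch also assumes that rank 0 is the one that posts the per-bootstrap key; the actual code in this PR may do that differently.

```python
import threading


class ToyKVS(object):
    """Hypothetical in-memory stand-in for a shared key-value store."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def put(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def view(self, key):
        # Block until the key exists, then return its value without removing it.
        with self._cond:
            while key not in self._data:
                self._cond.wait()
            return self._data[key]


def compute_mi_clr(bootstrap_idx):
    # Placeholder for the real MI/CLR computation, done locally by each worker.
    return ("clr-%d" % bootstrap_idx, "mi-%d" % bootstrap_idx)


def worker(kvs, rank, n_bootstraps, log):
    for idx in range(n_bootstraps):
        key = "bootstrap %d" % idx
        if rank == 0:
            kvs.put(key, True)  # assumption: rank 0 releases each bootstrap
        kvs.view(key)  # every worker waits for the small key
        clr, mi = compute_mi_clr(idx)  # the large matrices never leave the worker
        log.append((rank, idx, clr, mi))


kvs, log = ToyKVS(), []
threads = [threading.Thread(target=worker, args=(kvs, r, 2, log)) for r in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The only thing shared between workers is a tiny key per bootstrap, so the cost moves from network transfer to repeated computation on each worker.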

codecov-io commented 6 years ago

Codecov Report

Merging #63 into master will decrease coverage by 0.09%. The diff coverage is 0%.

@@            Coverage Diff            @@
##           master      #63     +/-   ##
=========================================
- Coverage   70.54%   70.44%   -0.1%     
=========================================
  Files          18       18             
  Lines        1480     1482      +2     
=========================================
  Hits         1044     1044             
- Misses        436      438      +2

Impacted Files	Coverage Δ
inferelator_ng/bbsr_tfa_workflow.py	`0% <0%> (ø)`	:arrow_up:


dayanne-castro commented 6 years ago

👍

kostyat commented 6 years ago

I tried this on NYU HPC. I submitted the following script with sbatch:

#!/bin/sh

#SBATCH --nodes=3
#SBATCH --tasks-per-node=4
#SBATCH --mem=10GB
#SBATCH --time=2:00:00
#SBATCH --job-name=Infer_Test
#SBATCH --output=Infer_Test_KVS_10GB_3_nodes_4_tasks_pull63.out

module purge
module load r/intel/3.4.2 python/intel/2.7.12 bedtools/intel/2.26.0
source /home/kmt331/inferelator_ng/py2.7/bin/activate

cd /home/kmt331/inferelator_ng
export PYTHONPATH=$PYTHONPATH:$(pwd)/kvsstcp

time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n '${SLURM_NTASKS}' python bsubtilis_bbsr_workflow_runner.py'

When I ran this script against the original code on the master branch, everything ran fine and the results looked fine. But when I switched to the nickdeveaux-ndv_dont_share_mi_clr_but_still_lock_per_bootstrap branch (with the code in this pull request), I got the following error (I'm not pasting the entire output here, just the part that looks relevant):

Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
    workflow.run() 
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
Traceback (most recent call last):
  File "bsubtilis_bbsr_workflow_runner.py", line 10, in <module>
    workflow.run() 
    workflow.run() 
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
  File "/home/kmt331/inferelator_ng/inferelator_ng/bbsr_tfa_workflow.py", line 49, in run
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
    (self.clr_matrix, self.mi_matrix) = self.mi_clr_driver.run(X, Y)
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
  File "/home/kmt331/inferelator_ng/inferelator_ng/mi_R.py", line 83, in run
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    matrix_data_frame = pd.read_csv(matrix_path, sep='\t')
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
    return _read(filepath_or_buffer, kwds)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
    return _read(filepath_or_buffer, kwds)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    data = parser.read()
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    ret = self._engine.read(nrows)
  File "/share/apps/python/2.7.12/intel/lib/python2.7/site-packages/pandas-0.19.1-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9935)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10193)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10921)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10792)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55612)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55614)
2018-05-29 18:44:01,710 INFO     kvs            : Closing connection from ('172.16.2.127', 55610)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25929)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 240 fields in line 1529, saw 281

Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
srun: error: c41-06: tasks 5-7: Exited with exit code 1
srun: Terminating job step 6497421.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 6497421.0 ON c41-04 CANCELLED AT 2018-05-29T18:44:01 ***
2018-05-29 18:44:01,881 INFO     kvs            : Closing connection from ('172.16.2.129', 55022)
2018-05-29 18:44:01,890 INFO     kvs            : Closing connection from ('172.16.2.127', 55616)

... etc... ...

srun: error: c41-04: tasks 0-3: Killed
srun: error: c41-12: tasks 8-11: Killed
Traceback (most recent call last):
2018-05-29 18:44:02,022 INFO     kvs            : Server shutting down
  File "/home/kmt331/inferelator_ng/kvsstcp/kvsstcp.py", line 605, in <module>
    subprocess.check_call(args.execcmd, shell=True, env=t.env())
  File "/share/apps/python/2.7.12/intel/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'srun -n 12 python bsubtilis_bbsr_workflow_runner.py' returned non-zero exit status 1

real    0m15.581s
user    0m0.054s
sys     0m0.049s
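
The `Expected 240 fields in line 1529, saw 281` message means one row of the tab-separated matrix file had more fields than the parser expected. A small diagnostic sketch for locating such rows in a file is below. `find_ragged_rows` is a hypothetical helper, not part of inferelator_ng or pandas, and it splits naively on the separator, so it does not handle quoted fields. It only reports where the file is ragged; it does not explain why.

```python
import os
import tempfile


def find_ragged_rows(path, sep="\t"):
    """Return (line_number, field_count) for rows whose width differs from the first row."""
    ragged = []
    expected = None
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            stripped = line.rstrip("\r\n")
            if not stripped:
                continue  # skip blank lines
            n_fields = len(stripped.split(sep))
            if expected is None:
                expected = n_fields  # the first non-blank row sets the expected width
            elif n_fields != expected:
                ragged.append((line_number, n_fields))
    return ragged


# Tiny demonstration on a throwaway file whose third line has one field too many.
fd, demo_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "w") as out:
    out.write("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n")
result = find_ragged_rows(demo_path)
os.remove(demo_path)
print(result)
```

Running this against the file at `matrix_path` right after it is written would show whether the file is already malformed on disk at that point.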

kostyat commented 6 years ago

@nickdeveaux any ideas why I'm getting that error?

kostyat commented 6 years ago

Has anybody else tried this? Does it work for anyone else? I am still getting the same error on NYU HPC. This time I was working on the InfereCLaDR branch and manually made the same changes that you did in bbsr_tfa_runner.py, and I still got the same error.

simonsfoundation / inferelator_ng

Ndv dont share mi clr but still lock per bootstrap #63
