madminer-tool / madminer-workflow-ph

Madminer physics sub-workflow

Unexpected memory usage when scaling up via num_generation_jobs #8

Closed khurtado closed 3 years ago

khurtado commented 3 years ago

Hello,

This was discussed via Slack at some point, so I just wanted to open an issue so it is not forgotten. When scaling up a workflow via num_generation_jobs, the number of jobs in the physics stage increases as expected, but the memory usage per job also increases considerably.

E.g.: if num_generation_jobs is increased by a factor of 10 (from 6 to 60), the memory usage per Delphes job goes from ~700 MB to ~7 GB:

num_generation_jobs: 6

```
12        733         /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed madminertool/madminer-workflow-ph:0.3.0 sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_3 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/workflow_ph/pythia_0/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/workflow_ph/delphes_0'
```

num_generation_jobs: 60

```
122       7325        /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90 madminertool/madminer-workflow-ph:0.3.0 sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_3 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/pythia_33/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/delphes_33'
```
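For reference, the columns in these listings are the cluster ID, the memory usage in MB, and the job arguments (the same ClusterId / MemoryUsage / Args layout labelled later in the thread). Assuming the REANA jobs are dispatched through HTCondor, which these attribute names suggest but the report does not state, a comparable listing for finished jobs could be pulled with something like:

```shell
# Hypothetical query; assumes the workflow jobs are dispatched through HTCondor.
# -af ("autoformat") prints the requested ClassAd attributes space-separated,
# and the constraint restricts the output to completed jobs (JobStatus == 4).
condor_history -constraint 'JobStatus == 4' -af ClusterId MemoryUsage Args
```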



This puts a limit on how far the madminer workflow can be scaled up, since that factor is bounded by the memory available per worker in the cluster we submit to.

Is this behaviour understood, or does it need some investigation?
Does it require any fix?
Sinclert commented 3 years ago

Hi @khurtado ,

Thanks for the debugging efforts 😄

In principle, that seems like an undesired behaviour. I say "in principle" because I do not fully understand how Pythia and Delphes make use of the scaling factor (i.e. the number of jobs the workflow needs * N) internally.

In theory, each parallel job computes a set of values for a given benchmark (sm, w, ...), so it makes sense to compute them in parallel. In this scenario, increasing the number of jobs so that there is more than one job per benchmark is a way to parallelize the computation of every single benchmark on its own. I am unsure whether Pythia / Delphes are prepared to handle this, and if so, how it is done.


If you could confirm that this internal parallelization of benchmark-based computed values makes sense, and that it is done correctly, then we could start debugging the memory consumption of each job.

As an initial hint, I always found this particular code snippet a bit funny. Bear in mind it is a rewrite from its older version, which generated a similar list. Maybe @irinaespejo knows where this code snippet comes from.
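For readers without the linked snippet at hand, the list being discussed roughly maps each of the num_generation_jobs jobs to one of the benchmarks, so several jobs can end up sampling the same benchmark. A toy sketch of that kind of assignment (hypothetical variable names, not the actual snippet; the benchmark names are the ones that appear later in the generate logs):

```shell
# Toy sketch only: spread num_generation_jobs jobs evenly over the benchmarks,
# so more than one job can work on the same benchmark in parallel.
benchmarks=(
    sm w
    morphing_basis_vector_2 morphing_basis_vector_3
    morphing_basis_vector_4 morphing_basis_vector_5
)
num_generation_jobs=60

for ((i = 0; i < num_generation_jobs; i++)); do
    benchmark=${benchmarks[$((i % ${#benchmarks[@]}))]}
    echo "Job ${i} -> benchmark ${benchmark}"
done
```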

irinaespejo commented 3 years ago

Hi @khurtado,

Thanks for the update. I think @Sinclert's intuition is right. We need to investigate how to parallelize the jobs that share the same benchmark within a Pythia+Delphes step (so 6 calls) instead of calling Pythia+Delphes 6 * n_jobs times. I'm looking into the snippet. Luckily, Delphes is on GitHub (delphes/delphes) and so is Pythia (alisw/pythia8), so we can ask the developer teams.

irinaespejo commented 3 years ago

Hi @khurtado, is there a way we can access the cluster you are using, for debugging purposes? Thank you!

khurtado commented 3 years ago

@irinaespejo Yes, let's discuss via Slack

irinaespejo commented 3 years ago

Hi all,

@Sinclert and I discussed a solution offline and I'll write it here for the record:

Proposed solution

The problem behind this issue is that the madminer-workflow, particularly the Pythia and Delphes steps, does not scale well. Right now, we control the number of jobs via an external parameter called num_generation_jobs (here), i.e. the number of arrows (or jobs) leaving the generate step in the current architecture is num_generation_jobs. Each arrow leaving the generate step performs computations according to the distribution of the benchmarks, which is controlled by this snippet. _This means a Pythia and a Delphes instance is called num_generation_jobs times, which could be a cause of the bad scalability._

Instead, we propose a subtle change in the architecture of the workflow. The number of arrows (jobs) leaving the generate step will be num_benchmarks and not num_generation_jobs. Each arrow will then pass num_jobs to the Pythia and Delphes step. We hope that Delphes and Pythia will know how to internally parallelize a big chunk of jobs. Maybe @khurtado can comment on this Delphes/Pythia internal parallelization.

num_benchmarks depends on the user-specified benchmarks here and on the morphing max_overall_power.

Changes to make:

(please do not hesitate to update the to-do list in the comments below)

Unresolved questions about the proposed solution

khurtado commented 3 years ago

This makes sense and sounds good to me!
I don't know much about the internal parallelization details on Delphes/Pythia unfortunately, so I can't comment on that.

Please, let me know once changes are done and I would be happy to test (or if I can help with anything besides testing).

Sinclert commented 3 years ago

After a bit of research, it seems that MadGraph (the pseudo-engine used to run Pythia and Delphes) has an optional argument called run_mode (MadGraph forum comment).

This could be used to specify whether MadGraph runs on a single machine, on a cluster, or in multi-core mode.

Sadly, I could not find an official reference to this argument, so I am not sure whether the accepted values have changed in modern versions of MadGraph (2.9.X and 3.X.X). In any case, this would be the "last piece" to migrate.
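For context, the relevant knobs live in the me5_configuration.txt card. An illustrative excerpt is below (not a verbatim copy of any particular MadGraph version; the run_mode values follow the LO template README linked a few comments down):

```
# Running mode: 0 = single machine, 1 = cluster, 2 = multi-core
run_mode = 2

# Number of cores used in multi-core mode; None lets MadGraph auto-detect
nb_core = None
```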

irinaespejo commented 3 years ago

Wow, that's interesting. Maybe just setting run_mode=2 within the current architecture is enough to scale. I'll try it and get back to you tomorrow.

khurtado commented 3 years ago

@Sinclert the options for run_mode seem to be the same in modern versions of MadGraph:

https://bazaar.launchpad.net/~madteam/mg5amcnlo/3.x/view/head:/Template/LO/README#L80

Sinclert commented 3 years ago

@khurtado @irinaespejo

I have created a new branch, mg_process_parallelization, to implement the changes we discussed. In principle, the Docker image coming from that branch (madminer-workflow-ph:0.5.0-test) should be able to parallelize the MadGraph steps of each benchmark.

In a nutshell:

Bear in mind that the num_generation_jobs workflow-level parameter has not been removed, but it is currently unused, as we are setting the number of parallel processes per benchmark to the maximum possible (using nb_core=None).

Let me know if fine-tuning the number of processes per benchmark is something of interest.
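If fine-tuning does become interesting, it would presumably amount to writing a concrete value instead of None into the cards. A minimal sketch, reusing the card path and substitution style that appear later in this thread (requested_cores is a hypothetical variable):

```shell
# Hypothetical: cap the per-benchmark MadGraph parallelism at a fixed core count
# instead of letting nb_core = None auto-detect every available core.
requested_cores=4
sed -i \
    -e "s/nb_core = None/nb_core = ${requested_cores}/" \
    "${SIGNAL_ABS_PATH}/madminer/cards/me5_configuration_${i}.txt"
```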


Please run the sub-workflow with the new Docker image (0.5.0-test) and compare the results with the old one (0.4.0).
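Assuming the usual local entry point (the same steps described further down in the thread: swap the image tag in steps.yml, then re-run the workflow), that comparison could look roughly like:

```shell
# Hypothetical local check: confirm the tag swap, then run the sub-workflow.
grep -R "madminer-workflow-ph:0.5.0-test" steps.yml
make yadage-run
```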

irinaespejo commented 3 years ago

@Sinclert wow, nice. I was also working on this without success. Regarding point 3:

> A me5_configuration.txt file has been added to the set of cards, with options: run_mode=2 (to run in multi-core mode) and nb_core=None (to assign as many processes as cores detected).

When I uncommented # run_mode=2 and ran the workflow with make yadage-run, I saw that cards still carrying the commented-out # run_mode=2 line were being created in the generate step and transmitted to the pythia step.

I think the easiest way to check whether we are really running with run_mode=2 is for @khurtado to run the mg-process-parallelization branch on the VC3 cluster and let us know if the scalability issue is solved. @khurtado, let us know right away if you run into trouble. Thank you!!

irinaespejo commented 3 years ago

Actually, since I have access to the cluster, I'm going to run the mg-process-parallelization branch of the workflow now.

irinaespejo commented 3 years ago

Hi everyone,

The results from running scailfin/madminer-workflow-ph (mg-process-parallelization) on VC3:

Sanity checks:

  1. The workflow finishes successfully (the reported status is still "running", but all the files of the steps are there)

  2. The physics workflow indeed uses the branch code


Other checks:

  1. The pythia stage preserves the run_mode = 2 change introduced in the Docker image here.

The command grep -R "run_mode = 2" indeed shows that:

```
./pythia_3/mg_processes/signal/Cards/me5_configuration.txt:run_mode = 2
./pythia_3/mg_processes/signal/madminer/cards/me5_configuration_0.txt:run_mode = 2
./delphes_3/extract/madminer/cards/me5_configuration_3.txt:run_mode = 2
```

(and the same for all the other pythia and delphes steps). All good!

Now, on to scalability tests? Answering Sinclert: yes, we are interested in fine-tuning the number of processes per benchmark.

irinaespejo commented 3 years ago

Memory usage results from running the mg_process_parallelization branch:

Example of Delphes:

```
ClusterId  MemoryUsage  Args
217        318          /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2 madminertool/madminer-workflow-ph:0.5.0-test sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_4 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/pythia_4/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/delphes_4'
```

Example of Pythia:

```
ClusterId  MemoryUsage  Args
210        196          /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2 madminertool/madminer-workflow-ph:0.5.0-test sh -c '/madminer/scripts/3_pythia.sh -p /madminer -m software/MG5_aMC_v2_9_4 -z /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/generate/folder_0.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/pythia_3'
```

irinaespejo commented 3 years ago

Hi @Sinclert, I've been testing the mg-process-parallelization branch of scailfin/madminer-workflow-ph. When running make yadage-run I found the following error.

In the file .yadage/workflow_ph/generate/_packtivity/generate.run.log there is:

```
2021-09-07 09:10:38,583 | pack.generate.run | INFO | starting file logging for topic: run
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 0 sm'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 1 w'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 2 morphing_basis_vector_2'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 3 morphing_basis_vector_3'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 4 morphing_basis_vector_4'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 5 morphing_basis_vector_5'
2021-09-07 09:11:03,610 | pack.generate.run | INFO | b"sed: can't read s/nb_core = None/nb_core = 1/: No such file or directory"
```

This was solved by doing the following changes:

  1. remove the "" from this line in scripts/2_generate.sh; the new line should look like sed -i \
  2. re-build the madminer-workflow-ph image, in this case I named it madminertool/madminer-workflow-ph:0.5.0-test-2
  3. update the image tag in steps.yml here
  4. make yadage-run

The workflow finishes successfully now without any further errors.

Sinclert commented 3 years ago

Hi @irinaespejo ,

I included the "" because of macOS compatibility. I thought it was a quick fix to make the script runnable on both macOS and Linux. It seems it did not work.
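For the record, my reading of why the quick fix broke on Linux (an explanation on my part, not from the thread): GNU sed attaches the optional backup suffix directly to -i, so the extra "" was parsed as an empty sed script and the real s/// expression as an input file, which is exactly the "sed: can't read s/.../..." error in the generate log above. BSD/macOS sed, by contrast, expects a separate (possibly empty) suffix argument. An illustrative comparison of the two dialects:

```shell
# GNU sed (Linux): in-place edit with no backup file.
sed -i -e "s/old/new/" file.txt

# BSD sed (macOS): -i requires a separate backup suffix; '' means "no backup".
sed -i '' -e "s/old/new/" file.txt
```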

According to this StackOverflow post, we could achieve this by using the -e flag instead. Could you try the following snippet and confirm that it runs on Linux?

```shell
sed -i \
    -e "s/${default_spec}/${custom_spec}/" \
    "${SIGNAL_ABS_PATH}/madminer/cards/me5_configuration_${i}.txt"
```
irinaespejo commented 3 years ago

I just tested the snippet you posted and it runs successfully :heavy_check_mark: (my upload internet connection is pretty slow)

Sinclert commented 3 years ago

The PR changing the parallelization strategy (https://github.com/scailfin/madminer-workflow-ph/pull/11) has been merged.

We should be in a better spot to test the total time + memory consumption of each benchmark job.

Sinclert commented 3 years ago

Hi @khurtado and @irinaespejo,

Is there anything else to discuss within this issue? Have you tried the latest version of the workflow?

irinaespejo commented 3 years ago

The last version of the workflow ran successfully after Kenyi did some fixing of the cluster permissions. @khurtado, how is the situation on the cluster for submitting computationally intensive workflows? Can we just try? Thanks!!

khurtado commented 3 years ago

@irinaespejo Yes, the cluster should have workers to work with. I still need to fix the website certs; I will do that tomorrow.

Sinclert commented 3 years ago

Hi. I am closing this issue for now.

For future reporting of performance issues / configuration tweaks / etc, please, open a separate issue.