madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Question for Olivier: does run.sh support more than one CPU core? Can it? Should it? #1001

Open valassi opened 6 days ago

valassi commented 6 days ago

Hi @oliviermattelaer just a question for you. This is from bits and pieces of skype with @choij1589 @Saptaparna @roiser

I am trying to understand if run.sh uses more than one core. This does not seem to be the case. The gridpack I have (slightly modified with profiling, but still) has

    def launch(self, nb_event, seed):
        """ launch the generation for the grid """
        print("__CUDACPP_DEBUG: GridPackCmd.launch starting")
        cudacpp_start = time.perf_counter()
        # 1) Restore the default data
        print("__CUDACPP_DEBUG: GridPackCmd.launch (1) restore_data")
        logger.info('generate %s events' % nb_event)
        logger.info('nb_core = %s' % self.options['nb_core'])
        self.set_run_name('GridRun_%s' % seed)
        if not self.readonly:
            self.update_status('restoring default data', level=None)
            misc.call([pjoin(self.me_dir,'bin','internal','restore_data'),
                         'default'], cwd=self.me_dir)

        if self.run_card['python_seed'] == -2:
            import random
            if not hasattr(random, 'mg_seedset'):
                random.seed(seed)  
                random.mg_seedset = seed
        elif self.run_card['python_seed'] > 0:
            import random
            if not hasattr(random, 'mg_seedset'):
                random.seed(self.run_card['python_seed'])  
                random.mg_seedset = self.run_card['python_seed']         
        # 2) Run the refine for the grid
        print("__CUDACPP_DEBUG: GridPackCmd.launch (2) refine4grid")
        self.update_status('Generating Events', level=None)
        #misc.call([pjoin(self.me_dir,'bin','refine4grid'),
        #                str(nb_event), '0', 'Madevent','1','GridRun_%s' % seed],
        #                cwd=self.me_dir)
        self.refine4grid(nb_event)
...
    def refine4grid(self, nb_event):
        """Special refine for gridpack run."""
        print("__CUDACPP_DEBUG: GridPackCmd.refine4grid starting")
        cudacpp_start = time.perf_counter()
        self.nb_refine += 1

        precision = nb_event

        self.opts = dict([(key,value[1]) for (key,value) in \
                          self._survey_options.items()])

        # initialize / remove lhapdf mode
        # self.configure_directory() # All this has been done before
        self.cluster_mode = 0 # force single machine

In other words, I have the impression that self.cluster_mode = 0 hardcodes the use of only one core. Is that right?

Can you confirm (or otherwise clarify) please? Thanks

oliviermattelaer commented 6 days ago

Yes, gridpacks run on a single core by design (and the code is optimised based on that feature).

valassi commented 6 days ago

Thanks Olivier!

Replying to some of the discussion on skype (about my 'should it?' question): we could rethink this and submit multi-core generation.

IMO anyway it also depends on the experiments... 4 single-core jobs (even against a shared GPU) are in principle not less efficient than one 4-core job (provided you really launch 4 jobs). For CMS, I understand that they typically used to have GEN-SIM jobs, where the multicore SIM part was waiting on a single-core GEN part (I am not sure if this has since changed, maybe by splitting GEN and SIM), so there the inefficiency would in any case be that a multicore slot is used by what is known to be a single-core executable.
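
For illustration only, a minimal sketch of what "really launching 4 jobs" could look like: four independent single-core gridpack runs started in parallel, each with its own seed. The run.sh <nevents> <seed> interface follows the launch(nb_event, seed) signature quoted above; the directory names and the copy-per-job step are assumptions (a read-only gridpack, as discussed below, would avoid the copies).

    # Hedged sketch: 4 independent single-core gridpack jobs in parallel.
    # Assumes run.sh takes <nevents> <seed>, as in GridPackCmd.launch above;
    # the directory layout and copy-per-job step are illustrative assumptions.
    import shutil
    import subprocess

    GRIDPACK = "gridpack"        # hypothetical unpacked gridpack directory
    NB_JOBS, NB_EVENTS = 4, 10000

    procs = []
    for i in range(NB_JOBS):
        workdir = f"{GRIDPACK}_job{i}"
        shutil.copytree(GRIDPACK, workdir)   # one writable copy per process
        procs.append(subprocess.Popen(
            ["./run.sh", str(NB_EVENTS), str(1000 + i)],  # nevents, seed
            cwd=workdir))

    for p in procs:
        p.wait()  # each job leaves its own LHE output in its own workdir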

oliviermattelaer commented 6 days ago

Rethink the strategy? Yes, we should. Change the strategy? That's what we need to think about...

For the moment, CMS is using (afaik) a read-only gridpack, such that it launches (typically 8) gridpack executables in parallel within the same job allocation (which asks for 8 cores).

With our current work we have multiple ways to move forward from that situation:

  1. allow for an OpenMP execution of the gridpack, so that mode would still be one executable but using all available threads --no real code change needed here-- (see the sketch after this list)
  2. keep the code as is, and use the same readonly framework to hit the GPU multiple times --need some validation to check that readonly is working with GPU--
  3. change the algorithm to lift the single-executable requirement
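
As a rough, untested sketch of option 1: assuming the OpenMP-enabled cudacpp build of the gridpack honours OMP_NUM_THREADS, a single run.sh invocation could use all cores of the allocation with no code change. The 8-thread value mirrors the 8-core CMS allocation mentioned above; the directory name is illustrative.

    # Hedged sketch of option 1: one gridpack executable, all allocated cores
    # via OpenMP. Assumes the OpenMP-enabled cudacpp backend reads
    # OMP_NUM_THREADS; nothing else in the gridpack is changed.
    import os
    import subprocess

    env = dict(os.environ, OMP_NUM_THREADS="8")  # match an 8-core allocation
    subprocess.run(["./run.sh", "10000", "42"],  # <nevents> <seed>
                   cwd="gridpack", env=env, check=True)
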
roiser commented 6 days ago

I think what may be interesting to re-think is that e.g. in an HPC environment you get allocated N GPUs, and IIUC those are solely available to your job. One could then of course start N*M gridpacks within the same job (pilot) submission, but this may then become difficult for the data management afterwards ...
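
To make the N*M idea concrete, a hedged sketch (the counts, per-job directories, and any later merging step are assumptions, not a recommendation): N*M gridpack processes started within one allocation and pinned round-robin to the N GPUs via CUDA_VISIBLE_DEVICES. The data-management concern is precisely that each process leaves its own LHE output behind.

    # Hedged sketch: N*M gridpack processes sharing N GPUs inside one job.
    # Assumes each process runs from its own directory (or a readonly
    # gridpack) and that the GPU backend respects CUDA_VISIBLE_DEVICES.
    import os
    import subprocess

    N_GPUS, M_PER_GPU, NB_EVENTS = 4, 2, 10000

    procs = []
    for i in range(N_GPUS * M_PER_GPU):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % N_GPUS))  # round-robin pinning
        procs.append(subprocess.Popen(
            ["./run.sh", str(NB_EVENTS), str(2000 + i)],  # nevents, seed
            cwd=f"gridpack_job{i}", env=env))  # hypothetical per-job workdirs

    for p in procs:
        p.wait()  # N*M separate LHE outputs then need to be collected/merged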