thaler-lab / EnergyFlow

Python package for the EnergyFlow suite of tools.
https://energyflow.network

Loading MOD datasets results in unsustainable memory consumption #11

Closed mattleblanc closed 4 years ago

mattleblanc commented 5 years ago

Hi,

Attempting to download the MOD datasets using the built-in ef.datasets.mod.load function (as below) results in excess memory consumption, which kills my process on the machine I'm currently using.

I could grab the files myself from Zenodo, but it seems possible that this is unintended behaviour, so I am reporting it here.

🍻 MLB

ef.datasets.mod.load( amount=1.0,
                      cache_dir='/faxbox2/user/mleblanc/energyflow',
                      collection='CMS2011AJets',
                      dataset='sim',
                      subdatasets=None,
                      validate_files=False,
                      store_pfcs=True,
                      store_gens=True,
                      verbose=1)

ef.datasets.mod.load( amount=1.0,
                      cache_dir='/faxbox2/user/mleblanc/energyflow',
                      collection='CMS2011AJets',
                      dataset='gen',
                      subdatasets=None,
                      validate_files=False,
                      store_pfcs=True,
                      store_gens=True,
                      verbose=1)

ef.datasets.mod.load( amount=1.0,
                      cache_dir='/faxbox2/user/mleblanc/energyflow',
                      collection='CMS2011AJets',
                      dataset='cms',
                      subdatasets=None,
                      validate_files=False,
                      store_pfcs=True,
                      store_gens=True,
                      verbose=1)

pkomiske commented 5 years ago

Hi Matt,

What do you mean by “excess” memory consumption? It looks like you’re trying to grab all of the sim and gen files, which are something like 120 GB compressed and even more uncompressed. I’ve made an effort to avoid duplicating large arrays in the code (0.13.1 improves on 0.13.0 here), but Python does not make memory management easy. Let me know if you think there are specific ways that the code is using memory poorly.
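One way to put a number on "excess" is to record the peak memory of a single call. The sketch below uses the standard-library `tracemalloc` module; the helper name `peak_python_memory_mb` is hypothetical and not part of energyflow. It only sees allocations that go through Python's allocator hooks, so it may under-report memory held by native code.

```python
import tracemalloc


def peak_python_memory_mb(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, peak traced memory in MB).

    Hypothetical helper for illustration; it is not part of energyflow.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / 1e6


if __name__ == "__main__":
    # Stand-in workload: allocate roughly 10 MB.
    _, peak_mb = peak_python_memory_mb(bytearray, 10_000_000)
    print(f"peak traced memory: {peak_mb:.1f} MB")
```

The same wrapper could in principle be placed around one `ef.datasets.mod.load` call with a small `amount` to see how memory grows with the number of files, though that has not been tried here.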

If you just want to download the files using the built-in energyflow functionality, setting store_pfcs and store_gens to False should allow that to happen without using too much memory.
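As a minimal sketch of that suggestion, the calls from the original report can be repeated with `store_pfcs` and `store_gens` set to `False`. The helper names `download_only_kwargs` and `download_all` are hypothetical; the keyword arguments are the ones used in the report above, and the cache path is a placeholder.

```python
def download_only_kwargs(cache_dir, dataset, collection="CMS2011AJets"):
    """Build keyword arguments for a download-only call (hypothetical helper)."""
    return dict(
        amount=1.0,
        cache_dir=cache_dir,
        collection=collection,
        dataset=dataset,
        subdatasets=None,
        validate_files=False,
        store_pfcs=False,  # do not keep the particle arrays in memory
        store_gens=False,  # do not keep the generator-level arrays in memory
        verbose=1,
    )


def download_all(cache_dir, datasets=("sim", "gen", "cms")):
    """Fetch each dataset into cache_dir without storing the large arrays."""
    import energyflow as ef  # imported lazily so the helper above has no dependency

    for name in datasets:
        # The return value is deliberately not kept, so nothing large stays referenced.
        ef.datasets.mod.load(**download_only_kwargs(cache_dir, name))


if __name__ == "__main__":
    # Placeholder path; print the arguments without triggering a download.
    print(download_only_kwargs("/path/to/cache", "sim"))
```

Calling `download_all("/path/to/cache")` would then populate the cache. Whether the memory use stays acceptable on a given machine is an assumption to check, not something this sketch guarantees.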

Hope this helps.

Patrick