igorcantele / Internship


Saving data and "offline analysis" #11

Closed · Helveg closed this 3 years ago

Helveg commented 3 years ago

Unless things changed recently, your code runs the simulation, then analyzes it and discards the data. This is called "online analysis" and is good for high throughput in mature pipelines, but when simulations are expensive to run and you are still experimenting with the analysis, you are better off saving the data from the simulations and then iteratively improving your analysis on that dataset. This is called "offline analysis": you run the simulations once (7 hours), store all your data, and then run the analysis scripts as often as you want, in only a few seconds each time. If the analysis fails, you improve the scripts and try again!

So, could you try to restructure your codebase to separate the simulation from the analysis? You can store data using pickle or np.savetxt! I think pickling gives you more freedom to experiment. So try to set up 2 scripts like this:

# save_data.py -- run once to produce the dataset
import pickle

with open("save.pkl", "wb") as f:
  data = {"hello": "I", "am": "Igor"}
  pickle.dump(data, f)

# load_data.py -- run as often as you like to analyze the dataset
import pickle

with open("save.pkl", "rb") as f:
  data = pickle.load(f)
  print(data)
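Note that the files are opened in "wb"/"rb" (write/read binary) mode, because pickle files are binary, not text.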

Then take a look at how modules work in Python, and try to refactor your simulation code into an importable module so that you can do something like this:

import pickle
from my_simulations import run_simulation

# run the expensive simulation once, then store everything it returns
with open("save.pkl", "wb") as f:
  data = run_simulation()
  pickle.dump(data, f)
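For reference, a minimal sketch of what my_simulations.py could look like (run_simulation and its return value here are just placeholders, your actual simulation code goes inside):

# my_simulations.py -- minimal sketch, the body is a placeholder
def run_simulation():
  # ... set up and run the actual simulation here ...
  results = {"spike_times": [0.1, 0.5, 1.2], "duration": 2.0}
  return results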
igorcantele commented 3 years ago

I tried to use np.savetxt, but I ran into problems when saving arrays with different lengths into a single file, so I temporarily removed it from the script.

Helveg commented 3 years ago

Use pickle, please :) You can store an arbitrary data structure, e.g.:

import pickle
from my_simulations import run_simulation

with open("save.pkl", "wb") as f:
  # run 100 simulations and store the whole list of results in one file
  _100_results = [run_simulation() for _ in range(100)]
  pickle.dump(_100_results, f)
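Your analysis script then just loads that list back, and rerunning it takes seconds instead of hours, e.g. (the print is a placeholder for your actual analysis):

import pickle

with open("save.pkl", "rb") as f:
  _100_results = pickle.load(f)

# iterate on the analysis as much as you want, the simulations are already done
for result in _100_results:
  print(result)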