pyiron / pyiron_atomistics

pyiron_atomistics - an integrated development environment (IDE) for atomistic simulation in computational materials science.
https://pyiron-atomistics.readthedocs.io
BSD 3-Clause "New" or "Revised" License
43 stars 15 forks source link

Pyiron atomistics table funct.py questions and usage #844

Open ligerzero-ai opened 1 year ago

ligerzero-ai commented 1 year ago

https://github.com/pyiron/pyiron_atomistics/blob/master/pyiron_atomistics/table/funct.py

I'm wondering why these functions all return a dictionary key-pair? What is the intended form of usage for these functions?

e.g. when I am creating a table, I would use functions that return directly a value/string/list.

pr2 = Project( "%s/Ni-PureGBs-noGammaShift" % os.getcwd())

GB_nogamma_table = pr2.create_table()

def db_filter_function(job_table):
    return (job_table.status == "finished")

GB_nogamma_table.db_filter_function = db_filter_function

GB_nogamma_table.add["job_name"] = lambda x: x.job_name
GB_nogamma_table.add["total_energy"] = lambda x: x["output/generic/energy_tot"][-1]
GB_nogamma_table.add["forces"] = lambda x: x['output/generic/forces'][-1]
GB_nogamma_table.add["stresses"] = lambda x: x['output/generic/stresses'][-1]

GB_nogamma_table.run(delete_existing_job = True)
df_no_shift = GB_nogamma_table.get_dataframe()

Using the inbuilt functions:

import pyiron_atomistics.table.datamining as pi_table_funcs
pr2 = Project( "%s/Ni-PureGBs-noGammaShift" % os.getcwd())

GB_nogamma_table = pr2.create_table()

def db_filter_function(job_table):
    return (job_table.status == "finished")

GB_nogamma_table.db_filter_function = db_filter_function

GB_nogamma_table.add["job_name"] = lambda x: x.job_name
GB_nogamma_table.add["total_energy"] = pi_table_funcs.get_energy_tot
GB_nogamma_table.add["forces"] = pi_table_funcs.get_forces
GB_nogamma_table.add["stresses"] = pi_table_funcs.get_stresses

GB_nogamma_table.run(delete_existing_job = True)
df_no_shift = GB_nogamma_table.get_dataframe()

df_no_shift now has dictionaries in every column with key-pair values. I would think that this is not the intended use case. What exactly is the intended use case for this? Are we supposed to be using these functions to build dictionaries with key-pair values and just build a dataframe from the concatenated dictionary?

pd.DataFrame(concatenated_dict_list)

pmrv commented 1 year ago

It's undocumented and highly non-intuitive, but you're not actually supposed to add the pre-defined function manually back into the table. Instead you should do

tab.add.get_energy_pot
tab.add.get_elements
...

just accessing the elements adds the functions to the table. This is not written anywhere and doesn't make sense to the uninitiated. It's knowledge that can only be gleaned by studying the S'urce by oneself. Now having added the functions this way, them returning dictionaries instead of straight values causes the table to create multiple columns from one function (one for each key in the dictionary). If you look at the code doing tab.add['foo'] = bar will actually add a function lambda j: {'foo': bar(j)} into the table, inline with this mechanism above.

There's a lot I think that should change, but that I never got around to:

  1. Do away with the access also adds syntax, do tab.add.energy_pot() or something to do the addition
  2. Allow custom functions to add multiple columns at once, syntax needs some discussion, maybe table.add['foo', 'bar']?
  3. It'd be cool do define columns calculated from other columns. You can always do afterwards df['new'] = df.old.map(function), but if I could do it in one place it'd make me a bit happier.
ligerzero-ai commented 1 year ago

Okay, I'll leave this issue open as a reminder to look at it at some point to finish the implementation.

I think that it feels better/more elegant to call:

pr.create_table(properties = [job_name, energy_pot, energy_tot, ...])

And the docstring contains an example/usage pointer to: pr.create_table.prop_list which returns a list of strings that can be added to the list to generate the table.

So by default we can have pr.create_table(properties = self.prop_list).

rather than having a whole bunch of tab.add lines which is the current unfinished implementation.

What do you think?