materialsintelligence / propnet

A knowledge graph for Materials Science.

(To discuss) How to handle array data (e.g. site props/co-ordination numbers) #48

Open mkhorton opened 6 years ago

mkhorton commented 6 years ago

Have thought about this some more ... the question can be summarized as, for example, 'how do I store x-y data, e.g. a measurement against temperature?'

I think the easy answer here is that symbols (which are containers for unit/value/symbol definitions) can store lists rather than just single values, which is quite a minimal change. Whether we use a dataframe to store these lists or just use plain Python lists, I don't know. I imagine most of these lists will be very small, so I'd be wary of over-engineering it at this stage, but dataframes are the obvious answer down the road if we find ourselves with larger data sets.

Will prototype this change to make it clearer what I'm talking about + for feedback.
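
To give a feel for the 'lists rather than single values' idea, here is a minimal hedged sketch (only pint is assumed; the symbol names and numbers are made up and this is not the current propnet Symbol API):

```python
# Minimal sketch: the same kind of container holding a single value vs. a
# list of values. pint happily wraps a sequence, so the change could be small.
from pint import UnitRegistry

ureg = UnitRegistry()

# single value, as things work today
band_gap = 1.12 * ureg.eV

# list-valued: the same property measured at several temperatures
temperatures = [100, 200, 300] * ureg.kelvin
band_gaps = [1.17, 1.14, 1.12] * ureg.eV
```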

dmrdjenovich commented 6 years ago

One way to do this currently I think would be to make a custom model that inputs temperature and outputs the property. Perhaps it would create a cubic spline over the data points or similar.
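
For concreteness, a minimal sketch of that 'custom model' idea (the data values and names are made up; only numpy and scipy are assumed, not any propnet machinery):

```python
# Sketch: a hand-rolled 'model' that takes temperature as input and returns
# an interpolated property, built from a cubic spline over measured points.
import numpy as np
from scipy.interpolate import CubicSpline

# measured (temperature, band gap) pairs -- hypothetical numbers
temps = np.array([100.0, 200.0, 300.0, 400.0])   # K
band_gaps = np.array([1.17, 1.15, 1.12, 1.09])   # eV

band_gap_vs_temp = CubicSpline(temps, band_gaps)

# 'evaluate' the model at an arbitrary temperature
print(band_gap_vs_temp(250.0))
```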

I think what I'm confused about is how x-y data differs from a model, i.e. from a graph of symbols that, in a sense, forms an implicit table of properties?

But I'm definitely open to hearing what you have in mind. Being able to store more general data -- i.e. the sites dictionary / coordination numbers -- would be useful! It will also require custom Model.evaluate() methods to handle each different type of information, but I think it's doable as you described.

mkhorton commented 6 years ago

Yeah, agreed that this is how things are currently set up. It was the following questions that made me think about this:

  1. How do you input a property that depends on another property? e.g. lattice parameter or band gap at finite temperature

The idea here was that we would have 'dependent properties' or 'associated properties' or whatever we want to call them. Whatever we do, the point is that we need an object that can store multiple values.

  2. What if we have a big series of this data, e.g. many lattice parameters at many band gaps? Do we then have dozens and dozens of symbols? This is fine, though a little heavy.

  3. What if we have a model that wants to accept a series of identical data points like this?

  4. How about data (like co-ordination number, say) that necessarily has multiple data points to even be meaningful, but where the number of data points varies depending on the crystallographic structure?

  5. Maybe 'datum' or 'datapoint' or 'data' might be a better name for Symbol with this in mind. Maybe it could then store, essentially, tabular data -- either a single value as it currently works, or a row of values (e.g. one row is one lattice parameter and one temperature), or a series of rows (see the sketch at the end of this comment).

I'm not proposing any major change to the current machinery, just trying to think about how we're going to tackle these more complex cases while keeping the simple ones simple.
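
A rough sketch of the three shapes described in point 5 (everything below is hypothetical -- plain dicts and pandas standing in, not the current Symbol API):

```python
# Illustration only: a 'datum' as (a) a single value, (b) one row of
# associated values, or (c) a series of rows, i.e. tabular data.
import pandas as pd

# (a) a single value, as Symbol works today
single = {"lattice_parameter": 5.43}

# (b) one row of associated values: one lattice parameter at one temperature
row = {"lattice_parameter": 5.43, "temperature": 300}

# (c) a series of rows
table = pd.DataFrame([
    {"lattice_parameter": 5.43, "temperature": 100},
    {"lattice_parameter": 5.45, "temperature": 300},
    {"lattice_parameter": 5.47, "temperature": 500},
])
```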

dmrdjenovich commented 6 years ago

I agree with you.

It will take a bit of effort to deal with different data types in the Symbol.values field -- it will be tedious but the changes should be small.

Right now the way evaluation works, if the type is unexpected I'd bet $100 we'll crash. So more work will definitely be needed to sufficiently generalize things.

dmrdjenovich commented 6 years ago

I took a second to look over evaluate -- so it will probably not have any problems if we change the type of Symbol.value. It only attempts to unwrap if the input is a pint.Quantity object.

The area I'd be worried about is what the Sympy interaction will look like for more complex data structures when they're passed in to the model.plug_in() method.

Anyways, I don't see things being too difficult to generalize.
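
One hedged data point on the Sympy question: plain sympy expressions can already be evaluated over arrays by lambdifying them with a numpy backend, which is one way array-valued inputs could flow through equation-based models (the thermal-expansion formula below is made up, and this is plain sympy/numpy, not the actual plug_in() code):

```python
# Sketch: evaluating a sympy expression over an array of temperatures.
import numpy as np
import sympy as sp

T, a0, alpha = sp.symbols("T a0 alpha")
expr = a0 * (1 + alpha * T)              # toy linear thermal expansion

f = sp.lambdify((T, a0, alpha), expr, "numpy")

temps = np.array([100.0, 300.0, 500.0])
print(f(temps, 5.43, 1e-5))              # one lattice parameter per temperature
```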

mkhorton commented 5 years ago

@montoyjh's comments:

Measurement (or simulation) conditions might be placed into the provenance, if you're just looking for a reasonable place to carry them around, but given that they'll have somewhat different functionality I think it might be better to put them in a different place. I'd be a bit careful with this, though: I think whatever you choose for the implementation will have non-trivial consequences for graph evaluation, etc. Thermochemistry is a big part of the materials science model space, though, and this seems like a necessary step towards realizing knowledge of that space in propnet.

Also, I don't know if this is the right place to mention this, but in terms of having multiple conditions that correspond to the same material, I think I'm more in favor of doing something involving arrays, since they'll presumably be compatible with a lot of EquationModel logic.

To clarify my data frame comment: I was thinking of a pandas DataFrame where the column headers are symbol names and each cell stores an actual Quantity object, so each row defines a list of associated Quantities (e.g. a condition) and is wholly independent of every other row.
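
A hedged sketch of that layout (pint stands in for propnet's Quantity here; the symbol names and values are made up):

```python
# Columns are symbol names, cells hold Quantity objects, and each row is one
# independent set of associated values/conditions.
import pandas as pd
from pint import UnitRegistry

ureg = UnitRegistry()

conditions = pd.DataFrame([
    {"temperature": 100 * ureg.kelvin, "lattice_parameter": 5.43 * ureg.angstrom},
    {"temperature": 300 * ureg.kelvin, "lattice_parameter": 5.45 * ureg.angstrom},
    {"temperature": 500 * ureg.kelvin, "lattice_parameter": 5.47 * ureg.angstrom},
])

# each row can be consumed independently, e.g. handed to a model one at a time
for _, row in conditions.iterrows():
    print(row["temperature"], row["lattice_parameter"])
```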

Keeping it in provenance might also make (more) sense, depending on the provenance structure.

Honestly, I don't have a strong preference as long as the end-user API is clean and it scales to e.g. a thousand conditions at a time, so we can e.g. build quick graphs in the web app. The number of models with conditions is going to be >> the number of models without conditions, so this really needs to get implemented.