Uniquely identifying derivation pathways/provenance for featurization

materialsintelligence / propnet

A knowledge graph for Materials Science.


Uniquely identifying derivation pathways/provenance for featurization #191

Open JosephMontoya-TRI opened 5 years ago

JosephMontoya-TRI commented 5 years ago

I have a keen interest in making a featurizer that uses propnet-derived features, but I'm not sure how to create an identifier for every Quantity that captures both its symbol and its evaluation pathway (which I'd want to keep separate to maximize my feature set). I think a provenance could probably be meaningfully hashed, but I'm not sure how to do it off the top of my head.

clegaspi commented 5 years ago

As of December or so, every propnet quantity is assigned a unique ID (a random UUID) when it is created. It was intended as a bookkeeping mechanism so that we wouldn't have to save the values of quantities in provenance trees, but could instead refer to the quantity object by ID.

These IDs may be sufficiently unique for your featurizer, although they alone do not hold information about provenance.

With the new PR, the hash value of a quantity will take provenance into account, although matching hashes do not guarantee that two quantities are equal, because the hash doesn't include the value.

montoyjh commented 5 years ago

Right, I could certainly distinguish among the quantities generated for a single material using that. What I'm saying is that I want to be able to identify distinct quantities that were derived in exactly the same way for a set of multiple materials, so I can use them as features corresponding to a dataset.

For example, I might get 50 Vickers hardness values per material with the standard MP dataset. If I want to use these as features, I'd like to be able to put them into columns that correspond to "identical" features, which in my mind means quantities that share the same derivation path.
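To make the column idea concrete, here is a minimal sketch in plain Python. It assumes each derived quantity already comes with some string key that identifies its derivation path; the `build_feature_table` helper, the record layout, and the example keys are all hypothetical and not part of propnet.

```python
from collections import defaultdict


def build_feature_table(records):
    """Pivot (material_id, pathway_key, value) records into aligned columns.

    `pathway_key` is a hypothetical identifier for symbol + derivation path.
    Materials that lack a given pathway get None in that column.
    """
    # One column per distinct pathway key, in a stable (sorted) order.
    columns = sorted({key for _, key, _ in records})

    # Collect the values for each material, indexed by pathway key.
    rows = defaultdict(dict)
    for material_id, key, value in records:
        rows[material_id][key] = value

    # Emit one row per material, aligned to the shared column order.
    table = {
        material_id: [values.get(column) for column in columns]
        for material_id, values in rows.items()
    }
    return columns, table
```

Each row then lines up with the same column order, so two hardness values derived through different pathways end up in different feature columns.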

clegaspi commented 5 years ago

Oh, I see what you're getting at. Hmm, yeah it's not immediately obvious to me how to do that either. I imagine you'd have to hash the whole model tree in some deterministic way.
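One way this could look, as a rough sketch only: recursively hash the symbol name, the model name, and the hashes of the inputs, leaving out the values and the random IDs. The `ProvNode` class below is a hypothetical stand-in for a provenance tree, not propnet's actual data structure, and sorting the child hashes assumes that the order of a model's inputs does not matter.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProvNode:
    """Hypothetical provenance node; not a propnet class."""
    symbol: str                          # symbol this quantity represents
    model: Optional[str] = None          # model that produced it (None for raw inputs)
    inputs: Tuple["ProvNode", ...] = ()  # provenance of the model's inputs


def pathway_hash(node: ProvNode) -> str:
    """Return a hex digest that depends only on symbols, models, and tree shape."""
    # Sort child digests so the result does not depend on input ordering
    # (an assumption; drop the sort if input order is meaningful).
    child_hashes = sorted(pathway_hash(child) for child in node.inputs)
    # JSON gives an unambiguous, deterministic encoding of the three parts.
    payload = json.dumps([node.symbol, node.model, child_hashes])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because values and per-object IDs never enter the digest, two materials whose quantity was derived through the same chain of models would get the same hash, while a different model anywhere in the tree would change it.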

JosephMontoya-TRI commented 5 years ago

Yeah, that's what I was thinking too. It might be an interesting idea to do that for other reasons as well. For example, graph evaluation might be really facile if you could "cache" the action of the graph for datasets that are isomorphic, which I think might be easier than doing the logic of graph evaluation every time.
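A toy sketch of the caching idea, under a strong simplification: two datasets count as "isomorphic" here if they provide the same set of input symbols. The `PlanCache` class, the model dictionary layout, and the forward-chaining planner are all hypothetical and do not reflect how propnet evaluates its graph.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


class PlanCache:
    """Memoize an evaluation plan per distinct set of available input symbols."""

    def __init__(self, models: Dict[str, Tuple[Set[str], str]]):
        # Hypothetical layout: model name -> (required input symbols, output symbol)
        self.models = models
        self._cache: Dict[FrozenSet[str], Tuple[str, ...]] = {}
        self.misses = 0  # number of times the planner actually ran

    def plan_for(self, available_symbols: Iterable[str]) -> Tuple[str, ...]:
        key = frozenset(available_symbols)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._build_plan(key)
        return self._cache[key]

    def _build_plan(self, available: FrozenSet[str]) -> Tuple[str, ...]:
        # Naive forward chaining: keep applying any model whose inputs are known.
        known = set(available)
        plan = []
        progress = True
        while progress:
            progress = False
            for name in sorted(self.models):
                inputs, output = self.models[name]
                if name not in plan and inputs <= known:
                    plan.append(name)
                    known.add(output)
                    progress = True
        return tuple(plan)
```

With this, the graph logic runs once per distinct set of starting symbols, and every later material with the same starting symbols reuses the stored plan.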

clegaspi commented 5 years ago

@dmrdjenovich Do you have any thoughts about this, since you were just working with tree traversal?