related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Ease NXOntology serialization #36

Closed yonromai closed 11 months ago

yonromai commented 11 months ago

Context: I'm currently working on serializing the trained model. Since the feature transformers have learnt state, they also need to be serialized along with the actual model. I'm currently making sure that all objects whose references are held by feature transformers can be serialized reasonably well. This includes NXOntology (a reference to which is currently held by PrepareNodeFeatures).
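For illustration, here is a minimal sketch (hypothetical class and feature names, not the actual nxontology-ml code) of why pickling a fitted feature pipeline drags the whole NXOntology along with it:

```python
from nxontology import NXOntology


class PrepareNodeFeaturesSketch:
    """Hypothetical stand-in for PrepareNodeFeatures: it holds an NXOntology
    reference, so pickling the fitted pipeline also pickles the whole graph."""

    def __init__(self, nxo: NXOntology[str]) -> None:
        self.nxo = nxo

    def transform(self, nodes: list[str]) -> list[dict[str, int]]:
        # Illustrative features only; the real transformer derives more from node_info.
        return [
            {
                "n_ancestors": self.nxo.node_info(node).n_ancestors,
                "n_descendants": self.nxo.node_info(node).n_descendants,
            }
            for node in nodes
        ]


# Toy ontology with two edges, just to make the sketch runnable.
nxo: NXOntology[str] = NXOntology()
nxo.graph.add_edges_from([("disease", "cancer"), ("cancer", "melanoma")])
transformer = PrepareNodeFeaturesSketch(nxo)
print(transformer.transform(["melanoma"]))
# pickle.dumps(transformer) would also serialize nxo (assuming NXOntology pickles
# cleanly), so the payload grows with the size of the graph.
```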

@dhimmel I'd like your input on 2 aspects:

  1. Do we want to serialize the exact version of the ontology when we freeze the feature pipeline?
    • Cons:
      • Would make it hard to predict on a version of the ontology different from the one the model was trained on
    • Pros:
      • Some properties derived from the graph state need to be serialized anyway (e.g. PCA weights after "fitting")
      • More robust (e.g. avoids "schema" change issues across ontology versions, training/prediction data drift, ...)
      • The node_info call becomes part of the feature pipeline, i.e. coupled to it
  2. If we do want to serialize the version of the ontology as part of the feature pipeline, are you cool with the approach below? (Gist: instead of saving the whole graph, assume the graph is backed by an immutable URL and only save the URL.)

A quick note: the PR doesn't make use of HostedNXOntology yet, but the idea is to ultimately replace the usage of NXOntology[str] with HostedNXOntology wherever it's serialized.
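To make the gist above concrete, here is a rough sketch of the approach (class name and loading mechanism are hypothetical; the actual PR may differ): only the URL gets pickled, and the graph is re-fetched lazily on demand.

```python
from __future__ import annotations

import pickle
from typing import Any

from nxontology import NXOntology


def load_nxontology_from_url(url: str) -> NXOntology[str]:
    """Hypothetical loader; the real implementation depends on the export
    format published by nxontology-data (e.g. a node-link JSON file)."""
    raise NotImplementedError


class HostedNXOntologySketch:
    """Assumes the graph is backed by an immutable URL, so serializing the
    URL alone is enough to reconstruct the exact ontology version later."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._nxo: NXOntology[str] | None = None

    @property
    def nxo(self) -> NXOntology[str]:
        # Fetch the graph lazily and keep it cached in memory.
        if self._nxo is None:
            self._nxo = load_nxontology_from_url(self.url)
        return self._nxo

    def __getstate__(self) -> dict[str, Any]:
        # Drop the in-memory graph; the URL is enough to rebuild the object.
        return {"url": self.url}

    def __setstate__(self, state: dict[str, Any]) -> None:
        self.url = state["url"]
        self._nxo = None


# The pickled payload stays tiny regardless of how large the ontology is.
hosted = HostedNXOntologySketch("https://example.org/ontology.json")
print(len(pickle.dumps(hosted)))
```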

cc: @ravwojdyla, in case you have opinions.

dhimmel commented 11 months ago

Do we want to serialize the exact version of the ontology when we freeze the feature pipeline?

In the context of nxontology-data, it is likely that we will store a local file with the nxontology prior to the machine learning, such that you will read it locally and not from a URL. There's a chance we can just pass it in via memory and only serialize after additional node attributes are added with the predictions. All parts of the ontology that are used to derive features will not be modified.

My intuition would be to not serialize the nxontology here and leave that up to the user. Just serialize the model. Perhaps there is something I am missing.

I don't think it's absolutely necessary for all learnt state to be serialized, especially if its generation is deterministic. I think the main goal is to be able to inspect models or parameters should questions arise.

If it's easy to serialize everything, then do it. But I don't think we should invest in large amounts of infrastructure to do so if it's not straightforward.
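As a rough illustration of the "just serialize the model" option (hypothetical helper names; assumes a scikit-learn-style model with a predict method), the ontology stays outside the serialized artifact and the features are recomputed deterministically from whichever ontology the user supplies at prediction time:

```python
import pickle
from pathlib import Path
from typing import Any

from nxontology import NXOntology


def node_features(nxo: NXOntology[str], node: str) -> list[float]:
    # Deterministic features recomputed from the user-supplied ontology;
    # the exact feature set is illustrative only.
    info = nxo.node_info(node)
    return [float(info.n_ancestors), float(info.n_descendants)]


def save_model(model: Any, path: Path) -> None:
    # Only the trained model is written to disk.
    path.write_bytes(pickle.dumps(model))


def predict(path: Path, nxo: NXOntology[str], nodes: list[str]) -> list[Any]:
    model = pickle.loads(path.read_bytes())
    features = [node_features(nxo, node) for node in nodes]
    return list(model.predict(features))  # assumes a sklearn-style .predict()
```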

yonromai commented 11 months ago

Interesting, thanks for your input @dhimmel. I guess we should first gain clarity on why we want to serialize the model in the first place.

In the context of nxontology-data, it is likely that we will store a local file with the nxontology prior to the machine learning, such that you will read it locally and not from a URL.

That's good to know!

There's a chance we can just pass it in via memory and only serialize after additional node attributes are added with the predictions. All parts of the ontology that are used to derive features will not be modified.

This would likely invalidate the serialization assumption made in the PR (which relies on a stable URL / file path rather than an in-memory graph).

My intuition would be to not serialize the nxontology here and leave that up to the user. Just serialize the model. Perhaps there is something I am missing. I don't think it's absolutely necessary for all learnt state to be serialized, especially if its generation is deterministic. I think the main goal is to be able to inspect models or parameters should questions arise.

A few observations:

If it's easy to serialize everything, then do it. But I don't think we should invest in large amounts of infrastructure to do so if it's not straightforward.

Making all feature transformers serializable is not a huge amount of work, but it does make sense to check whether serializing the model + features is really useful in the first place.

yonromai commented 11 months ago

How about this: if we aren't sure yet how or why a serialized model would actually be used, perhaps we should not serialize it for now and wait for an actual use case to pop up?

dhimmel commented 11 months ago

If we aren't sure yet how or why a serialized model would actually be used, perhaps we should not serialize it for now and wait for an actual use case to pop up

Sounds good. I'll add a comment at https://github.com/related-sciences/nxontology-ml/issues/35#issuecomment-1740984286 on what the best next steps are