related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Ease NXOntology serialization #36

Closed yonromai closed 11 months ago

yonromai commented 11 months ago

Context: I'm currently working on serializing the trained model. Since the feature transformers have learnt state, they also need to be serialized along with the actual model. I'm currently making sure that all objects whose references are held by feature transformers can be serialized reasonably well. This includes NXOntology (a reference to which is currently held by PrepareNodeFeatures).
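For illustration, here is a minimal sketch (hypothetical class and feature names, not the actual nxontology-ml code) of why pickling a fitted feature pipeline drags the whole NXOntology along with it:

```python
from nxontology import NXOntology


class PrepareNodeFeaturesSketch:
    """Hypothetical stand-in for PrepareNodeFeatures: it holds an NXOntology
    reference, so pickling the fitted pipeline also pickles the whole graph."""

    def __init__(self, nxo: NXOntology[str]) -> None:
        self.nxo = nxo

    def transform(self, nodes: list[str]) -> list[dict[str, int]]:
        # Illustrative features only; the real transformer derives more from node_info.
        return [
            {
                "n_ancestors": self.nxo.node_info(node).n_ancestors,
                "n_descendants": self.nxo.node_info(node).n_descendants,
            }
            for node in nodes
        ]


# Toy ontology with two edges, just to make the sketch runnable.
nxo: NXOntology[str] = NXOntology()
nxo.graph.add_edges_from([("disease", "cancer"), ("cancer", "melanoma")])
transformer = PrepareNodeFeaturesSketch(nxo)
print(transformer.transform(["melanoma"]))
# pickle.dumps(transformer) would also serialize nxo (assuming NXOntology pickles
# cleanly), so the payload grows with the size of the graph.
```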

@dhimmel I'd like your input on 2 aspects:

  1. Do we want to serialize the exact version of the ontology when we freeze the feature pipeline?
    • Cons:
      • Would make it hard to predict on a version of the ontology different from the one the model was trained on
    • Pros:
      • Some properties derived from the graph state need to be serialized anyway (e.g. PCA weights after "fitting")
      • More robust (e.g. avoids "schema" change issues across ontology versions, training/prediction data drift, ...)
      • The node_info call becomes part of the feature pipeline, i.e. coupled to it
  2. If we do want to serialize the version of the ontology as part of the feature pipeline, are you cool with the approach below? (Gist: instead of saving the whole graph, assume the graph is backed by an immutable URL and only save the URL.)

A quick note: the PR doesn't make use of HostedNXOntology yet, but the idea is to ultimately replace the usage of NXOntology[str] with HostedNXOntology wherever it's serialized.
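To make the gist above concrete, here is a rough sketch of the approach (class name and loading mechanism are hypothetical; the actual PR may differ): only the URL gets pickled, and the graph is re-fetched lazily on demand.

```python
from __future__ import annotations

import pickle
from typing import Any

from nxontology import NXOntology


def load_nxontology_from_url(url: str) -> NXOntology[str]:
    """Hypothetical loader; the real implementation depends on the export
    format published by nxontology-data (e.g. a node-link JSON file)."""
    raise NotImplementedError


class HostedNXOntologySketch:
    """Assumes the graph is backed by an immutable URL, so serializing the
    URL alone is enough to reconstruct the exact ontology version later."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._nxo: NXOntology[str] | None = None

    @property
    def nxo(self) -> NXOntology[str]:
        # Fetch the graph lazily and keep it cached in memory.
        if self._nxo is None:
            self._nxo = load_nxontology_from_url(self.url)
        return self._nxo

    def __getstate__(self) -> dict[str, Any]:
        # Drop the in-memory graph; the URL is enough to rebuild the object.
        return {"url": self.url}

    def __setstate__(self, state: dict[str, Any]) -> None:
        self.url = state["url"]
        self._nxo = None


# The pickled payload stays tiny regardless of how large the ontology is.
hosted = HostedNXOntologySketch("https://example.org/ontology.json")
print(len(pickle.dumps(hosted)))
```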

cc: @ravwojdyla, in case you have opinions.

dhimmel commented 11 months ago

Do we want to serialize the exact version of the ontology when we freeze the feature pipeline?

In the context of nxontology-data, it is likely that we will store a local file with the nxontology prior to the machine learning, such that you will read it locally and not from a URL. There's a chance we can just pass it in via memory and only serialize after additional node attributes are added with the predictions. All parts of the ontology that are used to derive features will not be modified.

My intuition would be to not serialize the nxontology here and leave that up to the user. Just serialize the model. Perhaps there is something I am missing.

I don't think it's absolutely necessary for all learnt state to be serialized, especially if its generation is deterministic. I think the main goal is to be able to inspect models or parameters should questions arise.

If it's easy to serialize everything, then do it. But I don't think we should invest in large amounts of infrastructure to do so if it's not straightforward.
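As a rough illustration of the "just serialize the model" option (hypothetical helper names; assumes a scikit-learn-style model with a predict method), the ontology stays outside the serialized artifact and the features are recomputed deterministically from whichever ontology the user supplies at prediction time:

```python
import pickle
from pathlib import Path
from typing import Any

from nxontology import NXOntology


def node_features(nxo: NXOntology[str], node: str) -> list[float]:
    # Deterministic features recomputed from the user-supplied ontology;
    # the exact feature set is illustrative only.
    info = nxo.node_info(node)
    return [float(info.n_ancestors), float(info.n_descendants)]


def save_model(model: Any, path: Path) -> None:
    # Only the trained model is written to disk.
    path.write_bytes(pickle.dumps(model))


def predict(path: Path, nxo: NXOntology[str], nodes: list[str]) -> list[Any]:
    model = pickle.loads(path.read_bytes())
    features = [node_features(nxo, node) for node in nodes]
    return list(model.predict(features))  # assumes a sklearn-style .predict()
```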

yonromai commented 11 months ago

Interesting, thanks for your input @dhimmel. I guess we should first gain clarity on why we want to serialize the model in the first place.

In the context of nxontology-data, it is likely that we will store a local file with the nxontology prior to the machine learning, such that you will read it locally and not from a URL.

That's good to know!

There's a chance we can just pass it in via memory and only serialize after additional node attributes are added with the predictions. All parts of the ontology that are used to derive features will not be modified.

This would likely invalidate the serialization assumption made in the PR (which relies on a stable URL / file path rather than an in-memory graph).

My intuition would be to not serialize the nxontology here and leave that up to the user. Just serialize the model. Perhaps there is something I am missing. I don't think it's absolutely necessary for all learnt state to be serialized, especially if its generation is deterministic. I think the main goal is to be able to inspect models or parameters should questions arise.

A few observations:

If it's easy to serialize everything, then do it. But I don't think we should invest in large amounts of infrastructure to do so if it's not straightforward.

Making all feature transformers serializable is not a huge amount of work, but it does make sense to check whether serializing the model + features is really useful in the first place.

yonromai commented 11 months ago

How about this: if we aren't sure yet how or why a serialized model would actually be used, perhaps we should not serialize it for now and wait for an actual use case to pop up?

dhimmel commented 11 months ago

If we aren't sure yet how or why a serialized model would actually be used, perhaps we should not serialize it for now and wait for an actual use case to pop up

Sounds good. I'll add a comment at https://github.com/related-sciences/nxontology-ml/issues/35#issuecomment-1740984286 on what the best next steps are