related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Integrating nxontology-ml predictions with the nxontology-data efo output #35

Closed dhimmel closed 10 months ago

dhimmel commented 11 months ago

https://github.com/related-sciences/nxontology-ml/issues/30 concerned creating predictions for a recent version of EFO (v3.57.0).

Now we'd like to find a continuous/automated way to make predictions for the latest version of EFO OTAR Slim in nxontology-data. The goal is that whenever there is an EFO version in nxontology-data, the corresponding precision predictions are available.

Two options come to mind:

  1. When the nxontology-data EFO workflow completes, it could trigger an action in nxontology-ml.

  2. nxontology-data could depend on nxontology-ml, and when building EFO, it could export the prediction + features dataset (probably as JSON rather than TSV, based on that repo's conventions for serializing tables). This way the EFO output directories of nxontology-data would include the features + predictions. We could also update the graph serialization to include a precision attribute for nodes.

My preference is for 2 because it provides the tightest coupling between the input nxontology ontology and the output predictions, as well as the most user convenience (rough sketch below). It does complicate dependencies, but hopefully no major issues there. @yonromai any thoughts?
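A minimal sketch of what option 2 could look like from the nxontology-data side. The entry point name (`train_predict`) and the output layout are assumptions at this point in the discussion; the actual function and serialization helper are settled later in this thread:

```python
import pandas as pd
from nxontology import NXOntology
from nxontology_ml.model.predict import train_predict  # assumed entry point


def export_efo_precision(nxo: NXOntology, output_path: str) -> pd.DataFrame:
    """Run nxontology-ml on a freshly built EFO ontology and write its output."""
    nxo.freeze()  # feature computation expects a frozen (immutable) graph
    precision_df = train_predict(nxo=nxo)  # combined features + predictions per node
    # nxontology-data would use its own table-serialization helper here;
    # a compressed record-wise JSON stands in for it in this sketch.
    precision_df.to_json(output_path, orient="records", compression="gzip")
    return precision_df
```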

Along with this change, we should ensure deterministic models / predictions. In addition, it would likely be good to serialize the model such that it could be inspected following a CI run.

yonromai commented 11 months ago

@dhimmel :

The goal is that whenever there is an EFO version in nxontology-data, the corresponding precision predictions are available.

👌

it could export the prediction + features dataset (probably as a json rather than tsv based on that repos conventions for serializing tables)

Sure thing. Just a note that avoiding a record-type serialization (with the column names duplicated per record) would probably reduce the resulting file size significantly (especially if feature data is included).
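To illustrate the point with a generic pandas example (not this repo's actual serializer): a record-oriented JSON repeats every column name once per row, whereas a column-oriented layout stores each name once.

```python
import pandas as pd

df = pd.DataFrame(
    {"efo_id": ["EFO_0000270", "EFO_0000311"], "precision": ["high", "medium"]}
)

# Record-oriented: column names are repeated for every row.
records_json = df.to_json(orient="records")
# [{"efo_id":"EFO_0000270","precision":"high"},{"efo_id":"EFO_0000311","precision":"medium"}]

# Column-oriented ("split"): each column name appears exactly once.
split_json = df.to_json(orient="split", index=False)
# {"columns":["efo_id","precision"],"data":[["EFO_0000270","high"],["EFO_0000311","medium"]]}

print(len(records_json), len(split_json))  # the record-wise form grows faster with row count
```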

My preference is for 2 because it provides the tightest coupling of the input nxontology ontology and output predictions as well as providing the most user convenience.

I agree that it makes sense to keep the data exports colocated.

It does complicate dependencies, but hopefully no major issues there. @yonromai any thoughts?

Indeed, depending on how the ML code is invoked (e.g. poetry install vs. run in docker), we'd probably want to try to align the external dependencies of nxontology-data and nxontology-ml to avoid conflicts.

I don't foresee any large complication. (I thought nxontology-ml currently had a dependency on nxontology-data, but it only depends on nxontology, so we don't have a circular-dependency problem here.)

Along with this change, we should ensure deterministic models / predictions. In addition, it would likely be good to serialize the model such that it could be inspected following a CI run.

👍


Okay so I can work on the following:

  1. Check that the dependencies of nxontology-ml don't conflict with nxontology-data and align them if necessary
  2. Add a model training endpoint to the nxontology-ml CLI, for convenience
  3. Check a trained version of the model into git (so only model inference is required in the nxontology-data release workflow)
  4. Update the nxontology-ml CLI to add a model inference endpoint that takes the new ontology as input and outputs a JSON file with the labels (it can also output the features if you want). Post-processing of the output might be needed to integrate the labels into the file you'd like to produce. (Rough CLI sketch below.)
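A purely hypothetical sketch of what items 2 and 4 could look like as an argparse-based CLI; the real nxontology-ml CLI may use a different framework, command names, and options:

```python
import argparse
from pathlib import Path

from nxontology import NXOntology


def main() -> None:
    parser = argparse.ArgumentParser(prog="nxontology-ml")
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="Fit the precision model and save it to disk.")
    train.add_argument("--model-out", type=Path, default=Path("models/precision_model.cbm"))

    predict = sub.add_parser("predict", help="Predict precision labels for an ontology.")
    predict.add_argument("--ontology", required=True, help="Path or URL of a node-link JSON file.")
    predict.add_argument("--output", type=Path, default=Path("precision.json"))
    predict.add_argument("--include-features", action="store_true")

    args = parser.parse_args()
    if args.command == "predict":
        # Load the ontology, run inference with the checked-in model, and write
        # a JSON file mapping node ids to predicted precision labels.
        nxo = NXOntology.read_node_link_json(args.ontology)
        ...


if __name__ == "__main__":
    main()
```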

A few questions:

dhimmel commented 11 months ago

depending on how the ML code is invoked (e.g. poetry install vs. run in docker), we'd probably want to try to align the external dependencies

Hoping poetry will work.

avoiding a record type of serialization (with the column names duplicated per record) would probably reduce the resulting file size

nxontology-data has a write_dataframe function that we can use for output. It is record-wise but compressed, so the size overhead doesn't matter.

Check-in in git a trained version of the model (so only model inference is required in the nxontology-data release workflow)

I wonder if we need to refit the model, since feature values will depend on which EFO release it is trained on. What do you think?

yonromai commented 11 months ago

nxontology-data has a write_dataframe function that we can use for output. Is recordwise but compresses so does not matter.

We probably want to avoid having a two-way dependency between nxontology-ml and nxontology-data (a circular dependency)? I guess that logic could be (1) duplicated or (2) moved to a common library.

I wonder if we need to refit the model, since feature values will depend on which EFO release it is trained on. What do you think?

That's an interesting point. I was assuming that since the training node labels are static, the metadata of these same nodes was (relatively) static as well. We could train + predict for each data update, although that'd probably slow down the pipeline a bit.

Do you have special requirements in terms of the input format of the updated ontology?

@dhimmel do you have info about ^? (I need to know how the updated ontology would be passed to the model, since at the moment it relies on a static dump.)

dhimmel commented 11 months ago

We probably want to avoid having a 2 way dependency

nxontology-ml can just return the pandas dataframe, nxontology-data can write it.

We could train + predict for each data update, although that'd probably slow down the pipeline a bit.

How slow is it? If less than 30 minutes, should be no problem. We can skip all experiments in terms of what feature sets to include.

I need to know how the updated ontology would be passed to the model since ATM it's relying on a static dump

It will be the same format. Either a path to read the NXOntology from or possibly a NXOntology object.

dhimmel commented 11 months ago

Thinking of a path that allows us to get the predictions in nxontology-data ASAP:

  1. Create a function that takes an EFO nxontology object as input and returns the combined feature and prediction pandas.DataFrame. This function can also be given a path to write intermediate outputs (like serialized models), but this is less of a priority. From here, I can make a PR to call the function from nxontology-data.

  2. Enable determinism, probably just a matter of setting the right random seeds. Weak determinism is okay, meaning dependency version changes or platform changes might break the determinism. The main goal of determinism is to avoid generating different outputs when the inputs haven't changed, to optimize storage and avoid unnecessary diffs in the output branch (a minimal sketch follows).
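A minimal sketch of the weak-determinism idea, assuming the usual seeding hooks; the seed constant and where it is threaded through the pipeline are assumptions here:

```python
import random

import numpy as np
from catboost import CatBoostClassifier

MODEL_SEED = 42  # hypothetical constant; the repo's actual seed value may differ


def seed_everything(seed: int = MODEL_SEED) -> None:
    """Seed the random number generators the pipeline touches, so identical
    inputs reproduce identical outputs (on one platform / dependency set)."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything()
model = CatBoostClassifier(random_seed=MODEL_SEED)  # CatBoost exposes its own seed parameter
```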

yonromai commented 11 months ago

create a function that takes an EFO nxontology object as input and returns the combined feature and prediction pandas.DataFrame. This function can also be given a path to write intermediate outputs (like serialized models), but this is less of a priority. From here, I can make a PR to call the function from nxontology-data.

This function should trigger both model training and prediction, right?

Note: I still don't understand the point of exporting the features. (More details in the "Could you explain the rationale behind officially exposing the feature data to users?" question above.) That being said, I don't strictly need to understand the rationale; I can just do it if you want.

enable determinism, probably just a matter of setting the right random seeds. Weak determinism is okay, meaning dependency version changes or platform changes might break the determinism. The main goal of determinism is to avoid generating different outputs when the inputs haven't changed, to optimize storage and avoid unnecessary diffs in the output branch

👍 will look into it

dhimmel commented 11 months ago

Note: I still don't understand the point of exporting the features

It is nice to know what features were available to the model, especially if someone wants to try fitting another model. Also the features might be relevant for other prediction tasks and the barrier to entry is much lower for someone to grab a file than to run code to recreate something. But you are right that it is not absolutely necessary for the immediate use case. Only the predictions are necessary for that.

yonromai commented 11 months ago

I see. The main reason why I'm bothering you with these questions is that I wanted to have a discussion about scope (& scope creep):

From what I understand, the original scope of this project was to expose an ML-derived precision label to the users of nxontology-data. The feature selection has been dictated purely by usefulness in the best-performing model so far.

Exposing the feature data to the users (and then potentially having to maintain the features based on usage, independently of model usefulness) is a small & subtle extension of scope. This change may be something we want, & I don't personally know the users well enough to know which scope is the most beneficial to them. Just wanted to point that out.

dhimmel commented 11 months ago

I'm working on applying the nxontology-ml classification in https://github.com/related-sciences/nxontology-data/pull/20.

I had to include the training labels in the distribution with https://github.com/related-sciences/nxontology-ml/pull/41 (we probably want to move that file into the package tree).

Here's some dev code I'm using:

```python
from nxontology import NXOntology
from nxontology_ml.model.predict import train_predict as nxontology_ml_train_predict

url = "https://github.com/related-sciences/nxontology-data/raw/c03cdd9e6486ff9ff7f22f260562e6b36465b052/efo_otar_slim.json"
nxo = NXOntology.read_node_link_json(url)
nxo.freeze()  # edited comment to add this
precision_df = nxontology_ml_train_predict(nxo=nxo)
```

After 27 minutes, it threw the following error. @yonromai any insights into this error?

Expand for error trace

```python-traceback
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/home/dhimmel-rs/Documents/repos/nxontology-data/nxontology_data/efo/efo.ipynb Cell 3 line 1
----> 1 precision_df = nxontology_ml_train_predict(nxo=nxo)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/model/predict.py:121, in train_predict(conf, nxo, training_set, include_feature_values, train_take, predict_take)
    117 training_set = training_set or read_training_data(
    118     filter_out_non_disease=True, nxo=nxo, take=train_take
    119 )
    120 nxo = nxo or get_efo_otar_slim()
--> 121 feature_pipeline, model = train_model(
    122     conf=conf, nxo=nxo, training_set=training_set, take=train_take
    123 )
    124 return predict(
    125     feature_pipeline=feature_pipeline,
    126     model=model,
    (...)
    130     take=predict_take,
    131 )

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/model/train.py:48, in train_model(conf, nxo, training_set, take)
     30 (X, y) = training_set or read_training_data(
     31     filter_out_non_disease=True, nxo=nxo, take=take
     32 )
     34 feature_pipeline: Pipeline = make_pipeline(
     35     PrepareNodeFeatures(nxo=nxo),
     36     NodeInfoFeatures(),
    (...)
     46     CatBoostDataFormatter(),
     47 )
---> 48 X_transform = feature_pipeline.fit_transform(X, y)
     49 model = CatBoostClassifier(
     50     eval_metric=conf.get_eval_metric,
     51     depth=conf.depth,
    (...)
     56     random_seed=MODEL_SEED,
     57 )
     58 model.fit(
     59     X=Pool(
     60         data=X_transform,
     61         label=list(y),
     62     )
     63 )

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/sklearn/base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1145 estimator._validate_params()
   1147 with config_context(
   1148     skip_parameter_validation=(
   1149         prefer_skip_nested_validation or global_skip_validation
   1150     )
   1151 ):
-> 1152     return fit_method(estimator, *args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/sklearn/pipeline.py:471, in Pipeline.fit_transform(self, X, y, **fit_params)
    444 """Fit the model and transform with the final estimator.
    445
    446 Fits all the transformers one after the other and transform the
    (...)
    468     Transformed samples.
    469 """
    470 fit_params_steps = self._check_fit_params(**fit_params)
--> 471 Xt = self._fit(X, y, **fit_params_steps)
    473 last_step = self._final_estimator
    474 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/sklearn/pipeline.py:377, in Pipeline._fit(self, X, y, **fit_params_steps)
    375 cloned_transformer = clone(transformer)
    376 # Fit or load from cache the current transformer
--> 377 X, fitted_transformer = fit_transform_one_cached(
    378     cloned_transformer,
    379     X,
    380     y,
    381     None,
    382     message_clsname="Pipeline",
    383     message=self._log_message(step_idx),
    384     **fit_params_steps[name],
    385 )
    386 # Replace the transformer of the step with the fitted
    387 # transformer. This is necessary when loading the transformer
    388 # from the cache.
    389 self.steps[step_idx] = (name, fitted_transformer)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/joblib/memory.py:353, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    352 def __call__(self, *args, **kwargs):
--> 353     return self.func(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/sklearn/pipeline.py:957, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    955 with _print_elapsed_time(message_clsname, message):
    956     if hasattr(transformer, "fit_transform"):
--> 957         res = transformer.fit_transform(X, y, **fit_params)
    958     else:
    959         res = transformer.fit(X, y, **fit_params).transform(X)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/sklearn/utils/_set_output.py:157, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    155 @wraps(f)
    156 def wrapped(self, X, *args, **kwargs):
--> 157     data_to_wrap = f(self, X, *args, **kwargs)
    158     if isinstance(data_to_wrap, tuple):
    159         # only wrap the first output for cross decomposition
    160         return_tuple = (
    161             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    162             *data_to_wrap[1:],
    163         )

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/sklearn/base.py:919, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    916     return self.fit(X, **fit_params).transform(X)
    917 else:
    918     # fit method of arity 2 (supervised transformation)
--> 919     return self.fit(X, y, **fit_params).transform(X)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/text_embeddings/text_embeddings_transformer.py:48, in TextEmbeddingsTransformer.fit(self, X, y, **fit_params)
     46 self._lda.fit(self._nodes_to_vec(X), y)
     47 if self._pca:
---> 48     self._pca.fit(self._nodes_to_vec(X))
     49 return self

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/text_embeddings/text_embeddings_transformer.py:80, in TextEmbeddingsTransformer._nodes_to_vec(self, X)
     79 def _nodes_to_vec(self, X: NodeFeatures) -> np.ndarray:
---> 80     return np.array([self._embedding_model.embed_node(node) for node in X.nodes])

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/text_embeddings/text_embeddings_transformer.py:80, in <listcomp>(.0)
     79 def _nodes_to_vec(self, X: NodeFeatures) -> np.ndarray:
---> 80     return np.array([self._embedding_model.embed_node(node) for node in X.nodes])

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/text_embeddings/embeddings_model.py:84, in AutoModelEmbeddings.embed_node(self, node)
     82 def embed_node(self, node: NodeInfo[str]) -> np.array:
     83     text = f"{node.data['efo_label']}: {node.data['efo_definition']}"
---> 84     return self.embed_text(text)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/text_embeddings/embeddings_model.py:89, in AutoModelEmbeddings.embed_text(self, text)
     86 def embed_text(self, text: str) -> np.ndarray:
     87     # Note: We could cache keyed on node id, but text seems safer (albeit less space efficient)
     88     cache_key = sha1(text.encode()).hexdigest()
---> 89     if cache_key in self._cache:
     90         self._counter[f"{self.__class__.__name__}/CACHE_HIT"] += 1
     91         return pickle.loads(self._cache[cache_key])

File /usr/lib/python3.10/_collections_abc.py:830, in Mapping.__contains__(self, key)
    828 def __contains__(self, key):
    829     try:
--> 830         self[key]
    831     except KeyError:
    832         return False

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/gpt_tagger/_cache.py:116, in LazyLSM.__getitem__(self, __k)
    115 def __getitem__(self, __k: str) -> bytes:
--> 116     self.__open_db()
    117     return self._lsm.__getitem__(__k)

File ~/.cache/pypoetry/virtualenvs/nxontology-data-M7MYg8-I-py3.10/lib/python3.10/site-packages/nxontology_ml/gpt_tagger/_cache.py:103, in LazyLSM.__open_db(self)
    101 def __open_db(self) -> None:
    102     if not self._lsm:
--> 103         self._lsm = LSM(filename=self._filename, **self._lsm_kwargs)

File lsm.pyx:388, in lsm.LSM.__init__()
File lsm.pyx:407, in lsm.LSM.open()
File lsm.pyx:228, in lsm._check()

Exception: Unknown error
```

How essential is pytorch currently? I notice it slows the poetry install and requires some extra pyproject declaration. So wondering if there is an easy alternative that would remove this dependency without requiring more code or making things much slower?

yonromai commented 11 months ago

@yonromai any insights into this error?

Not quite sure, I'll attempt to reproduce.

How essential is pytorch currently?

I wonder if it's a relic from the FAISS days. I'll see if it can be removed.

yonromai commented 11 months ago

@dhimmel I took a quick look:

After 27 minutes,

Sorry about the poor UX, sucks that:

  1. It took so long
  2. There was no logging of the work happening
  3. (The thing eventually crashed, obviously)

Here is a PR that addresses (2) (along with the Ultimate Question of Life, the Universe, and Everything): #42

Regarding (1), here is a breakdown of where the time is spent (from the PR ^):

| Step | mm:ss |
| --- | --- |
| NodeInfoFeatures: Computing num features | 10:59 |
| TherapeuticAreaFeatures: Computing num features | 01:41 |
| Fetching node embeddings | 13:16 |
| Model training | 01:38 |

With the embeddings cached, the whole operation used to take ~3min - so I have 2 questions for you:

it threw the following error. @yonromai any insights into this error?

I didn't get the same error. I wonder if it is due to system differences.

Now I get the following error during prediction (assert introduced in PR ^):

```python-traceback
Traceback (most recent call last):
  File "/Users/romain/dev/nxontology-ml/foo.py", line 7, in <module>
    precision_df = nxontology_ml_train_predict(nxo=nxo)
  File "/Users/romain/dev/nxontology-ml/nxontology_ml/model/predict.py", line 125, in train_predict
    return predict(
  File "/Users/romain/dev/nxontology-ml/nxontology_ml/model/predict.py", line 55, in predict
    assert len(target_nodes) > 0, f"No disease nodes found"
AssertionError: No disease nodes found
```

This means that:

  1. get_disease_nodes doesn't return any node anymore. Did something change in the node metadata?
  2. The training phase finished successfully; this error occurs during the prediction phase (your error was triggered in the training phase)

How essential is pytorch currently?

Turns out it is pretty important since pytorch is needed to get the BioLinkBERT embeddings from text. When the pip cache is warm, it doesn't take that long... => Would there be a way to cache pip packages during the build?

dhimmel commented 11 months ago

Did some logic / data change about node data attributes?

Perhaps I forgot to freeze the graph to enable caching of node info properties. I edited the code above to add that.

I also updated the commit hash for the nxo source to ensure we're using the latest data.

Would there be a way to pre-compute the node embeddings & cache them somewhere available from the build?

Yes I think this could be a good idea. nxontology-data could be responsible for storing and updating the cache. Do you think that's better than having this package contain the cache which would then get outdated?

Would there be a way to cache pip packages during the build?

Probably, but a 90-second install time is not a big problem, so let's just keep it as is.

Which OS & python version are you using?

Linux and Python 3.10.

yonromai commented 11 months ago

Perhaps I forgot to freeze the graph to enable caching of node info properties. I edited the code above to add that.

Indeed, after freezing the graph both NodeInfoFeatures and TherapeuticAreaFeatures are pretty much instant. Now the overwhelming majority of the time is taken by fetching the text embeddings (unless cached).

Yes I think this could be a good idea. nxontology-data could be responsible for storing and updating the cache. Do you think that's better than having this package contain the cache which would then get outdated?

I think that maintaining an updated cache is probably the most sustainable way to go.

However, working with a static cache would be a lot less work and probably still help a lot (& it might take a while for the cache to have a lot of misses and become less useful). The cache file could be a little large for git (53M), but it could work. Also, perhaps we could manually upload that file to a read-only bucket and pull it during the build (as an intermediary solution).

Linux and Python 3.10

Okay, it's odd since the tests seem to work on the linux build. I guess I'll try to reproduce the error in docker on my mac.

More questions for you:

Note: I won't have a lot of dev time until Thursday, so it might take me a couple of days to fix this.

dhimmel commented 11 months ago

Indeed, after freezing the graph both NodeInfoFeatures and TherapeuticAreaFeatures are pretty much instant

Probably best for train_predict to always freeze the nxo in case the user forgets. An unfrozen graph doesn't make sense in this context, because editing the structure would invalidate all of the node info features.
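A defensive freeze inside train_predict could be as simple as the following sketch (not the actual change later made in #44):

```python
import pandas as pd
from nxontology import NXOntology


def train_predict(nxo: NXOntology) -> pd.DataFrame:
    # Freeze up front so node-info properties are cached; an unfrozen graph
    # could be mutated mid-run, which would invalidate the computed features.
    nxo.freeze()
    # ... build the feature pipeline, fit the model, and predict as before ...
    ...
```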

The cache file could be a little large for git (53M)

Does it compress much for storage purposes? What does this cache contain... the description text to embedded vectors?

Okay, it's odd since the tests seem to work on the linux build.

I think the issue is more likely related to running outside of the git repo structure and running when installed as a package. I was running it inside the nxontology-data poetry environment from https://github.com/related-sciences/nxontology-data/pull/20.

I won't have a lot of dev time until Thursday

Not to worry. I'll be around Thursday, but not Friday. So we can also pick up next week.

dhimmel commented 11 months ago

Regarding the LSM "Unknown error", I wonder if the cache got corrupted. When I just ran, it looked like things were working, but then I stopped the process mid-way and now the error is back. It would be good to log where the LSM cache is stored... not actually sure where it is and hence how to delete it to test my theory.

yonromai commented 11 months ago

@dhimmel

Regarding the LSM "Unknown error", I wonder if the cache got corrupted. When I just ran, it looked like things were working, but then I stopped the process mid-way and now the error is back.

That could totally be it, unless it's a difference of environment. Still need to get into it

It would be good to log where the LSM cache is stored...

That's a good point, I'll add a logging.debug statement.

not actually sure where it is and hence how to delete it to test my theory.

The cache should be in `REPO_DIR/.cache/michiyasunaga_BioLinkBERT_base.ldb`

yonromai commented 11 months ago

Probably best for train_predict to always freeze the nxo in case the user forgets. An unfrozen graph doesn't make sense in this context, because editing the structure would invalidate all of the node info features.

Noted, addressed in #44

Does it compress much for storage purposes? What does this cache contain... the description text to embedded vectors?

I haven't tried compressing the data, but what's stored in the cache is a hash (md5) of the node text and the vector for each node. Given that the data is essentially random, I wouldn't expect compression to be especially effective.
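For reference, the cache visible in the earlier traceback boils down to roughly this (a paraphrase rather than the exact source; note that the traceback hashes the text with sha1):

```python
import pickle
from hashlib import sha1

import numpy as np
from lsm import LSM  # lsm-db, the key-value store backing the on-disk cache

# Example location (assumes the directory already exists); the real path is
# configurable/logged per the discussion above.
cache = LSM(".cache/michiyasunaga_BioLinkBERT_base.ldb")


def embed_text_cached(text: str, embed_text) -> np.ndarray:
    # Keys are a hash of the node text; values are pickled embedding vectors.
    key = sha1(text.encode()).hexdigest()
    try:
        return pickle.loads(cache[key])
    except KeyError:
        vector = embed_text(text)
        cache[key] = pickle.dumps(vector)
        return vector
```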

I think the issue is more likely related to running outside of the git repo structure and running when installed as a package. I was running it inside the nxontology-data poetry environment from https://github.com/related-sciences/nxontology-data/pull/20.

Makes sense, will look into that.

yonromai commented 11 months ago

I think the issue is more likely related to running outside of the git repo structure and running when installed as a package. I was running it inside the nxontology-data poetry environment from related-sciences/nxontology-data#20.

I tried to reproduce the error by:

  1. Cloning nxontology-data
  2. Checking out the dsh-efo-node-classification branch
  3. Running process_efo()

But I get the following error:

```python-traceback
...
DEBUG:root:[uniprot.var] formatter does not end with $1: UniProt Variants
DEBUG:root:[wikigenes] formatter does not end with $1: WikiGenes
DEBUG:root:[wikigenes] formatter does not end with $1: WikiGenes
Traceback (most recent call last):
  File "/Users/romain/dev/nxontology-data/nxontology_data/efo/efo.py", line 374, in <module>
    process_efo()
  File "/Users/romain/dev/nxontology-data/nxontology_data/efo/efo.py", line 365, in process_efo
    processor.write_outputs()
  File "/Users/romain/dev/nxontology-data/nxontology_data/efo/efo.py", line 314, in write_outputs
    self.get_mapping_properties_df(),
  File "/Users/romain/dev/nxontology-data/nxontology_data/efo/efo.py", line 153, in get_mapping_properties_df
    converter.add_prefix(
  File "/Users/romain/dev/nxontology-data/.venv/lib/python3.10/site-packages/curies/api.py", line 573, in add_prefix
    self.add_record(record, case_sensitive=case_sensitive, merge=merge)
  File "/Users/romain/dev/nxontology-data/.venv/lib/python3.10/site-packages/curies/api.py", line 484, in add_record
    raise ValueError(f"new record already exists and merge=False: {matched}")
ValueError: new record already exists and merge=False: {('icd10cm', 'https://icd.codes/icd10cm/', 'ICD10CM', 'http://bioregistry.io/ICD10CM:,http://bioregistry.io/icd10cm:,http://icd.codes/icd10cm/,http://purl.bioontology.org/ontology/ICD10CM/,https://bioregistry.io/ICD10CM:,https://bioregistry.io/icd10cm:,https://purl.bioontology.org/ontology/ICD10CM/'): ['URI prefix match']}
```

@dhimmel Is there something I'm doing wrong?

dhimmel commented 11 months ago

Ah, that is an unrelated problem. Can we avoid calling process_efo and just load an already built nxontology?

yonromai commented 11 months ago

Can we avoid calling process_efo and just loading an already built nxontology?

Sure, I'm just unsure how to reproduce the issue you had in the simplest fashion. Any recommendation?

dhimmel commented 11 months ago

I'm just unsure how to reproduce the issue you had in the simplest fashion

From a nxontology-data environment, try running the commands at https://github.com/related-sciences/nxontology-ml/issues/35#issuecomment-1755770886.

Likely want to update the nxontology-ml commit in pyproject first.

yonromai commented 11 months ago

From a nxontology-data environment, try running the commands at https://github.com/related-sciences/nxontology-ml/issues/35#issuecomment-1755770886.

I see, that's a good place to start

Likely want to update the nxontology-ml commit in pyproject first.

Yes, I did that.

yonromai commented 11 months ago

@dhimmel Okay, so running your snippet did reproduce the issue; it's a pretty stupid bug (the cache parent dir needs to exist) - fix coming soon.
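The fix is presumably the usual guard before opening the cache file (the path here is just an example):

```python
from pathlib import Path

cache_path = Path(".cache/michiyasunaga_BioLinkBERT_base.ldb")  # example location
# Create the parent directory before LSM tries to create the database file;
# otherwise opening the cache fails with the opaque "Unknown error" seen above.
cache_path.parent.mkdir(parents=True, exist_ok=True)
```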

Also, I realized the cache dir is within the venv dir (e.g. /Users/romain/dev/nxontology-data/.venv/lib/python3.10/site-packages/.cache/michiyasunaga_BioLinkBERT_base.ldb), which has the benefit of being removed along with the python venv, but is a slightly odd location. Do you have a preference in terms of where that cache dir should live?

dhimmel commented 11 months ago

Do you have a preference in terms of where that cache dir should live?

Let's allow the user to pass a location to train_predict. As far as the best default that makes sense both when running nxontology-ml from its source repo and when it is used as an imported library... not sure.

Slightly related is the comment at https://github.com/related-sciences/nxontology-ml/pull/41#issuecomment-1755685996. I think we should move data/efo_otar_slim_v3.43.0_rs_classification.tsv into the package directory.

yonromai commented 11 months ago

Let's allow the user to pass a location to train_predict. As far as the best default that makes sense both when running nxontology-ml from its source repo and when it is used as an imported library... not sure.

Slightly related is the comment at #41 (comment). I think we should move data/efo_otar_slim_v3.43.0_rs_classification.tsv into the package directory.

Cool, I think I'll split this into a separate issue and work on it next week.

Note: I ran the model but there is still the error that I mentioned above:

```python-traceback
Traceback (most recent call last):
  File "/Users/romain/dev/nxontology-data/foo.py", line 11, in <module>
    precision_df = nxontology_ml_train_predict(nxo=nxo)
  File "/Users/romain/dev/nxontology-data/.venv/lib/python3.10/site-packages/nxontology_ml/model/predict.py", line 125, in train_predict
    return predict(
  File "/Users/romain/dev/nxontology-data/.venv/lib/python3.10/site-packages/nxontology_ml/model/predict.py", line 55, in predict
    assert len(target_nodes) > 0, "No disease node found"
AssertionError: No disease node found
```

dhimmel commented 11 months ago

Note: I ran the model but there is still the error that I mentioned above:

I think this is because take is defaulting to 0 instead of None, which causes nothing to be returned by

https://github.com/related-sciences/nxontology-ml/blob/bbe83ba492cff05792b1959d805814e9840b8f4a/nxontology_ml/model/predict.py#L23-L32
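A quick illustration of the failure mode (hypothetical code; the linked predict.py presumably applies its take limit in a similar way):

```python
from itertools import islice

node_ids = ["EFO_0000270", "EFO_0000311", "EFO_0000400"]  # made-up example ids


def take_nodes(node_ids, take=None):
    # islice(iterable, None) yields every element, while islice(iterable, 0) yields none.
    return list(islice(node_ids, take))


print(take_nodes(node_ids, take=None))  # all three nodes
print(take_nodes(node_ids, take=0))     # [] -> downstream "No disease nodes found" assertion
```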

yonromai commented 11 months ago

I think this is because take is defaulting to 0 instead of None and that causes nothing to be returned by

Of course you're right - I don't usually test the default values... good catch! PR coming up

dhimmel commented 10 months ago

We have a successful run of the EFO build in nxontology-data that incorporates the nxontology-ml predictions. 🎉

Will close with a follow up in https://github.com/related-sciences/nxontology-ml/issues/46.

Also, I'm curious whether we compute features for non-diseases currently and whether we should include non-disease rows in the output file? Currently, nxontology-data assumes that any node not in the output prediction table is a non_disease. This is probably the right approach if we're not computing features for non-diseases.

yonromai commented 10 months ago

We have a successful run of the EFO build in nxontology-data that incorporates the nxontology-ml predictions. 🎉

🎉🎉

Currently, nxontology-data assumes that any node not in the output prediction table is a non_disease. This is probably the right approach if we're not computing features for non diseases.

💯

Also, I'm curious if we compute features for non-diseases currently and whether we should include non-disease rows in the output file?

We don't compute them at the moment, but that can be changed.