Add text embedding features + experiments

related-sciences / nxontology-ml

Machine learning to classify ontology nodes

Apache License 2.0

Add text embedding features + experiments #12

Closed. yonromai closed this 1 year ago

yonromai commented 1 year ago

Hi,

This PR has 2 commits:

The first one adds sklearn transformers that provide text embeddings as features, with optional dimensionality reduction & KNN tree search
The second one adds the "experiment" code which glues all the experiments together and records various model metadata for later analysis
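To give reviewers a rough picture of the first commit's idea, here is a minimal sketch. It is not the code in this PR. `toy_embed` and `TextEmbeddingTransformer` are hypothetical names, the "embedding" is a hash-seeded random vector standing in for a real model, and PCA and a ball tree are assumed here as one possible choice for the reduction and KNN steps.

```python
import hashlib

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline


def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a real embedding model.

    Returns a deterministic pseudo-random vector seeded from a hash of the text.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    return np.random.default_rng(seed).normal(size=dim)


class TextEmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Turns an iterable of strings into an (n_samples, dim) feature matrix."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing to learn: the toy embedding is stateless.
        return self

    def transform(self, X):
        return np.vstack([toy_embed(text, self.dim) for text in X])


# Made-up node descriptions, for illustration only.
texts = [
    "first example node description",
    "second example node description",
    "third example node description",
    "fourth example node description",
    "fifth example node description",
]

# Embed the text, then optionally reduce the dimensionality.
pipeline = Pipeline(
    [
        ("embed", TextEmbeddingTransformer(dim=16)),
        ("reduce", PCA(n_components=3)),
    ]
)
features = pipeline.fit_transform(texts)

# Tree-based nearest-neighbour search over the reduced embeddings.
knn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(features)
distances, indices = knn.kneighbors(features)
```

Because the index is queried with the same points it was fitted on, each row's closest neighbour is the point itself, so the second column of `indices` is the one that carries information.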

I'm sorry @dhimmel, this is a lot more code than I'd like (including test resources & the notebook), which is gonna make it very impractical to review :(

One thing I did was keep the more "production-ready" code (first commit: feature builders) in the nxontology_ml dir, while the experimentation code contains the notebook, experimentation & metadata logic. IMO it would be fair to spend more time on the first commit; we can do a second review pass when moving the relevant code from experimentation to nxontology_ml.
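For the "metadata logic" part, the general idea can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `ExperimentRecord`, `save_record` and `load_records` are hypothetical names, JSON files are an assumed storage format, and the metric values are placeholders rather than real results.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ExperimentRecord:
    """Hypothetical metadata kept for one experiment run."""

    name: str
    feature_steps: list[str]
    metrics: dict[str, float]
    created_at: float = field(default_factory=time.time)


def save_record(record: ExperimentRecord, out_dir: Path) -> Path:
    """Write one record as <name>.json so it can be inspected later."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record.name}.json"
    path.write_text(json.dumps(asdict(record), indent=2, sort_keys=True))
    return path


def load_records(out_dir: Path) -> list[ExperimentRecord]:
    """Read back every saved record, ordered by file name."""
    return [
        ExperimentRecord(**json.loads(path.read_text()))
        for path in sorted(out_dir.glob("*.json"))
    ]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Placeholder metric values, not results from this PR.
    save_record(
        ExperimentRecord("baseline", ["embed"], {"placeholder_metric": 0.0}),
        out_dir,
    )
    save_record(
        ExperimentRecord("embed_pca", ["embed", "pca"], {"placeholder_metric": 0.0}),
        out_dir,
    )
    loaded = load_records(out_dir)
```

Keeping one small file per run means the records can be committed alongside the code, which is what makes it possible to look back at earlier experiments.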

eric-czech commented 1 year ago

> The second one adds the "experiment" code which glues all the experiments together and records various model metadata for later analysis

@yonromai can you comment on this process some more? What were some of the steps you went through and what kind of results did it produce? Is this everything you needed in https://github.com/related-sciences/nxontology-ml/issues/8#issuecomment-1686872995?

Basically, I'm just trying to understand what all this PR enables as a user/consumer of this repo.

yonromai commented 1 year ago

> @yonromai can you comment on this process some more?

@eric-czech Thanks for the question, I just realized that I provided very little context about this PR.

> What were some of the steps you went through and what kind of results did it produce? Is this everything you needed in #8 (comment)?

Right, this PR has everything you need to reproduce the experimental results presented in #8.

> Basically, I'm just trying to understand what all this PR enables as a user/consumer of this repo.

That's a good point. This PR is meant more to preserve experimental history than to be directly useful to the end user. The idea is to eventually keep the most successful experiment(s) in the latest state of the main branch, but be able to go back in time and see what decisions/experimental results led to the current choice of features/models.

Does this make sense?

eric-czech commented 1 year ago

> The idea is to eventually keep the most successful experiment(s) in the latest state of the main branch, but be able to go back in time and see what decisions/experimental results led to the current choice of features/models.

Perfect 👍.