Feature/labels transfer

VladimirShitov commented 1 year ago

Implement KNN classifier
Implement XGBoost classifier
- Save models
- Retrain models only if necessary (i.e. the model doesn't exist or the user forces retraining)

rcannood commented 1 year ago

Hey @VladimirShitov!

In order to fully wrap around what functionality is being added, could you:

Add a src/label_transfer/README.md?

Add a schema for the h5ad files. Example from a openproblemsv2:

- type: "file"
name: "--input_train"
description: "The training data"
info:
  short_description: "Training data"
  slots:
    layers:
    - type: "integer"
      name: "counts"
      description: "Raw counts"
    - type: "double"
      name: "normalized"
      description: "Normalized counts"
    obs:
    - type: "string"
      name: "label"
      description: "Ground truth cell type labels"
    - type: "string"
      name: "batch"
      description: "Batch information"
    var:
    - type: "boolean"
      name: "hvg"
      description: "Whether or not the feature is considered to be a 'highly variable\
        \ gene'"
      required: true
    - type: "integer"
      name: "hvg_score"
      description: "A ranking of the features by hvg."
      required: true
    obsm:
    - type: "double"
      name: "X_pca"
      description: "The resulting PCA embedding."
      required: true
    uns:
    - type: "string"
      name: "dataset_id"
      description: "A unique identifier for the dataset"
      required: true
    - type: "string"
      name: "normalization_id"
      description: "Which normalization was used"
      required: true

VladimirShitov commented 1 year ago

Is such schema a requirement or a recommendation? E.g. will it be used for validation?

This component is pretty flexible: you can run it with different .obs columns and layers for training if you correctly specify arguments. So I can provide an example schema but wouldn't recommend following it strictly (e.g. always name column for labels transfer "label")

rcannood commented 1 year ago

Is such schema a requirement or a recommendation? E.g. will it be used for validation?

At the moment it's will be used for documentation.

However, in OpenProblems it's also used for automating some of the unit tests, in that I implemented a test which will run whatever executable it gets and checks whether the output files contain the right slots. It might be something that we could also look into for OpenPipelines.

rcannood commented 11 months ago

Hi Vladimir,

I would like to see this PR merged. What do you think needs to happen before we can merge it? :)

Robrecht

VladimirShitov commented 11 months ago

@rcannood I added schemas and answered your comments. Thank you very much for the feedback! Could you check PR again?

Especially I am concerned if I described the .h5mu files format correctly in schemas.

rcannood commented 11 months ago

I'm taking a look at prepping this branch to be merged into main.

So far, I updated the file format yamls and moved them to a separate API folder. I added 'merge' statements in the components to make them use the file format specs defined earlier.

ddemaeyer commented 10 months ago

@rcannood What is the state of this pull request ATM?

VladimirShitov commented 8 months ago

Thank you @rcannood ! Looks great to me. Regarding the first 4 points, I completely agree. For the last one, I support your reasoning, but in some cases, people only use .X to store the embedding. This is the case for HLCA, which we use as an example and test.

Alternatively, we can have 2 arguments, e.g., use_X and reference_obsm_features. If the first one is present or the second one is absent, .X is used

rcannood commented 8 months ago

Ok! I find it quite strange to store an embedding in X, but let's roll with it.

I refactored the tests a bit.

Merging when the CI tests succeed!

rcannood commented 8 months ago

:partying_face:

openpipelines-bio / openpipeline

Feature/labels transfer #221