pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License

Metadata standardization models #477

Closed: nsheff closed this issue 2 months ago

nsheff commented 5 months ago

@saanikat, here is a proposal for how to organize the code for a standardized attribute name predictor.

Generic infrastructure idea

Write a function that takes a PEP, or a PEPhub registry path, and produces predicted labels. I think the best home for this function is probably inside peppy (but with all of its dependencies declared as an optional group of dependencies).

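from __future__ import annotations  # lets annotations name AttrStandardizer before it is defined

import peppy


def standardize_attr_names(pep: peppy.Project, attr_std: AttrStandardizer) -> dict:
    """
    This function takes a PEP and a standardizer model, and returns suggestions
    for how to rename the sample attributes.
    """
    raise NotImplementedError()


# Version to allow passing either a path to a local PEP, or a registry path to PEPhub.
# is_registry_path and load_pep_from_registry are placeholder helpers, not defined here.
def standardize_attr_names_from_path(pep_path: str, **kwargs) -> dict:
    if is_registry_path(pep_path):
        loaded_pep = load_pep_from_registry(pep_path)
    else:
        loaded_pep = peppy.Project(pep_path)

    # Forward kwargs (including attr_std) so the standardizer reaches the main function
    return standardize_attr_names(loaded_pep, **kwargs)


class AttrStandardizer(object):
    def __init__(self, schema, model):
        self.schema = schema  # holds the schema to standardize into
        self.model = model    # holds the model that will do the prediction

    def train(self, *args, **kwargs):  # training arguments are not decided yet
        pass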

The return value of standardize_attr_names could be a dict that maps each old attr name to a proposal, where each proposal is itself a dictionary, with keys corresponding to suggested names and values corresponding to prediction confidence:

{
  "old_attr": { "new_top_suggestion": 0.54, "second_ranked_option": 0.36, ... },  # less-confident prediction
  "cell-type": { "cell_type": 0.99 }  # confident prediction
}
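
To illustrate how a caller might consume this structure, here is a minimal sketch that keeps only the top-ranked suggestion for each attribute and accepts it when its confidence clears a threshold. The function name `pick_renames` and the 0.9 default cutoff are assumptions made for illustration; neither is part of the proposal.

```python
def pick_renames(suggestions: dict, min_confidence: float = 0.9) -> dict:
    """Map each old attribute name to its top suggestion, if confident enough.

    `suggestions` has the shape described above:
    {old_name: {suggested_name: confidence, ...}, ...}
    """
    renames = {}
    for old_name, candidates in suggestions.items():
        if not candidates:
            continue  # no proposal for this attribute
        # Pick the suggestion with the highest confidence value
        best_name, best_score = max(candidates.items(), key=lambda kv: kv[1])
        if best_score >= min_confidence:
            renames[old_name] = best_name
    return renames


example = {
    "old_attr": {"new_top_suggestion": 0.54, "second_ranked_option": 0.36},
    "cell-type": {"cell_type": 0.99},
}
print(pick_renames(example))  # {'cell-type': 'cell_type'}
```

Attributes left out of the result, like `old_attr` here, could be shown to the user with their ranked options instead of being renamed automatically.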

BEDbase application

This is a generic system that could be used for any type of metadata table. We or others could train different models that predict into different schemas. You would select which schema to predict into by supplying the appropriate model, since each model is trained on its own dataset and against its own schema.
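
As a rough sketch of that idea, the snippet below uses two standardizer objects that share one interface but target different schemas. A real model would be trained; here plain string similarity from Python's `difflib` stands in for it, so the scores are similarity ratios rather than prediction confidences. The class name `ToyStandardizer`, its `predict` method, the `standardize` helper, and both schemas are hypothetical.

```python
from difflib import SequenceMatcher


class ToyStandardizer:
    """Stand-in for a trained model: scores names by string similarity."""

    def __init__(self, schema: list):
        self.schema = schema  # standardized attribute names to predict into

    def predict(self, attr_name: str, top_n: int = 2) -> dict:
        # Score the input name against every name in the target schema
        scores = {
            target: round(SequenceMatcher(None, attr_name.lower(), target).ratio(), 2)
            for target in self.schema
        }
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[:top_n])


def standardize(attr_names: list, attr_std: ToyStandardizer) -> dict:
    """Return {old_name: {suggested_name: score}} for every attribute name."""
    return {name: attr_std.predict(name) for name in attr_names}


# Two hypothetical schemas; choosing the standardizer chooses the schema.
bed_std = ToyStandardizer(["cell_type", "genome", "antibody"])
rna_std = ToyStandardizer(["tissue", "library_type", "read_length"])

attrs = ["cell-type", "Genome"]
print(standardize(attrs, bed_std))
print(standardize(attrs, rna_std))
```

The calling code stays the same in both cases; only the object passed in changes, which is what keeps the infrastructure generic.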

To apply this to a BEDbase task, we'd train an AttrStandardizer object on the curated BED datasets, using a curated minimal BED metadata schema. We'd save this model independently and load it from disk whenever we need to make a prediction. We'd then add the path to the model to the BEDboss config and integrate these predictions into BEDboss.
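
A minimal sketch of that save-then-load flow, under several assumptions: the model state is serialized with Python's `pickle`, the config is a plain dict, and the key `attr_standardizer_model` is a made-up name rather than an actual BEDboss config option. The `AttrStandardizer` here is a bare placeholder so the snippet runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


class AttrStandardizer:
    """Placeholder holding the two attributes from the proposal."""

    def __init__(self, schema, model):
        self.schema = schema
        self.model = model


def save_standardizer(attr_std: AttrStandardizer, path: Path) -> None:
    # Store plain state rather than the object itself
    state = {"schema": attr_std.schema, "model": attr_std.model}
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load_standardizer(config: dict) -> AttrStandardizer:
    # "attr_standardizer_model" is a hypothetical config key.
    # Only unpickle files from a trusted source.
    with open(config["attr_standardizer_model"], "rb") as handle:
        state = pickle.load(handle)
    return AttrStandardizer(**state)


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "bed_attr_standardizer.pkl"
    trained = AttrStandardizer(schema=["cell_type", "genome"], model={"weights": [0.1, 0.2]})
    save_standardizer(trained, model_path)

    config = {"attr_standardizer_model": str(model_path)}
    restored = load_standardizer(config)

print(restored.schema)  # ['cell_type', 'genome']
```

Keeping the path in the config means a retrained model can be swapped in without touching the code that makes predictions.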

saanikat commented 4 months ago

Added model 1 (column value based) and model 2 (column value and header based): https://github.com/databio/bedmess/tree/master

saanikat commented 2 months ago

As per our discussion, the attr_standardizer model is in the bedmess repository: https://github.com/databio/bedmess/blob/master/attr_standardizer.py

As concluded in the meeting, all further issues will be created in the bedmess repository (https://github.com/databio/bedmess).