Closed by nsheff 2 months ago
Added model 1 (column-value based) and model 2 (column-value and header based). https://github.com/databio/bedmess/tree/master
As per the discussion, the attr_standardizer model is in the bedmess repository: https://github.com/databio/bedmess/blob/master/attr_standardizer.py
As concluded in the meeting, all further issues will be created in the bedmess repository (https://github.com/databio/bedmess).
@saanikat , here is a proposal for how to organize code for a standardized attribute name predictor.
## Generic infrastructure idea
Write a function that takes a PEP, or a PEPhub registry path, and produces predicted labels. I think the best home for this function would probably be inside `peppy` (but with all of its dependencies as an optional class of dependencies). The return value of `standardize_attr_names` could be a dict that maps each old attribute name to a proposal, where each proposal is itself a dictionary with keys corresponding to suggested names and values corresponding to prediction confidence.

## BEDbase application
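The return-value structure proposed above might look like this sketch (the attribute names and confidences are purely illustrative):

```python
# Hypothetical return value of standardize_attr_names: each old attribute
# name maps to {suggested_standardized_name: prediction_confidence}.
proposals = {
    "cell": {"cell_line": 0.92, "cell_type": 0.06},
    "tissue": {"tissue_type": 0.88},
}

# A caller could then keep only the top suggestion per attribute:
best = {old: max(opts, key=opts.get) for old, opts in proposals.items()}
print(best)  # {'cell': 'cell_line', 'tissue': 'tissue_type'}
```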
This is a generic system that could be used for any type of metadata table. We or others could train different models that would predict into different schemas. You would select which schema to predict into by supplying the appropriate model, each trained on a different dataset with a different schema.
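One way this model-selects-the-schema idea could surface in the API (a sketch; the function signature, model names, and predictions are all hypothetical):

```python
# Hypothetical: each trained model targets one metadata schema, so choosing
# the model chooses the schema you predict into. Real models would be loaded
# from disk; here they are stubbed as simple lookup functions.
MODELS = {
    "bed_minimal": lambda attr: {"cell_line": 0.9} if attr == "cell" else {},
    "encode": lambda attr: {"biosample": 0.8} if attr == "cell" else {},
}

def standardize_attr_names(attrs, model="bed_minimal"):
    """Map each attribute name to {suggested_name: confidence} under one schema."""
    predict = MODELS[model]
    return {a: predict(a) for a in attrs}

# Same input attribute, different target schema depending on the model:
print(standardize_attr_names(["cell"], model="bed_minimal"))
print(standardize_attr_names(["cell"], model="encode"))
```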
To apply this to a BEDbase task, we'd train an `AttrStandardizer` object using the curated BED datasets, with a curated minimal BED metadata schema. We'd save this model independently, then load it from disk whenever we need it to make a prediction. We'd add the path to the model to the BEDboss config and then integrate these kinds of predictions into BEDboss.
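The train-once, save, load-from-disk workflow could be sketched as follows; the real `AttrStandardizer` lives in the bedmess repository, so this stand-in class and its methods are purely illustrative:

```python
import os
import pickle
import tempfile

class AttrStandardizer:
    """Illustrative stand-in: maps raw attribute names to standardized ones."""

    def __init__(self):
        self.mapping = {}

    def train(self, curated_pairs):
        # curated_pairs: iterable of (raw_name, standardized_name) tuples,
        # e.g. derived from the curated BED datasets.
        for raw, std in curated_pairs:
            self.mapping[raw] = std

    def predict(self, attr_name):
        # Return {suggested_name: confidence}; a trivial lookup here.
        if attr_name in self.mapping:
            return {self.mapping[attr_name]: 1.0}
        return {}

    def save(self, path):
        # Persist the trained model so BEDboss can point at it via its config.
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

# Train once, save to disk, then load at prediction time.
model = AttrStandardizer()
model.train([("cell", "cell_line"), ("tissue", "tissue_type")])

path = os.path.join(tempfile.mkdtemp(), "bed_attr_model.pkl")
model.save(path)

loaded = AttrStandardizer.load(path)
print(loaded.predict("cell"))  # {'cell_line': 1.0}
```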