Have attributes of training dataset in the repository

skops-dev / skops

skops is a Python library helping you share your scikit-learn based models and put them in production

https://skops.readthedocs.io/en/stable/

MIT License

Have attributes of training dataset in the repository #266

Open merveenoyan opened 1 year ago

merveenoyan commented 1 year ago

The widget is cool and everything, but it's hard to see all the unique values of categorical variables, which variables are categorical, or the range of continuous columns. A couple of possible solutions:

Have attributes in config or README file
Have these in a separate file

Ping @skops-dev/maintainers

BenjaminBossan commented 1 year ago

I agree it would be useful to have this information.

Some questions I would have:

How would this information be collected? I don't think it's feasible to automatically derive it from the training data. Even if it's a pandas df, there is still room for ambiguity. Therefore, it sounds like the user would have to indicate the information.
What are all the different types that can exist? Categorical, ordinal, cardinal. How about time (at what resolution)? Text? Images? I don't think there is an agreed-upon standard for all feature types.
Is there a standard of how to represent these types? It would be good if we didn't have to invent something new.

Of course, we don't have to have everything right from the start, but we should have an idea of what this addition would entail. And to me, it looks like it's far from trivial.

adrinjalali commented 1 year ago

I think it'd make sense to have this in the README as a part of the model card. We could have some method to generate as much info as we can from a given input dataframe, for example.
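
A minimal sketch of what such a method might look like, assuming the input is a pandas DataFrame. The name `describe_columns` and the output layout are hypothetical and not part of skops. The dtype-based guess is deliberately naive: an integer-coded category would be reported as continuous, which is the kind of ambiguity raised earlier in the thread.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame) -> dict:
    """Hypothetical helper (not a skops API): summarize each column of df."""
    info = {}
    for name in df.columns:
        col = df[name]
        is_bool = pd.api.types.is_bool_dtype(col)
        is_numeric = pd.api.types.is_numeric_dtype(col)
        if is_bool or not is_numeric:
            # Non-numeric (or boolean) columns are guessed to be categorical;
            # record their distinct non-missing values.
            values = sorted(set(col.dropna().tolist()), key=str)
            info[name] = {"type": "categorical", "values": values}
        else:
            # Numeric columns are guessed to be continuous; record the range.
            info[name] = {
                "type": "continuous",
                "min": float(col.min()),
                "max": float(col.max()),
            }
    return info


df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.5, 3.0, 2.0]})
column_info = describe_columns(df)
```

Because the guess can be wrong, the user would presumably still need a way to override the inferred type for individual columns.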

BenjaminBossan commented 1 year ago

I think the reason why Merve wanted to have them in the config.json or a separate file is that this information could be used to improve the UI on the Hub. E.g. in the inference widget, if we know the distinct values of a categorical feature, the widget could let the user choose the value from a list. If this information were added to the README, it would be more difficult to extract.

adrinjalali commented 1 year ago

I see, for that I'm happy for that to be in a data-info.yml/json kinda file. We probably don't want to make the config file too large I guess?

merveenoyan commented 1 year ago

@adrinjalali I agree.

lazarust commented 1 year ago

@merveenoyan I'm happy to take this if it still needs to be done!

lazarust commented 1 year ago

@BenjaminBossan I'm happy to take this one but had a few thoughts/questions:

When should the file be generated?
Is there a list of data types that we want to support initially? You mentioned a couple above and I agree it would be pretty hard to have all of them since there isn't an agreed-upon standard.

BenjaminBossan commented 1 year ago

Thanks for taking an interest in the issue. I think there is no definite answer to your question. The initial motivation is to know in advance what options exist for categorical data to improve the widget, but I think Adrin made a good point about file size, which can easily get large if we just record all distinct values, so some kind of compromise would need to be found.
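
One possible compromise, sketched under assumptions: cap the number of distinct values recorded per categorical column and flag when the list was cut off. The cap of 20, the helper name `cap_values`, and the `truncated` key are all hypothetical; no file format has been agreed on in this thread.

```python
import json

MAX_VALUES = 20  # arbitrary cap, chosen only for illustration


def cap_values(values, max_values=MAX_VALUES):
    """Hypothetical helper: keep at most max_values entries and flag truncation."""
    values = list(values)
    return {"values": values[:max_values], "truncated": len(values) > max_values}


data_info = {
    # A high-cardinality column: only the first MAX_VALUES values are kept.
    "city": {"type": "categorical", **cap_values(f"city_{i}" for i in range(100))},
    # A low-cardinality column: stored in full.
    "color": {"type": "categorical", **cap_values(["blue", "red"])},
}

# The JSON text that could be stored in a separate data-info file.
serialized = json.dumps(data_info, indent=2)
```

A widget could then offer a dropdown when `truncated` is false and fall back to free-text input otherwise.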

Also, for this feature to make sense, we would need to do work on the widget side as well, for which there is currently no capacity AFAIK, so I would rather not work on this feature right now.

lazarust commented 1 year ago

@BenjaminBossan Sounds good! Is there another issue I could help out with?

BenjaminBossan commented 1 year ago

If this is something you're willing to jump into, I think we have some room to improve the skops.io persistence format. For instance, support for more external libraries could be added, like scikeras (#388) or skorch :)