Open merveenoyan opened 1 year ago
I agree it would be useful to have this information.
Some questions I would have:
Of course, we don't have to have everything right from the start, but we should have an idea of what this addition would entail. And to me, it looks like it's far from trivial.
I think it'd make sense to have this in the README as a part of the model card. We could have some method that generates as much info as it can from a given input dataframe, for example.
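To make the "generate info from an input dataframe" idea concrete, here is a minimal sketch. It is not an existing skops API: the function name `summarize_columns` and the output keys (`type`, `values`, `min`, `max`) are hypothetical. To stay dependency-free it takes a plain mapping of column name to list of values, which a pandas dataframe could supply via `df.to_dict("list")`.

```python
from numbers import Number


def summarize_columns(columns):
    """Summarize each column of a {name: list_of_values} mapping.

    Hypothetical helper: numeric columns get a min/max range, everything
    else is treated as categorical and gets its sorted distinct values.
    """
    info = {}
    for name, values in columns.items():
        # Ignore missing entries when deciding the column type.
        present = [v for v in values if v is not None]
        is_numeric = bool(present) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            info[name] = {"type": "continuous", "min": min(present), "max": max(present)}
        else:
            info[name] = {"type": "categorical", "values": sorted({str(v) for v in present})}
    return info


example = {"age": [22, 35, 58], "color": ["red", "blue", "red"]}
summary = summarize_columns(example)
```

How many distinct values to record, and how to handle high-cardinality columns, is left open here.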
I think the reason why Merve wanted to have them in the config.json or a separate file is that this information could be used to improve the UI on the Hub. E.g. in the inference widget, if we know the distinct values of a categorical feature, the widget could let the user choose the value from a list. If this information is added to the README, it would be more difficult to extract.
I see, for that I'm happy to have it in a data-info.yml/json kinda file. We probably don't want to make the config file too large, I guess?
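As a rough illustration of the separate-file option, the sketch below writes a column summary to its own JSON file next to the model files and reads it back. The file name `data-info.json` follows the suggestion above. The schema and the helper `write_data_info` are assumptions made for this example and are not part of skops or the Hub.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema: one entry per input column.
data_info = {
    "color": {"type": "categorical", "values": ["blue", "red"]},
    "age": {"type": "continuous", "min": 22, "max": 58},
}


def write_data_info(info, directory, filename="data-info.json"):
    """Write the summary as a standalone JSON file, keeping config.json small."""
    path = Path(directory) / filename
    path.write_text(json.dumps(info, indent=2, sort_keys=True), encoding="utf-8")
    return path


# A temporary directory stands in for the local model repository.
with tempfile.TemporaryDirectory() as tmp:
    path = write_data_info(data_info, tmp)
    loaded = json.loads(path.read_text(encoding="utf-8"))
```

Keeping the data in its own machine-readable file means a consumer such as the widget could parse it directly, without extracting anything from the README.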
@adrinjalali I agree.
@merveenoyan I'm happy to take this if it still needs to be done!
@BenjaminBossan I'm happy to take this one but had a few thoughts/questions:
Thanks for taking an interest in the issue. I think there is no definite answer to your question. The initial motivation is to know in advance what options exist for categorical data to improve the widget, but I think Adrin made a good point about file size, which can easily get large if we just record all distinct values, so some kind of compromise would need to be found.
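One possible compromise on file size, sketched under assumptions: record at most a fixed number of distinct values per categorical column and flag when the list was cut off. The function `cap_distinct`, the default limit of 20, and the output keys are all hypothetical choices for illustration only.

```python
def cap_distinct(values, max_values=20):
    """Return at most `max_values` distinct values plus truncation metadata."""
    distinct = sorted(set(values))
    return {
        "values": distinct[:max_values],        # what a widget could offer as choices
        "n_distinct": len(distinct),            # true cardinality, for reference
        "truncated": len(distinct) > max_values,
    }


capped = cap_distinct(["a", "b", "c", "a"], max_values=2)
```

A widget could fall back to free-text input whenever `truncated` is true, so that high-cardinality columns do not bloat the file.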
Also, for this feature to make sense, we would need to do work on the widget side as well, for which there is currently no capacity AFAIK, so I would rather not work on this feature right now.
@BenjaminBossan Sounds good! Is there another issue I could help out with?
If this is something you're willing to jump into, I think we have some room to improve the skops.io persistence format. For instance, support for more external libraries could be added, like scikeras (#388) or skorch :)
The widget is cool and everything, but it's hard to see all the unique values of categorical variables, which variables are categorical, or the range for continuous columns. A couple of solutions: