rstudio / vetiver-r

Version, share, deploy, and monitor models
https://rstudio.github.io/vetiver-r/
Other
180 stars 27 forks source link

For discussion: `ptype` naming convention could be confusing to python users #84

Closed SamEdwardes closed 1 year ago

SamEdwardes commented 2 years ago

I have had some exposure to both the Python and R vetiver packages. I want to start a discussion around the ptype parameter name (e.g. save_ptype, ptype_data, etc.).

I think it could be confusing for python users. In pandas, there is the notion of dtypes.

At first glance, I thought ptype was some how related. This is quickly cleared up if you read the docs... However, I think there could be some initial confusion for those who do not read the docs first.

Could it be reasonable to change the name to be more verbose? E.g. save_prototype and prototype_data? I think this would make it more clear that it is not related to types, and is instead related to prototypes. Going even one step further, is prototype the right word? When I think of prototypes I think of early / pre-release versions of a product.

isabelizimm commented 2 years ago

Thank you for the issue @SamEdwardes--this is a good thing to think through (and I am glad the docs helped the initial confusion!).

On the R-side, it looks like there is such thing as a ptype in hardhat, which is the same concept of a 0 row dataframe with column info. On the Python side, I agree that the ptype/dtype is really similar wording. This might be a positive word association for Python users (ptype getting types of data set's columns, vs dtype getting types of pd.DataFrame columns), but maybe it is too similar? If the docs clear up confusion, I'm leaning towards the idea that the current implementation is okay, but I'm open to other opinions!

WDYT @juliasilge?

juliasilge commented 2 years ago

Thank you for this discussion @SamEdwardes! I am generally trying to be better about avoiding abbreviations so I see the appeal of something like save_prototype. On the other hand, the ptype concept is used not only in hardhat but also vctrs, and if I remember right was specifically named ptype to evoke the dtype idea. I believe the main difference between the two is ptype including the column names as well. Maybe let's ponder this a bit and see if other folks find this troubling.

One thought I have here is that it may be better for the Python implementation to use something that is familiar to Python users the way R users run into ptype here and there, but I don't think we literally want to say dtype since we are keeping the column names and it is really a pydantic BaseModel.

SamEdwardes commented 2 years ago

Thank you for the discussion!

I am generally trying to be better about avoiding abbreviations so I see the appeal of something like save_prototype

I would vote for save_prototype, but I understand the reasoning!

One thought I have here is that it may be better for the Python implementation to use something that is familiar to Python users the way R users run into ptype here and there, but I don't think we literally want to say dtype since we are keeping the column names and it is really a pydantic BaseModel.

Agreed we definitely do not want to call it dtype, since it is not a dtype.

One interesting place to look for inspiration on how to handle defining data types for pandas dataframes is pandera. They use a pydantic inspired approach.

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series

class InputSchema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)

df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})

transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(invalid_df)

What about something like this for vetiver, where you pass in a Pydantic model to prototype_data?

The R equivalent could be similar to how you define column types in readr:

df3 <- read_csv(
  readr_example("challenge.csv"), 
  col_types = list(
    x = col_double(),
    y = col_date(format = "")
  )
)
isabelizimm commented 2 years ago

What about something like this for vetiver, where you pass in a Pydantic model to prototype_data?

Yes! We are actually using Pydantic's dynamic BaseModel creation on the python side for ptype, and I'm debating moving to Pandera to solve some issues around batch prediction. Would love to hash out this on the vetiver-python repo!

For naming, I agree that the familiarity of dtype vs ptype could be useful when contextualizing how these pieces work, and also can see why verbosity could help initial learning. +1 to "ponder this a bit and see if other folks find this troubling"

isabelizimm commented 1 year ago

Internally, it sounds like ptype still feels confusing/unfamiliar to Python users. Suggested other names are: schema, sample_data, (and open to others).

Would the divergence of these two names feel counter-intuitive for those spanning languages? If not, what would be an appropriate name change on the Python side?

juliasilge commented 1 year ago

FWIW changing the name for the argument on the R side is very much still on the table; ptype comes from vctrs and hardhat but I have learned in the intervening months since this issue started that most folks who get to deploying a model haven't run into that term so the value of having it the same may not be high. It's also not an argument required for R users so changing it would likely not cause too much friction.

MLflow calls this the model signature (well, almost -- they say the signature includes both inputs and outputs whereas we are focusing on input). Is using the same term as MLflow good or bad? I'm not sure.

I am pretty strongly against something that includes sample since that evokes sampling.

Another idea here is to change to prototype, giving us save_prototype like @SamEdwardes originally suggested. A benefit of this is that we would probably not need to change much of the internal and developer-facing functionality.

juliasilge commented 1 year ago

We had a discussion about this and decided to change the argument name to use the full word "prototype" whenever possible for both R and Python. This will look like prototype_data for Python, save_prototype for R, and so forth. For now, we aren't going to do a complete internal scrub of ptype for the internal functions and developer-facing functions; we can reevaluate if folks in those roles (adding new model to vetiver, for example) find that confusing.