pvn25 / ML-Data-Prep-Zoo


Definition of Custom-specific type #4

Open amueller opened 3 years ago

amueller commented 3 years ago

I tried to run dabl on the benchmark, but I get a lot of errors of the type custom/context-specific being classified as numeric/continuous by dabl. How did you determine this category? The dtype of many of these columns is float or int. In the training set I get 315 of these mistakes. I looked into one, https://www.kaggle.com/dorbicycle/world-foodfeed-production, in which several of the year columns belong to that category. These columns are a quantity, and dabl correctly identifies them as such, so it's a bit weird to count this as a mistake.
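For reference, a minimal sketch of the kind of check involved, assuming dabl's `detect_types` helper and illustrative file/column names (not the exact script used):

```python
import pandas as pd
from dabl.preprocessing import detect_types

# Illustrative: the Kaggle world food/feed production data, where year columns
# such as "Y2013" hold produced quantities.
df = pd.read_csv("world_foodfeed_production.csv")

# detect_types returns a boolean table: one row per column, one column per
# inferred type (continuous, categorical, free_string, ...).
types = detect_types(df)
print(types.loc["Y2013"])  # dabl flags this as continuous; the benchmark labels it context-specific
```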

Are you determining the category based on inter-annotator agreement? And if so, what were the annotators given exactly?

From the column name and values it might be hard to determine the exact type, but I think the benchmark should reflect the actual type whenever possible.

pvn25 commented 3 years ago

Hi, sorry for my late reply.

The annotation of the feature types was performed by looking at the base features of the column: the column name, 5 randomly sampled attribute values, and descriptive stats about the column (e.g., % distinct vals, % NaNs, mean, etc.). The complete list of base features is given in Section 3.3 of the tech report (https://adalabucsd.github.io/papers/TR_2021_SortingHat.pdf).
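A rough sketch of what computing such base features for a single pandas column might look like (the helper and the exact stats here are illustrative; the authoritative list is in Section 3.3 of the report):

```python
import pandas as pd

def base_features(col: pd.Series, n_samples: int = 5) -> dict:
    """Illustrative approximation of per-column base features:
    column name, a few sampled values, and simple descriptive stats."""
    non_null = col.dropna()
    as_numeric = pd.to_numeric(col, errors="coerce")
    return {
        "name": col.name,
        "sample_values": non_null.sample(min(n_samples, len(non_null)), random_state=0).tolist(),
        "pct_distinct": 100 * col.nunique() / len(col),
        "pct_nan": 100 * col.isna().mean(),
        "mean": as_numeric.mean(),  # NaN for non-numeric columns
        "std": as_numeric.std(),
    }
```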

Thus, annotators did not inspect things like the source description pages or data dictionaries, nor speak to the data creators when performing labeling. We had many inter-annotator discussions to resolve the labels for ambiguous examples. We label a column as Context-Specific when either it was really hard for us to determine its feature type or it contained complex objects like nested objects, locations, addresses, etc. Thus, it requires user intervention to be handled appropriately.

You are correct that Context-Specific columns can very well be Numeric/Categorical in truth. But, looking at the base features, it was hard for us to recognize them correctly. We label something as Numeric/Categorical only if we are confident in the labeling, since we see in our downstream benchmark that wrong type inference can affect downstream ML accuracy significantly.

I agree that it's not dabl's fault if it is indeed able to recognize the types correctly. This is why we binarize the vocabularies in our benchmark, with the metrics being class-level precision, recall, 2-class binarized accuracy, and F1 score. Also, the vocabularies of the compared tools are different. Thus, it makes sense to look at not just one overall accuracy number but multiple accuracy metrics, which gives more confidence in the class predictions. To conclude, I think the Context-Specific type can be excluded from the comparison, just as we exclude AutoGluon and TFDV on Context-Specific.
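As a concrete sketch of the binarized, class-level evaluation described above (toy labels only; this is not the benchmark's actual scoring code):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy ground-truth and predicted feature types for a handful of columns.
y_true = ["numeric", "categorical", "context-specific", "numeric", "categorical"]
y_pred = ["numeric", "numeric", "numeric", "numeric", "categorical"]

for cls in sorted(set(y_true)):
    # Binarize the label vocabulary: the current class vs. everything else.
    t = [int(y == cls) for y in y_true]
    p = [int(y == cls) for y in y_pred]
    print(f"{cls:17s} "
          f"precision={precision_score(t, p, zero_division=0):.2f} "
          f"recall={recall_score(t, p, zero_division=0):.2f} "
          f"f1={f1_score(t, p, zero_division=0):.2f} "
          f"binarized accuracy={accuracy_score(t, p):.2f}")
```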

amueller commented 3 years ago

Thank you for your reply!

> But, looking at the base features, it was hard for us to recognize them correctly.

What stands out to me about this is that the annotation is somewhat tied to your method. Other methods might use different features, and so might end up with different decisions. I think the benchmark would be more generally useful if there were a way to annotate "plausible truth" instead of "can be determined from the features we use".

Maybe the annotation could be multi-label if multiple interpretations are possible? Not sure if that makes sense.
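For instance, a lenient scoring rule under such a multi-label annotation might count a prediction as correct if it matches any of the plausible types (all names below are hypothetical):

```python
# Hypothetical multi-label annotations: every plausible type per column.
plausible_labels = {
    "Y2013": {"numeric", "context-specific"},
    "Area": {"categorical"},
    "Item Code": {"categorical", "context-specific"},
}
predictions = {"Y2013": "numeric", "Area": "categorical", "Item Code": "numeric"}

# A prediction counts as correct if it matches any annotated plausible type.
correct = sum(pred in plausible_labels[col] for col, pred in predictions.items())
print(f"lenient accuracy: {correct / len(predictions):.2f}")  # 2 of 3 correct here
```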