pvn25 / ML-Data-Prep-Zoo

Apache License 2.0

Some specific mislabeled cases #7

Open amueller opened 3 years ago

amueller commented 3 years ago

In sunroof_solar_potential_by_censustract.csv, install_size_kw_buckets is labeled as continuous but should be a list.

In san_francisco_street_trees.csv, location is a geo-location stored as a tuple, so I guess it should be labeled as a list, but it is labeled as continuous.

amueller commented 3 years ago

In articles2.csv, "year" and "month" are labeled categorical. dabl labels them continuous. One could argue they should be labeled as date. Any thoughts?

amueller commented 3 years ago

These here are examples of columns tagged as numbers that I would have called numeric:

     Attribute_name                  name
6856           2007  defensive_asylum.csv
6857           2008  defensive_asylum.csv
7252      Dot Balls    most-dot-balls.csv
7775         Count3       girls names.csv

They are of the form "12,345", and dabl calls them continuous.

amueller commented 3 years ago

In constructors.csv and some related files, I'm not sure how the ids should be labeled. For example, in constructorResults.csv we have constructorResultsId, and that is labeled as not generalizable. It's clearly an id of some sort. But if you plot it, you can see this:

[image]

So the order of the columns seems to be non-random. I find it really hard to determine if there's signal in this or not. How do you make a judgement in this case?

amueller commented 3 years ago

In HackerRank-Developer-Survey-2018-Values.csv there are many columns that have three possible values: some string, "#NULL", and missing (for example, "q17HirChaHardAssessSkills"). I would consider these categorical, but they are labeled as "Not Generalizable".

pvn25 commented 3 years ago

> sunroof_solar_potential_by_censustract

This example is a mislabel, I will correct it in the updated version of the dataset. Thank you for pointing this out.

> In articles2.csv, "year" and "month" are labeled categorical. dabl labels them continuous. One could argue they should be labeled as date. Any thoughts?

Our 9-class vocabulary governs how the ML feature types will be consumed by downstream (Auto) ML. We label a column as Datetime when it is possible to extract (temporal) features out of it. Since year and month are ordinal and can be used directly as-is for the downstream ML, we label them as Categorical.
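As a rough illustration of that distinction (a sketch in plain Python, not the project's labeling code), a raw timestamp string has to be parsed before temporal features can be extracted from it, whereas standalone year and month columns are already usable values. The helper name `extract_temporal_features` and the timestamp format are assumptions made for this example.

```python
from datetime import datetime


def extract_temporal_features(raw: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> dict:
    """Parse a timestamp string and derive simple temporal features.

    Hypothetical helper: the format string is an assumption, not something
    taken from the dataset.
    """
    ts = datetime.strptime(raw, fmt)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.weekday(),  # Monday == 0 ... Sunday == 6
        "hour": ts.hour,
    }


# A Datetime-style value needs this extraction step first.
features = extract_temporal_features("2016-12-31 23:15:00")

# Standalone year/month values need no parsing at all.
year_column = [2016, 2017, 2017]
month_column = [12, 1, 2]
```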

--

> These here are examples of columns tagged as numbers that I would have called numeric:
>
>      Attribute_name                  name
> 6856           2007  defensive_asylum.csv
> 6857           2008  defensive_asylum.csv
> 7252      Dot Balls    most-dot-balls.csv
> 7775         Count3       girls names.csv
>
> They are of the form "12,345", and dabl calls them continuous.

We call a column Numeric if it can be used directly as continuous by AutoML, without any additional processing. The columns that contain numbers with commas would require custom extraction (thus, labeled as Embedded Number) before being used as a numeric feature by all the AutoML tools that we studied. Note that the ultimate purpose of our vocabulary is to dictate how to feed signals to the downstream models without any user-in-the-loop. I'll add the discussion of the semantics behind this labeling in the tech report to make the distinctions precise for a new reader.
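A minimal sketch of why a value like "12,345" needs an extraction step, using only Python's built-in `float`. This stands in for whatever parsing a given AutoML tool performs, which may differ; the function name `extract_embedded_number` is hypothetical.

```python
def extract_embedded_number(value: str) -> float:
    """Strip thousands separators, then convert (hypothetical helper)."""
    return float(value.replace(",", ""))


raw = "12,345"

# Direct conversion fails: float() does not accept commas.
try:
    float(raw)
    direct_ok = True
except ValueError:
    direct_ok = False

# After the custom extraction step the value is usable as a number.
extracted = extract_embedded_number(raw)
```

Whether a column counts as directly usable therefore depends on how forgiving the consuming parser is, which is the distinction the Embedded Number label is meant to capture.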

--

> In constructors.csv and some related files, I'm not sure how the ids should be labeled. For example, in constructorResults.csv we have constructorResultsId, and that is labeled as not generalizable. It's clearly an id of some sort. But if you plot it, you can see this: [image] So the order of the columns seems to be non-random. I find it really hard to determine if there's signal in this or not. How do you make a judgement in this case?

To make a judgment, we look only at the base features: the column name, 5 sample values, and descriptive statistics such as % distinct values.

With this example, the different signals we can look at are: the column name, which is constructorResultsId (an id column), total vals (11142), min val (1), max val (15639), and % distinct values (100%). Thus, it's quite likely that the column would offer no discriminative power for new ids. Having said that, it is indeed possible to obtain some features from it, perhaps with additional processing or specialized domain knowledge about the data.

So in the AutoML setting, where we want to automate end-to-end ML, this would require user intervention, which we don't want. If we are quite confident that a feature offers little or no discriminative power, then it would perhaps make sense to exclude it rather than ask the user to handle it manually.
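The base features mentioned above can be sketched roughly as follows. This is an illustrative stand-in, not the project's actual featurization: the function names, the id heuristic (name ends in "id" and every value is unique), and the toy values are all assumptions.

```python
def describe_column(name, values):
    """Compute the kind of base features discussed above (hypothetical helper).

    Assumes at least one non-missing value.
    """
    non_missing = [v for v in values if v is not None]
    return {
        "name": name,
        "samples": non_missing[:5],
        "total_vals": len(non_missing),
        "min_val": min(non_missing),
        "max_val": max(non_missing),
        "pct_distinct": 100.0 * len(set(non_missing)) / len(non_missing),
    }


def looks_like_id(stats):
    """Assumed heuristic: id-like name and 100% distinct values."""
    return stats["name"].lower().endswith("id") and stats["pct_distinct"] == 100.0


# Toy values, not the real contents of constructorResults.csv.
id_stats = describe_column("constructorResultsId", [1, 2, 3, 5, 8, 13])
count_stats = describe_column("Count3", [4, 4, 7, None])
```

A fixed rule like this is only a caricature of a learned or human judgment; it is meant to show which signals are available, not how they are weighed.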

--

> In HackerRank-Developer-Survey-2018-Values.csv there are many columns that have three possible values: some string, "#NULL", and missing (for example, "q17HirChaHardAssessSkills"). I would consider these categorical, but they are labeled as "Not Generalizable".

We see that in real-world datasets, values such as "#NULL" and "-999" are used to denote missing values. We label a column as Categorical if the entities belong to a closed real-world non-empty domain. If the column has only 1 non-empty value, then we would label it as Not-generalizable. I think adding the discussion of the semantics behind this labeling in the tech report would make these distinctions much clearer.
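A small sketch of the "only 1 non-empty value" check, assuming a fixed set of sentinel strings. "#NULL" and "-999" come from the discussion above; treating the empty string as missing, the function names, and the sample values are assumptions for illustration.

```python
from typing import Iterable, Optional, Set

# Assumed sentinel set; real datasets use many other conventions.
MISSING_TOKENS = {"", "#NULL", "-999"}


def non_empty_values(values: Iterable[Optional[str]]) -> Set[str]:
    """Return the distinct values left after dropping missing markers."""
    return {
        v.strip()
        for v in values
        if v is not None and v.strip() not in MISSING_TOKENS
    }


def has_single_non_empty_value(values: Iterable[Optional[str]]) -> bool:
    """True when exactly one real value remains (hypothetical helper)."""
    return len(non_empty_values(values)) == 1


# Toy columns, not taken from the survey file.
survey_column = ["Hard to assess skills", "#NULL", None, "Hard to assess skills"]
two_value_column = ["Male", "Female", None, "-999"]
```

Under this rule the first toy column would fall under Not-generalizable, while the second still has a domain of two real values.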

Thank you for all these comments. They are super useful and valuable for us.

amueller commented 3 years ago

Thank you for all the details. I think a couple of them were me just missing some of the context. Thank you for providing more background on the definitions. I need to go back through this and see if it all makes sense to me as well.

> The columns that contain numbers with commas would require custom extraction (thus, labeled as Embedded Number) before being used as a numeric feature by all the AutoML tools that we studied.

That makes sense. However, it ties the definition tightly to the AutoML tools that you studied. What is the baseline that's required for continuous, something like pd.astype? Dabl right now will handle some of these cases automatically and so might not distinguish all of them.

> Since year and month are ordinal and can be used directly as-is for the downstream ML, we label them as Categorical.

Well, ordinal is not the same as categorical ;) Ordinal could be treated as continuous or categorical, but ideally as ordinal. If you treat it as categorical, you're losing the order information. Therefore I would argue that categorical and continuous are both valid choices, depending on the model.
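One way to see the lost order information, as a plain-Python sketch: with a one-hot (categorical) encoding every pair of distinct months is equally far apart, while an integer encoding keeps January closer to February than to December. The helper names are made up for this example.

```python
MONTHS = list(range(1, 13))


def one_hot(month: int) -> list:
    """Categorical encoding: one indicator per month."""
    return [1 if m == month else 0 for m in MONTHS]


def hamming(a: list, b: list) -> int:
    """Number of positions where two encodings differ."""
    return sum(x != y for x, y in zip(a, b))


# One-hot: all distinct months look equally different.
onehot_jan_feb = hamming(one_hot(1), one_hot(2))
onehot_jan_dec = hamming(one_hot(1), one_hot(12))

# Integer: distance follows the ordering.
int_jan_feb = abs(1 - 2)
int_jan_dec = abs(1 - 12)
```

Neither encoding captures that December and January are adjacent in the calendar, which is one more reason the right choice depends on the model.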

> thus, it's quite likely that the column would offer no discriminative power for new ids.

I agree with the assessment, but I find it hard to crisply define it. Also, I thought "custom-specific" would be user intervention, not "not generalizable".

> If the column has only 1 non-empty value, then we would label it as Not-generalizable.

I would question this from an AutoML perspective. If there's a survey that has "Male", "Female" and you only see "Male" and "Missing" in the data, then seeing "Male" is still potentially informative.

amueller commented 3 years ago

Not sure if you saw "san_francisco_street_trees.csv location is a geo-location that's a tuple, so I guess it should be labeled as a list, but is labeled as continuous" and just didn't mention it? Sorry, this one was a hodgepodge of lots of stuff.

With your more detailed explanations I'll go back and try to figure out if there are still things I disagree with ;)

Can you let me know when you post your updated tech report?

Again, thanks for all the feedback and all the amazing work!