openml / openml-data

For tracking issues related to OpenML datasets

tecator dataset has three targets: openML version has two targets added as predictors #44

Open gsverhoeven opened 2 years ago

gsverhoeven commented 2 years ago

Hi there,

The tecator dataset is part of OpenML-Reg19, a (work in progress) suite of Regression datasets.

The dataset on OpenML has the fat variable as its target. It turns out that moisture and protein are included in the dataset as predictors, which otherwise contain only absorbances from a spectrometer. I found that moisture and protein are highly predictive of fat, so there is no need to include the absorbances for optimal prediction.

Curious, I checked the documentation of the dataset. It turns out that this dataset, as used in the literature, contains three targets for prediction, with the idea being to use only the absorbances as predictors.

The original publication for this dataset is here (behind a paywall).

https://pubs.acs.org/doi/pdf/10.1021/ac00029a018

I checked, and there fat was predicted using only the absorbances.

So, to be able to compare with the published literature for this dataset, it makes sense to leave moisture and protein out of the predictors.
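To make the proposal concrete, here is a minimal sketch of the split I have in mind, in plain Python. The column names (`absorbance_1`, `moisture`, `protein`, `fat`) and the numbers in the toy rows are assumptions for illustration only; the actual feature names on OpenML should be checked before using this.

```python
# Sketch only: column names and values are illustrative assumptions,
# not taken from the real OpenML tecator dataset.
TARGET = "fat"
OTHER_TARGETS = ("moisture", "protein")


def split_predictors(rows, target=TARGET, exclude=OTHER_TARGETS):
    """Return (X, y): X drops the target and the other two targets, y holds the target."""
    dropped = {target, *exclude}
    X = [{name: value for name, value in row.items() if name not in dropped}
         for row in rows]
    y = [row[target] for row in rows]
    return X, y


# Two made-up rows with only two absorbance channels, to keep the example short.
rows = [
    {"absorbance_1": 2.6, "absorbance_2": 2.7, "moisture": 60.5, "protein": 16.7, "fat": 22.5},
    {"absorbance_1": 2.8, "absorbance_2": 2.9, "moisture": 46.0, "protein": 13.5, "fat": 40.1},
]

X, y = split_predictors(rows)
print(sorted(X[0]))  # only the absorbance columns remain
print(y)
```

With this split, a model only ever sees the absorbances, which is the setup used in the original publication.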

Any thoughts on how to incorporate this in the OpenML framework? Can we remove the two other targets from tecator? Or would this make it a new dataset? But if every subset of variables of a dataset had to be added to OpenML as a new dataset, a lot of duplication would occur, right?

PS: here is the summary documentation from caret (https://rdrr.io/cran/caret/man/tecator.html), where tecator is also included:

"For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry." 

Regards, Gertjan

joaquinvanschoren commented 2 years ago

Hi Gertjan,

The easiest option is indeed to create a new version of the dataset. If you upload it with the same name, it will be registered as a new version.
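A rough sketch of what that could look like with the openml-python package, assuming it is installed and an API key is configured. The helper names, the column names, the toy row, and the description text are all hypothetical; `openml.datasets.create_dataset` and `publish()` are the only library calls used, and the upload function is defined but not called here.

```python
def new_version_kwargs(columns, rows, target="fat", drop=("moisture", "protein")):
    """Build keyword arguments for openml.datasets.create_dataset (hypothetical helper)."""
    keep = [i for i, name in enumerate(columns) if name not in drop]
    return {
        "name": "tecator",  # same name, so it should register as a new version
        "description": "tecator without moisture and protein as predictors",  # placeholder text
        "creator": None,          # fill in from the original dataset page
        "contributor": None,
        "collection_date": None,
        "language": "English",
        "licence": None,
        "attributes": [(columns[i], "REAL") for i in keep],
        "data": [[row[i] for i in keep] for row in rows],
        "default_target_attribute": target,
        "ignore_attribute": None,
        "citation": "https://pubs.acs.org/doi/pdf/10.1021/ac00029a018",
    }


def publish_new_version(kwargs):
    """Upload the dataset; needs the third-party openml package and a configured API key."""
    import openml  # imported lazily so the sketch above runs without it

    dataset = openml.datasets.create_dataset(**kwargs)
    dataset.publish()
    return dataset


# Toy input with assumed column names and made-up values.
columns = ["absorbance_1", "moisture", "protein", "fat"]
rows = [[2.6, 60.5, 16.7, 22.5]]
kwargs = new_version_kwargs(columns, rows)
print(kwargs["attributes"])
```

Calling `publish_new_version(kwargs)` would then perform the actual upload; the metadata fields left as `None` should be copied from the existing tecator entry first.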