uci-ml-repo / ucimlrepo

Python package for dataset imports from UCI ML Repository
MIT License

Unable to import abalone data via Python #2

Closed (ducknificient closed this issue 11 months ago)

ducknificient commented 11 months ago

I tried to run the abalone example, but it fails in Google Colab:

from ucimlrepo import fetch_ucirepo 

# fetch dataset 
abalone = fetch_ucirepo(id=1) 

# data (as pandas dataframes) 
X = abalone.data.features 
y = abalone.data.targets 

# metadata 
print(abalone.metadata) 

# variable information 
print(abalone.variables) 

Here's the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-ed502c7fdb99> in <cell line: 4>()
      2 
      3 # fetch dataset
----> 4 abalone = fetch_ucirepo(id=1)
      5 
      6 # data (as pandas dataframes)

/usr/local/lib/python3.10/dist-packages/ucimlrepo/fetch.py in fetch_ucirepo(name, id)
    146     # make nested metadata fields accessible via dot notation
    147     metadata['additional_info'] = dotdict(metadata['additional_info'])
--> 148     metadata['intro_paper'] = dotdict(metadata['intro_paper'])
    149 
    150     # construct result object

TypeError: 'NoneType' object is not iterable
ducknificient commented 11 months ago

It turns out that the intro_paper value is null in the metadata:

{
  "uci_id": 1,
  "name": "Abalone",
  "repository_url": "https://archive.ics.uci.edu/dataset/1/abalone",
  "data_url": "https://archive.ics.uci.edu/static/public/1/data.csv",
  "abstract": "Predict the age of abalone from physical measurements",
  "area": "Life Science",
  "tasks": ["Classification", "Regression"],
  "characteristics": ["Tabular"],
  "num_instances": 4177,
  "num_features": 8,
  "feature_types": ["Categorical", "Integer", "Real"],
  "demographics": [],
  "target_col": ["Rings"],
  "index_col": null,
  "has_missing_values": "no",
  "missing_values_symbol": null,
  "year_of_dataset_creation": 1994,
  "last_updated": "Mon Aug 28 2023",
  "dataset_doi": "10.24432/C55C7W",
  "creators": [
    "Warwick Nash",
    "Tracy Sellers",
    "Simon Talbot",
    "Andrew Cawthorn",
    "Wes Ford"
  ],
  "intro_paper": null,
  "variables": [
    {
      "name": "Sex",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": "M, F, and I (infant)",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "Length",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "Longest shell measurement",
      "units": "mm",
      "missing_values": "no"
    },
    {
      "name": "Diameter",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "perpendicular to length",
      "units": "mm",
      "missing_values": "no"
    },
    {
      "name": "Height",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "with meat in shell",
      "units": "mm",
      "missing_values": "no"
    },
    {
      "name": "Whole_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "whole abalone",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Shucked_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "weight of meat",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Viscera_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "gut weight (after bleeding)",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Shell_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "after being dried",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Rings",
      "role": "Target",
      "type": "Integer",
      "demographic": null,
      "description": "+1.5 gives the age in years",
      "units": null,
      "missing_values": "no"
    }
  ],
  "additional_info": {
    "summary": "Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task.  Other measurements, which are easier to obtain, are used to predict the age.  Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.\r\n\r\nFrom the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).",
    "purpose": null,
    "funded_by": null,
    "instances_represent": null,
    "recommended_data_splits": null,
    "sensitive_data": null,
    "preprocessing_description": null,
    "variable_info": "Given is the attribute name, attribute type, the measurement unit and a brief description.  The number of rings is the value to predict: either as a continuous value or as a classification problem.\r\n\r\nName / Data Type / Measurement Unit / Description\r\n-----------------------------\r\nSex / nominal / -- / M, F, and I (infant)\r\nLength / continuous / mm / Longest shell measurement\r\nDiameter\t/ continuous / mm / perpendicular to length\r\nHeight / continuous / mm / with meat in shell\r\nWhole weight / continuous / grams / whole abalone\r\nShucked weight / continuous\t / grams / weight of meat\r\nViscera weight / continuous / grams / gut weight (after bleeding)\r\nShell weight / continuous / grams / after being dried\r\nRings / integer / -- / +1.5 gives the age in years\r\n\r\nThe readme file contains attribute statistics.",
    "citation": null
  }
}
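
The traceback and the metadata above fit together: a dict cannot be built from None, so converting a null intro_paper fails. Below is a minimal sketch that reproduces the error and shows one possible guard. The dotdict class here is a stand-in written for this example (assumed to be a dict subclass with attribute access, as the "dot notation" comment in fetch.py suggests), and to_dotdict is a hypothetical helper, not part of ucimlrepo.

```python
class dotdict(dict):
    """Stand-in for the library's helper (assumption): a dict with attribute access."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


def to_dotdict(value):
    """Hypothetical guard: wrap mappings, pass None through unchanged."""
    return dotdict(value) if value is not None else None


metadata = {
    "additional_info": {
        "summary": "Predicting the age of abalone from physical measurements.",
        "citation": None,
    },
    "intro_paper": None,  # as in the Abalone metadata above
}

# Unguarded conversion raises the same kind of TypeError as in the traceback.
try:
    dotdict(metadata["intro_paper"])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Guarded conversion keeps None and still wraps real dicts.
metadata["additional_info"] = to_dotdict(metadata["additional_info"])
metadata["intro_paper"] = to_dotdict(metadata["intro_paper"])
print(metadata["intro_paper"])
print(metadata["additional_info"].summary)
```

With a guard like this, callers have to check for None before reading fields such as intro_paper.title, but fetching the dataset itself no longer fails.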

Here's an example of working JSON that has an intro paper (id=45):

{
  "uci_id": 45,
  "name": "Heart Disease",
  "repository_url": "https://archive.ics.uci.edu/dataset/45/heart+disease",
  "data_url": "https://archive.ics.uci.edu/static/public/45/data.csv",
  "abstract": "4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach",
  "area": "Life Science",
  "tasks": ["Classification"],
  "characteristics": ["Multivariate"],
  "num_instances": 303,
  "num_features": 13,
  "feature_types": ["Categorical", "Integer", "Real"],
  "demographics": ["Age", "Sex"],
  "target_col": ["num"],
  "index_col": null,
  "has_missing_values": "yes",
  "missing_values_symbol": "NaN",
  "year_of_dataset_creation": 1989,
  "last_updated": "Mon Aug 28 2023",
  "dataset_doi": "10.24432/C52P4X",
  "creators": [
    "Andras Janosi",
    "William Steinbrunn",
    "Matthias Pfisterer",
    "Robert Detrano"
  ],
  "intro_paper": {
    "title": "International application of a new probability algorithm for the diagnosis of coronary artery disease.",
    "authors": "R. Detrano, A. J\u00e1nosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher",
    "published_in": "American Journal of Cardiology",
    "year": 1989,
    "url": "https://www.semanticscholar.org/paper/a7d714f8f87bfc41351eb5ae1e5472f0ebbe0574",
    "doi": null
  },
  "variables": [
    {
      "name": "age",
      "role": "Feature",
      "type": "Integer",
      "demographic": "Age",
      "description": null,
      "units": "years",
      "missing_values": "no"
    },
    {
      "name": "sex",
      "role": "Feature",
      "type": "Categorical",
      "demographic": "Sex",
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "cp",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "trestbps",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "resting blood pressure (on admission to the hospital)",
      "units": "mm Hg",
      "missing_values": "no"
    },
    {
      "name": "chol",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "serum cholestoral",
      "units": "mg/dl",
      "missing_values": "no"
    },
    {
      "name": "fbs",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": "fasting blood sugar > 120 mg/dl",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "restecg",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "thalach",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "maximum heart rate achieved",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "exang",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": "exercise induced angina",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "oldpeak",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "ST depression induced by exercise relative to rest",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "slope",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "ca",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "number of major vessels (0-3) colored by flourosopy",
      "units": null,
      "missing_values": "yes"
    },
    {
      "name": "thal",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "yes"
    },
    {
      "name": "num",
      "role": "Target",
      "type": "Integer",
      "demographic": null,
      "description": "diagnosis of heart disease",
      "units": null,
      "missing_values": "no"
    }
  ],
  "additional_info": {
    "summary": "This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.  In particular, the Cleveland database is the only one that has been used by ML researchers to date.  The \"goal\" field refers to the presence of heart disease in the patient.  It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).  \n   \nThe names and social security numbers of the patients were recently removed from the database, replaced with dummy values.\n\nOne file has been \"processed\", that one containing the Cleveland database.  All four unprocessed files also exist in this directory.\n\nTo see Test Costs (donated by Peter Turney), please see the folder \"Costs\" ",
    "purpose": null,
    "funded_by": null,
    "instances_represent": null,
    "recommended_data_splits": null,
    "sensitive_data": null,
    "preprocessing_description": null,
    "variable_info": "Only 14 attributes used:\r\n      1. #3  (age)       \r\n      2. #4  (sex)       \r\n      3. #9  (cp)        \r\n      4. #10 (trestbps)  \r\n      5. #12 (chol)      \r\n      6. #16 (fbs)       \r\n      7. #19 (restecg)   \r\n      8. #32 (thalach)   \r\n      9. #38 (exang)     \r\n      10. #40 (oldpeak)   \r\n      11. #41 (slope)     \r\n      12. #44 (ca)        \r\n      13. #51 (thal)      \r\n      14. #58 (num)       (the predicted attribute)\r\n\r\nComplete attribute documentation:\r\n      1 id: patient identification number\r\n      2 ccf: social security number (I replaced this with a dummy value of 0)\r\n      3 age: age in years\r\n      4 sex: sex (1 = male; 0 = female)\r\n      5 painloc: chest pain location (1 = substernal; 0 = otherwise)\r\n      6 painexer (1 = provoked by exertion; 0 = otherwise)\r\n      7 relrest (1 = relieved after rest; 0 = otherwise)\r\n      8 pncaden (sum of 5, 6, and 7)\r\n      9 cp: chest pain type\r\n        -- Value 1: typical angina\r\n        -- Value 2: atypical angina\r\n        -- Value 3: non-anginal pain\r\n        -- Value 4: asymptomatic\r\n     10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)\r\n     11 htn\r\n     12 chol: serum cholestoral in mg/dl\r\n     13 smoke: I believe this is 1 = yes; 0 = no (is or is not a smoker)\r\n     14 cigs (cigarettes per day)\r\n     15 years (number of years as a smoker)\r\n     16 fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)\r\n     17 dm (1 = history of diabetes; 0 = no such history)\r\n     18 famhist: family history of coronary artery disease (1 = yes; 0 = no)\r\n     19 restecg: resting electrocardiographic results\r\n        -- Value 0: normal\r\n        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)\r\n        -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria\r\n     20 ekgmo (month of exercise 
ECG reading)\r\n     21 ekgday(day of exercise ECG reading)\r\n     22 ekgyr (year of exercise ECG reading)\r\n     23 dig (digitalis used furing exercise ECG: 1 = yes; 0 = no)\r\n     24 prop (Beta blocker used during exercise ECG: 1 = yes; 0 = no)\r\n     25 nitr (nitrates used during exercise ECG: 1 = yes; 0 = no)\r\n     26 pro (calcium channel blocker used during exercise ECG: 1 = yes; 0 = no)\r\n     27 diuretic (diuretic used used during exercise ECG: 1 = yes; 0 = no)\r\n     28 proto: exercise protocol\r\n          1 = Bruce     \r\n          2 = Kottus\r\n          3 = McHenry\r\n          4 = fast Balke\r\n          5 = Balke\r\n          6 = Noughton \r\n          7 = bike 150 kpa min/min  (Not sure if \"kpa min/min\" is what was written!)\r\n          8 = bike 125 kpa min/min  \r\n          9 = bike 100 kpa min/min\r\n         10 = bike 75 kpa min/min\r\n         11 = bike 50 kpa min/min\r\n         12 = arm ergometer\r\n     29 thaldur: duration of exercise test in minutes\r\n     30 thaltime: time when ST measure depression was noted\r\n     31 met: mets achieved\r\n     32 thalach: maximum heart rate achieved\r\n     33 thalrest: resting heart rate\r\n     34 tpeakbps: peak exercise blood pressure (first of 2 parts)\r\n     35 tpeakbpd: peak exercise blood pressure (second of 2 parts)\r\n     36 dummy\r\n     37 trestbpd: resting blood pressure\r\n     38 exang: exercise induced angina (1 = yes; 0 = no)\r\n     39 xhypo: (1 = yes; 0 = no)\r\n     40 oldpeak = ST depression induced by exercise relative to rest\r\n     41 slope: the slope of the peak exercise ST segment\r\n        -- Value 1: upsloping\r\n        -- Value 2: flat\r\n        -- Value 3: downsloping\r\n     42 rldv5: height at rest\r\n     43 rldv5e: height at peak exercise\r\n     44 ca: number of major vessels (0-3) colored by flourosopy\r\n     45 restckm: irrelevant\r\n     46 exerckm: irrelevant\r\n     47 restef: rest raidonuclid (sp?) 
ejection fraction\r\n     48 restwm: rest wall (sp?) motion abnormality\r\n        0 = none\r\n        1 = mild or moderate\r\n        2 = moderate or severe\r\n        3 = akinesis or dyskmem (sp?)\r\n     49 exeref: exercise radinalid (sp?) ejection fraction\r\n     50 exerwm: exercise wall (sp?) motion \r\n     51 thal: 3 = normal; 6 = fixed defect; 7 = reversable defect\r\n     52 thalsev: not used\r\n     53 thalpul: not used\r\n     54 earlobe: not used\r\n     55 cmo: month of cardiac cath (sp?)  (perhaps \"call\")\r\n     56 cday: day of cardiac cath (sp?)\r\n     57 cyr: year of cardiac cath (sp?)\r\n     58 num: diagnosis of heart disease (angiographic disease status)\r\n        -- Value 0: < 50% diameter narrowing\r\n        -- Value 1: > 50% diameter narrowing\r\n        (in any major vessel: attributes 59 through 68 are vessels)\r\n     59 lmt\r\n     60 ladprox\r\n     61 laddist\r\n     62 diag\r\n     63 cxmain\r\n     64 ramus\r\n     65 om1\r\n     66 om2\r\n     67 rcaprox\r\n     68 rcadist\r\n     69 lvx1: not used\r\n     70 lvx2: not used\r\n     71 lvx3: not used\r\n     72 lvx4: not used\r\n     73 lvf: not used\r\n     74 cathef: not used\r\n     75 junk: not used\r\n     76 name: last name of patient  (I replaced this with the dummy string \"name\")",
    "citation": null
  }
}
s0564386 commented 11 months ago

I have the same issue with Statlog (German Credit Data), but even after I added the checks I was still unable to use the dataset.

Phil1216 commented 11 months ago

Similar issue for Iris and Adult. Found any workarounds?

ducknificient commented 11 months ago

> I have the same issue with Statlog (German Credit Data), but even after I added the checks I was still unable to use the dataset.

I tested PR #3 and PR #4 and both work. Are you using the module installed via pip3 or a local copy?
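
One way to answer the pip-versus-local question is to ask Python which copy it would import. This is only a sketch using the standard library; describe_install is a hypothetical helper name. A path under dist-packages or site-packages (like the one in the traceback above) suggests a pip install, while a path inside a cloned repository suggests a local copy.

```python
from importlib import metadata as importlib_metadata
from importlib import util as importlib_util


def describe_install(package_name, module_name=None):
    """Return (installed version, module file path); each is None if not found."""
    module_name = module_name or package_name
    try:
        version = importlib_metadata.version(package_name)
    except importlib_metadata.PackageNotFoundError:
        version = None  # no installed distribution with this name
    spec = importlib_util.find_spec(module_name)
    origin = spec.origin if spec is not None else None
    return version, origin


version, origin = describe_install("ucimlrepo")
print("installed version:", version)
print("imported from:", origin)
```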

This issue occurs because the intro_paper metadata is empty. This is the metadata for Statlog (German Credit Data):

{
    "uci_id": 144,
    "name": "Statlog (German Credit Data)",
    "repository_url": "https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data",
    "data_url": "https://archive.ics.uci.edu/static/public/144/data.csv",
    "abstract": "This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix",
    "area": "Social Science",
    "tasks": [
        "Classification"
    ],
    "characteristics": [
        "Multivariate"
    ],
    "num_instances": 1000,
    "num_features": 20,
    "feature_types": [
        "Categorical",
        "Integer"
    ],
    "demographics": [
        "Other",
        "Marital Status",
        "Age",
        "Occupation"
    ],
    "target_col": [
        "class"
    ],
    "index_col": None,
    "has_missing_values": "no",
    "missing_values_symbol": None,
    "year_of_dataset_creation": 1994,
    "last_updated": "Thu Aug 10 2023",
    "dataset_doi": "10.24432/C5NC77",
    "creators": [
        "Hans Hofmann"
    ],
    "intro_paper": None,
    "variables": [
        {
            "name": "Attribute1",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Status of existing checking account",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute2",
            "role": "Feature",
            "type": "Integer",
            "demographic": None,
            "description": "Duration",
            "units": "months",
            "missing_values": "no"
        },
        {
            "name": "Attribute3",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Credit history",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute4",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Purpose",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute5",
            "role": "Feature",
            "type": "Integer",
            "demographic": None,
            "description": "Credit amount",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute6",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Savings account/bonds",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute7",
            "role": "Feature",
            "type": "Categorical",
            "demographic": "Other",
            "description": "Present employment since",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute8",
            "role": "Feature",
            "type": "Integer",
            "demographic": None,
            "description": "Installment rate in percentage of disposable income",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute9",
            "role": "Feature",
            "type": "Categorical",
            "demographic": "Marital Status",
            "description": "Personal status and sex",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute10",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Other debtors / guarantors",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute11",
            "role": "Feature",
            "type": "Integer",
            "demographic": None,
            "description": "Present residence since",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute12",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Property",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute13",
            "role": "Feature",
            "type": "Integer",
            "demographic": "Age",
            "description": "Age",
            "units": "years",
            "missing_values": "no"
        },
        {
            "name": "Attribute14",
            "role": "Feature",
            "type": "Categorical",
            "demographic": None,
            "description": "Other installment plans",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute15",
            "role": "Feature",
            "type": "Categorical",
            "demographic": "Other",
            "description": "Housing",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute16",
            "role": "Feature",
            "type": "Integer",
            "demographic": None,
            "description": "Number of existing credits at this bank",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute17",
            "role": "Feature",
            "type": "Categorical",
            "demographic": "Occupation",
            "description": "Job",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute18",
            "role": "Feature",
            "type": "Integer",
            "demographic": None,
            "description": "Number of people being liable to provide maintenance for",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute19",
            "role": "Feature",
            "type": "Binary",
            "demographic": None,
            "description": "Telephone",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "Attribute20",
            "role": "Feature",
            "type": "Binary",
            "demographic": "Other",
            "description": "foreign worker",
            "units": None,
            "missing_values": "no"
        },
        {
            "name": "class",
            "role": "Target",
            "type": "Binary",
            "demographic": None,
            "description": "1 = Good, 2 = Bad",
            "units": None,
            "missing_values": "no"
        }
    ],
    "additional_info": {
        "summary": "Two datasets are provided.  the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".   \r\n \r\nFor algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric".  This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables.   Several attributes that are ordered categorical (such as attribute 17) have been coded as integer.    This was the form used by StatLog.\r\n\r\nThis dataset requires use of a cost matrix (see below)\r\n\r\n ..... 1        2\r\n----------------------------\r\n  1   0        1\r\n-----------------------\r\n  2   5        0\r\n\r\n(1 = Good,  2 = Bad)\r\n\r\nThe rows represent the actual classification and the columns the predicted classification.\r\n\r\nIt is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).\r\n",
        "purpose": None,
        "funded_by": None,
        "instances_represent": None,
        "recommended_data_splits": None,
        "sensitive_data": None,
        "preprocessing_description": None,
        "variable_info": "Attribute 1:  (qualitative)      \r\n Status of existing checking account\r\n             A11 :      ... <    0 DM\r\n\t       A12 : 0 <= ... <  200 DM\r\n\t       A13 :      ... >= 200 DM / salary assignments for at least 1 year\r\n               A14 : no checking account\r\n\r\nAttribute 2:  (numerical)\r\n\t      Duration in month\r\n\r\nAttribute 3:  (qualitative)\r\n\t      Credit history\r\n\t      A30 : no credits taken/ all credits paid back duly\r\n              A31 : all credits at this bank paid back duly\r\n\t      A32 : existing credits paid back duly till now\r\n              A33 : delay in paying off in the past\r\n\t      A34 : critical account/  other credits existing (not at this bank)\r\n\r\nAttribute 4:  (qualitative)\r\n\t      Purpose\r\n\t      A40 : car (new)\r\n\t      A41 : car (used)\r\n\t      A42 : furniture/equipment\r\n\t      A43 : radio/television\r\n\t      A44 : domestic appliances\r\n\t      A45 : repairs\r\n\t      A46 : education\r\n\t      A47 : (vacation - does not exist?)\r\n\t      A48 : retraining\r\n\t      A49 : business\r\n\t      A410 : others\r\n\r\nAttribute 5:  (numerical)\r\n\t      Credit amount\r\n\r\nAttibute 6:  (qualitative)\r\n\t      Savings account/bonds\r\n\t      A61 :          ... <  100 DM\r\n\t      A62 :   100 <= ... <  500 DM\r\n\t      A63 :   500 <= ... < 1000 DM\r\n\t      A64 :          .. >= 1000 DM\r\n              A65 :   unknown/ no savings account\r\n\r\nAttribute 7:  (qualitative)\r\n\t      Present employment since\r\n\t      A71 : unemployed\r\n\t      A72 :       ... < 1 year\r\n\t      A73 : 1  <= ... < 4 years  \r\n\t      A74 : 4  <= ... < 7 years\r\n\t      A75 :       .. 
>= 7 years\r\n\r\nAttribute 8:  (numerical)\r\n\t      Installment rate in percentage of disposable income\r\n\r\nAttribute 9:  (qualitative)\r\n\t      Personal status and sex\r\n\t      A91 : male   : divorced/separated\r\n\t      A92 : female : divorced/separated/married\r\n              A93 : male   : single\r\n\t      A94 : male   : married/widowed\r\n\t      A95 : female : single\r\n\r\nAttribute 10: (qualitative)\r\n\t      Other debtors / guarantors\r\n\t      A101 : none\r\n\t      A102 : co-applicant\r\n\t      A103 : guarantor\r\n\r\nAttribute 11: (numerical)\r\n\t      Present residence since\r\n\r\nAttribute 12: (qualitative)\r\n\t      Property\r\n\t      A121 : real estate\r\n\t      A122 : if not A121 : building society savings agreement/ life insurance\r\n              A123 : if not A121/A122 : car or other, not in attribute 6\r\n\t      A124 : unknown / no property\r\n\r\nAttribute 13: (numerical)\r\n\t      Age in years\r\n\r\nAttribute 14: (qualitative)\r\n\t      Other installment plans \r\n\t      A141 : bank\r\n\t      A142 : stores\r\n\t      A143 : none\r\n\r\nAttribute 15: (qualitative)\r\n\t      Housing\r\n\t      A151 : rent\r\n\t      A152 : own\r\n\t      A153 : for free\r\n\r\nAttribute 16: (numerical)\r\n              Number of existing credits at this bank\r\n\r\nAttribute 17: (qualitative)\r\n\t      Job\r\n\t      A171 : unemployed/ unskilled  - non-resident\r\n\t      A172 : unskilled - resident\r\n\t      A173 : skilled employee / official\r\n\t      A174 : management/ self-employed/\r\n\t\t     highly qualified employee/ officer\r\n\r\nAttribute 18: (numerical)\r\n\t      Number of people being liable to provide maintenance for\r\n\r\nAttribute 19: (qualitative)\r\n\t      Telephone\r\n\t      A191 : none\r\n\t      A192 : yes, registered under the customers name\r\n\r\nAttribute 20: (qualitative)\r\n\t      foreign worker\r\n\t      A201 : yes\r\n\t      A202 : no\r\n",
        "citation": None
    }
}
ducknificient commented 11 months ago

> Similar issue for Iris and Adult. Found any workarounds?

There are 3 possible workarounds:

  1. The author updates the metadata (intro_paper, etc.)
  2. Add metadata type checking to handle None; examples are in #3 and #4
  3. Import the CSV directly using the URL

For number 3, the URLs look like this:

Iris, Adult, Statlog (German Credit Data)
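
Workaround 3 can be sketched with the standard library alone. The URL pattern below is an assumption inferred from the data_url fields in the metadata posted earlier in this thread (ids 1, 45, and 144); data_url_for, download_text, and split_features_targets are hypothetical helper names, and the two sample rows are made up so that the example runs offline.

```python
import csv
import io
import urllib.request


def data_url_for(uci_id):
    """Build the CSV URL, assuming the pattern seen in the metadata above."""
    return f"https://archive.ics.uci.edu/static/public/{uci_id}/data.csv"


def download_text(url):
    """Fetch a URL and decode it as UTF-8 (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def split_features_targets(csv_text, target_cols):
    """Parse CSV text into parallel lists of feature rows and target rows."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Move the target columns out of the row; what remains are features.
        targets.append({name: row.pop(name) for name in target_cols})
        features.append(row)
    return features, targets


# Offline demonstration with two made-up rows and a subset of Abalone columns.
sample = "Sex,Length,Rings\nM,0.5,10\nF,0.25,7\n"
X, y = split_features_targets(sample, ["Rings"])
print(X)
print(y)

# With network access (not run here):
# X, y = split_features_targets(download_text(data_url_for(1)), ["Rings"])
```

The csv module returns every value as a string, so numeric columns still need converting. pandas.read_csv also accepts a URL directly if DataFrames are preferred.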

ducknificient commented 11 months ago

Fixed in a3d12a3d9f283ac6192c30e1dd8beb27af7f10ae.