uci-ml-repo / ucimlrepo

Python package for dataset imports from UCI ML Repository
MIT License

add metadata type checking #3

Closed ducknificient closed 1 year ago

ducknificient commented 1 year ago

This commit adds type checking for the "intro_paper" and "additional_info" metadata fields, to handle datasets where either field is missing (null).

For example, Abalone (id=1) has no "intro_paper" metadata, which causes a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-56-948113056312> in <cell line: 4>()
      2 
      3 # fetch dataset
----> 4 abalone = fetch_ucirepo(id=1)
      5 
      6 # data (as pandas dataframes)

/usr/local/lib/python3.10/dist-packages/ucimlrepo/fetch.py in fetch_ucirepo(name, id)
    146     # make nested metadata fields accessible via dot notation
    147     metadata['additional_info'] = dotdict(metadata['additional_info'])
--> 148     metadata['intro_paper'] = dotdict(metadata['intro_paper'])
    149 
    150     # construct result object

TypeError: 'NoneType' object is not iterable
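
The failure comes from passing `None` to a dict constructor. Below is a minimal sketch of the kind of type check described above; it is not the code from the actual commit. The `dotdict` class here is a stand-in written for this example (assumed to behave like the package's helper), and `wrap_nested_metadata` is a hypothetical helper name.

```python
class dotdict(dict):
    """Stand-in helper (assumption): a dict whose keys are readable as attributes."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


def wrap_nested_metadata(metadata, keys=("additional_info", "intro_paper")):
    """Wrap nested metadata fields in dotdict, skipping fields that are not dicts.

    Hypothetical helper: a null (None) or missing field is left untouched
    instead of being passed to dotdict(), which would raise TypeError.
    """
    for key in keys:
        value = metadata.get(key)
        if isinstance(value, dict):
            metadata[key] = dotdict(value)
    return metadata


# Abalone-like metadata: "intro_paper" is null, "additional_info" is a dict
abalone_meta = {
    "uci_id": 1,
    "name": "Abalone",
    "intro_paper": None,
    "additional_info": {"summary": "Predicting the age of abalone", "citation": None},
}

wrap_nested_metadata(abalone_meta)
print(abalone_meta["intro_paper"])              # stays None, no exception
print(abalone_meta["additional_info"].summary)  # nested field via dot notation
```

With a check like this, callers still need to handle `intro_paper` being `None` before reading attributes such as `title` from it.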

Abalone Metadata

{
  "uci_id": 1,
  "name": "Abalone",
  "repository_url": "https://archive.ics.uci.edu/dataset/1/abalone",
  "data_url": "https://archive.ics.uci.edu/static/public/1/data.csv",
  "abstract": "Predict the age of abalone from physical measurements",
  "area": "Life Science",
  "tasks": ["Classification", "Regression"],
  "characteristics": ["Tabular"],
  "num_instances": 4177,
  "num_features": 8,
  "feature_types": ["Categorical", "Integer", "Real"],
  "demographics": [],
  "target_col": ["Rings"],
  "index_col": null,
  "has_missing_values": "no",
  "missing_values_symbol": null,
  "year_of_dataset_creation": 1994,
  "last_updated": "Mon Aug 28 2023",
  "dataset_doi": "10.24432/C55C7W",
  "creators": [
    "Warwick Nash",
    "Tracy Sellers",
    "Simon Talbot",
    "Andrew Cawthorn",
    "Wes Ford"
  ],
  "intro_paper": null,
  "variables": [
    {
      "name": "Sex",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": "M, F, and I (infant)",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "Length",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "Longest shell measurement",
      "units": "mm",
      "missing_values": "no"
    },
    {
      "name": "Diameter",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "perpendicular to length",
      "units": "mm",
      "missing_values": "no"
    },
    {
      "name": "Height",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "with meat in shell",
      "units": "mm",
      "missing_values": "no"
    },
    {
      "name": "Whole_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "whole abalone",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Shucked_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "weight of meat",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Viscera_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "gut weight (after bleeding)",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Shell_weight",
      "role": "Feature",
      "type": "Continuous",
      "demographic": null,
      "description": "after being dried",
      "units": "grams",
      "missing_values": "no"
    },
    {
      "name": "Rings",
      "role": "Target",
      "type": "Integer",
      "demographic": null,
      "description": "+1.5 gives the age in years",
      "units": null,
      "missing_values": "no"
    }
  ],
  "additional_info": {
    "summary": "Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task.  Other measurements, which are easier to obtain, are used to predict the age.  Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.\r\n\r\nFrom the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).",
    "purpose": null,
    "funded_by": null,
    "instances_represent": null,
    "recommended_data_splits": null,
    "sensitive_data": null,
    "preprocessing_description": null,
    "variable_info": "Given is the attribute name, attribute type, the measurement unit and a brief description.  The number of rings is the value to predict: either as a continuous value or as a classification problem.\r\n\r\nName / Data Type / Measurement Unit / Description\r\n-----------------------------\r\nSex / nominal / -- / M, F, and I (infant)\r\nLength / continuous / mm / Longest shell measurement\r\nDiameter\t/ continuous / mm / perpendicular to length\r\nHeight / continuous / mm / with meat in shell\r\nWhole weight / continuous / grams / whole abalone\r\nShucked weight / continuous\t / grams / weight of meat\r\nViscera weight / continuous / grams / gut weight (after bleeding)\r\nShell weight / continuous / grams / after being dried\r\nRings / integer / -- / +1.5 gives the age in years\r\n\r\nThe readme file contains attribute statistics.",
    "citation": null
  }
}

Heart Disease Metadata

{
  "uci_id": 45,
  "name": "Heart Disease",
  "repository_url": "https://archive.ics.uci.edu/dataset/45/heart+disease",
  "data_url": "https://archive.ics.uci.edu/static/public/45/data.csv",
  "abstract": "4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach",
  "area": "Life Science",
  "tasks": ["Classification"],
  "characteristics": ["Multivariate"],
  "num_instances": 303,
  "num_features": 13,
  "feature_types": ["Categorical", "Integer", "Real"],
  "demographics": ["Age", "Sex"],
  "target_col": ["num"],
  "index_col": null,
  "has_missing_values": "yes",
  "missing_values_symbol": "NaN",
  "year_of_dataset_creation": 1989,
  "last_updated": "Mon Aug 28 2023",
  "dataset_doi": "10.24432/C52P4X",
  "creators": [
    "Andras Janosi",
    "William Steinbrunn",
    "Matthias Pfisterer",
    "Robert Detrano"
  ],
  "intro_paper": {
    "title": "International application of a new probability algorithm for the diagnosis of coronary artery disease.",
    "authors": "R. Detrano, A. J\u00e1nosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher",
    "published_in": "American Journal of Cardiology",
    "year": 1989,
    "url": "https://www.semanticscholar.org/paper/a7d714f8f87bfc41351eb5ae1e5472f0ebbe0574",
    "doi": null
  },
  "variables": [
    {
      "name": "age",
      "role": "Feature",
      "type": "Integer",
      "demographic": "Age",
      "description": null,
      "units": "years",
      "missing_values": "no"
    },
    {
      "name": "sex",
      "role": "Feature",
      "type": "Categorical",
      "demographic": "Sex",
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "cp",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "trestbps",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "resting blood pressure (on admission to the hospital)",
      "units": "mm Hg",
      "missing_values": "no"
    },
    {
      "name": "chol",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "serum cholestoral",
      "units": "mg/dl",
      "missing_values": "no"
    },
    {
      "name": "fbs",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": "fasting blood sugar > 120 mg/dl",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "restecg",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "thalach",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "maximum heart rate achieved",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "exang",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": "exercise induced angina",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "oldpeak",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "ST depression induced by exercise relative to rest",
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "slope",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "no"
    },
    {
      "name": "ca",
      "role": "Feature",
      "type": "Integer",
      "demographic": null,
      "description": "number of major vessels (0-3) colored by flourosopy",
      "units": null,
      "missing_values": "yes"
    },
    {
      "name": "thal",
      "role": "Feature",
      "type": "Categorical",
      "demographic": null,
      "description": null,
      "units": null,
      "missing_values": "yes"
    },
    {
      "name": "num",
      "role": "Target",
      "type": "Integer",
      "demographic": null,
      "description": "diagnosis of heart disease",
      "units": null,
      "missing_values": "no"
    }
  ],
  "additional_info": {
    "summary": "This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.  In particular, the Cleveland database is the only one that has been used by ML researchers to date.  The \"goal\" field refers to the presence of heart disease in the patient.  It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).  \n   \nThe names and social security numbers of the patients were recently removed from the database, replaced with dummy values.\n\nOne file has been \"processed\", that one containing the Cleveland database.  All four unprocessed files also exist in this directory.\n\nTo see Test Costs (donated by Peter Turney), please see the folder \"Costs\" ",
    "purpose": null,
    "funded_by": null,
    "instances_represent": null,
    "recommended_data_splits": null,
    "sensitive_data": null,
    "preprocessing_description": null,
    "variable_info": "Only 14 attributes used:\r\n      1. #3  (age)       \r\n      2. #4  (sex)       \r\n      3. #9  (cp)        \r\n      4. #10 (trestbps)  \r\n      5. #12 (chol)      \r\n      6. #16 (fbs)       \r\n      7. #19 (restecg)   \r\n      8. #32 (thalach)   \r\n      9. #38 (exang)     \r\n      10. #40 (oldpeak)   \r\n      11. #41 (slope)     \r\n      12. #44 (ca)        \r\n      13. #51 (thal)      \r\n      14. #58 (num)       (the predicted attribute)\r\n\r\nComplete attribute documentation:\r\n      1 id: patient identification number\r\n      2 ccf: social security number (I replaced this with a dummy value of 0)\r\n      3 age: age in years\r\n      4 sex: sex (1 = male; 0 = female)\r\n      5 painloc: chest pain location (1 = substernal; 0 = otherwise)\r\n      6 painexer (1 = provoked by exertion; 0 = otherwise)\r\n      7 relrest (1 = relieved after rest; 0 = otherwise)\r\n      8 pncaden (sum of 5, 6, and 7)\r\n      9 cp: chest pain type\r\n        -- Value 1: typical angina\r\n        -- Value 2: atypical angina\r\n        -- Value 3: non-anginal pain\r\n        -- Value 4: asymptomatic\r\n     10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)\r\n     11 htn\r\n     12 chol: serum cholestoral in mg/dl\r\n     13 smoke: I believe this is 1 = yes; 0 = no (is or is not a smoker)\r\n     14 cigs (cigarettes per day)\r\n     15 years (number of years as a smoker)\r\n     16 fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)\r\n     17 dm (1 = history of diabetes; 0 = no such history)\r\n     18 famhist: family history of coronary artery disease (1 = yes; 0 = no)\r\n     19 restecg: resting electrocardiographic results\r\n        -- Value 0: normal\r\n        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)\r\n        -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria\r\n     20 ekgmo (month of exercise ECG reading)\r\n     21 ekgday(day of exercise ECG reading)\r\n     22 ekgyr (year of exercise ECG reading)\r\n     23 dig (digitalis used furing exercise ECG: 1 = yes; 0 = no)\r\n     24 prop (Beta blocker used during exercise ECG: 1 = yes; 0 = no)\r\n     25 nitr (nitrates used during exercise ECG: 1 = yes; 0 = no)\r\n     26 pro (calcium channel blocker used during exercise ECG: 1 = yes; 0 = no)\r\n     27 diuretic (diuretic used used during exercise ECG: 1 = yes; 0 = no)\r\n     28 proto: exercise protocol\r\n          1 = Bruce     \r\n          2 = Kottus\r\n          3 = McHenry\r\n          4 = fast Balke\r\n          5 = Balke\r\n          6 = Noughton \r\n          7 = bike 150 kpa min/min  (Not sure if \"kpa min/min\" is what was written!)\r\n          8 = bike 125 kpa min/min  \r\n          9 = bike 100 kpa min/min\r\n         10 = bike 75 kpa min/min\r\n         11 = bike 50 kpa min/min\r\n         12 = arm ergometer\r\n     29 thaldur: duration of exercise test in minutes\r\n     30 thaltime: time when ST measure depression was noted\r\n     31 met: mets achieved\r\n     32 thalach: maximum heart rate achieved\r\n     33 thalrest: resting heart rate\r\n     34 tpeakbps: peak exercise blood pressure (first of 2 parts)\r\n     35 tpeakbpd: peak exercise blood pressure (second of 2 parts)\r\n     36 dummy\r\n     37 trestbpd: resting blood pressure\r\n     38 exang: exercise induced angina (1 = yes; 0 = no)\r\n     39 xhypo: (1 = yes; 0 = no)\r\n     40 oldpeak = ST depression induced by exercise relative to rest\r\n     41 slope: the slope of the peak exercise ST segment\r\n        -- Value 1: upsloping\r\n        -- Value 2: flat\r\n        -- Value 3: downsloping\r\n     42 rldv5: height at rest\r\n     43 rldv5e: height at peak exercise\r\n     44 ca: number of major vessels (0-3) colored by flourosopy\r\n     45 restckm: irrelevant\r\n     46 exerckm: irrelevant\r\n     47 restef: rest raidonuclid (sp?) ejection fraction\r\n     48 restwm: rest wall (sp?) motion abnormality\r\n        0 = none\r\n        1 = mild or moderate\r\n        2 = moderate or severe\r\n        3 = akinesis or dyskmem (sp?)\r\n     49 exeref: exercise radinalid (sp?) ejection fraction\r\n     50 exerwm: exercise wall (sp?) motion \r\n     51 thal: 3 = normal; 6 = fixed defect; 7 = reversable defect\r\n     52 thalsev: not used\r\n     53 thalpul: not used\r\n     54 earlobe: not used\r\n     55 cmo: month of cardiac cath (sp?)  (perhaps \"call\")\r\n     56 cday: day of cardiac cath (sp?)\r\n     57 cyr: year of cardiac cath (sp?)\r\n     58 num: diagnosis of heart disease (angiographic disease status)\r\n        -- Value 0: < 50% diameter narrowing\r\n        -- Value 1: > 50% diameter narrowing\r\n        (in any major vessel: attributes 59 through 68 are vessels)\r\n     59 lmt\r\n     60 ladprox\r\n     61 laddist\r\n     62 diag\r\n     63 cxmain\r\n     64 ramus\r\n     65 om1\r\n     66 om2\r\n     67 rcaprox\r\n     68 rcadist\r\n     69 lvx1: not used\r\n     70 lvx2: not used\r\n     71 lvx3: not used\r\n     72 lvx4: not used\r\n     73 lvf: not used\r\n     74 cathef: not used\r\n     75 junk: not used\r\n     76 name: last name of patient  (I replaced this with the dummy string \"name\")",
    "citation": null
  }
}
ducknificient commented 1 year ago

Fixed in commit a3d12a3d9f283ac6192c30e1dd8beb27af7f10ae.