Can't get it working like in Quick-Start

clausnizer-ondics commented 3 years ago

igel version: 0.3.1 (latest pip)
Python version: 3.8.5
Operating System: docker on top of Ubuntu 16.04.6 LTS (4.4.0)

Description

I'm very new to ML and don't yet know how to use igel, so I followed the Quick-Start demo to get an idea.

Installed Igel
Downloaded the archive.zip from https://www.kaggle.com/uciml/pima-indians-diabetes-database and put diabetes.csv in working Folder
Followed Quick-Start

Resulted in this igel.yaml:

dataset:
  preprocess:
    missing_values: mean
    scale:
      method: standard
      target: inputs
  split:
    shuffle: true
    test_size: 0.1
  type: csv
# model definition
model:
    # in the type field, you can write the type of problem you want to solve. Whether regression, classification or clustering
    # Then, provide the algorithm you want to use on the data. Here I'm using the random forest algorithm
    type: classification
    algorithm: RandomForest     # make sure you write the name of the algorithm in pascal case
    arguments:
        n_estimators: 100   # here, I set the number of estimators (or trees) to 100
        max_depth: 30       # set the max_depth of the tree

# target you want to predict
# Here, as an example, I'm using the famous indians-diabetes dataset, where I want to predict whether someone have diabetes or not.
# Depending on your data, you need to provide the target(s) you want to predict here
target:
    - sick

What I Did

... with a big question mark above my head:

$ igel fit -dp 'diabetes.csv' -yml 'igel.yaml' 

         _____          _       _
        |_   _| __ __ _(_)_ __ (_)_ __   __ _
          | || '__/ _` | | '_ \| | '_ \ / _` |
          | || | | (_| | | | | | | | | | (_| |
          |_||_|  \__,_|_|_| |_|_|_| |_|\__, |
                                        |___/

INFO - Entered CLI args: {'data_path': 'diabetes.csv', 'yaml_path': 'igel.yaml', 'cmd': 'fit'}
INFO - Executing command: fit ...
INFO - reading data from diabetes.csv
INFO - You passed the configurations as a yaml file.
INFO - your chosen configuration: {'dataset': {'preprocess': {'missing_values': 'mean', 'scale': {'method': 'standard', 'target': 'inputs'}}, 'split': {'shuffle': True, 'test_size': 0.1}, 'type': 'csv'}, 'model': {'type': 'classification', 'algorithm': 'RandomForest', 'arguments': {'n_estimators': 100, 'max_depth': 30}}, 'target': ['sick']}
INFO - dataset_props: {'preprocess': {'missing_values': 'mean', 'scale': {'method': 'standard', 'target': 'inputs'}}, 'split': {'shuffle': True, 'test_size': 0.1}, 'type': 'csv'} 
model_props: {'type': 'classification', 'algorithm': 'RandomForest', 'arguments': {'n_estimators': 100, 'max_depth': 30}} 
 target: ['sick'] 

INFO - dataset shape: (768, 9)
INFO - dataset attributes: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
INFO - Check for missing values in the dataset ...  
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64  
 ----------------------------------------------------------------------------------------------------
INFO - shape of the dataset after handling missing values => (768, 9)
ERROR - error occured while preparing the data: ('chosen target(s) to predict must exist in the dataset',)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 245, in _process_data
    raise Exception("chosen target(s) to predict must exist in the dataset")
Exception: chosen target(s) to predict must exist in the dataset
Traceback (most recent call last):
  File "/opt/conda/bin/igel", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 508, in main
    CLI()
  File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 166, in __init__
    getattr(self, self.cmd.command)()
  File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 297, in fit
    Igel(**self.dict_args)
  File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 102, in __init__
    getattr(self, self.command)()
  File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 336, in fit
    x_train, y_train, x_test, y_test = self._prepare_fit_data()
TypeError: cannot unpack non-iterable NoneType object

If I understand correctly, igel wants a column named sick in diabetes.csv. So there is a missing link, and I have no idea how to close it.

Can you provide test data, maybe as part of this repo, so I can get something working? Or help me find the missing part?

Please help

red-eyed-tree-frog commented 3 years ago

If I infer correctly, you must first use some heuristic to categorise your data into sick / not sick. Maybe add a column called sick where 0 = not sick and 1 = sick? As a simplified example, your heuristic could be: if age > x and blood pressure > y, then sick.

clausnizer-ondics commented 3 years ago

Thanks for your help. If this gives me a working example, that would be fine! :-) I know how to add a sick column to diabetes.csv, but I have no idea where to put the heuristic stuff. ;-)

Can you provide a step by step guide on how to do this?

red-eyed-tree-frog commented 3 years ago

Actually, I think the column you are looking for is 'Outcome', not 'sick' (note the capital O, as listed in the dataset attributes in your log). Use that as your target instead.

nidhaloff commented 3 years ago

@Anenizer Hi, the data that I used in the docs can be found in this repo. Please go to the examples folder and then check the datasets under the data folder. Or simply click here.

Now, coming to your issue. Notice that there are multiple versions of the famous Indian-diabetes dataset. If you visit Kaggle, you will find many versions of it, each with different attribute/feature names. The one I'm using here has an attribute called sick, which indicates whether a patient is sick or not (0 means not sick and 1 means sick). The trick is that if you are using a dataset with other attribute names, you have to provide the name of what you want to predict in the target field inside the .yaml file. Simply put, if the attribute in your dataset is named, let's say, "patient-status" instead of sick, then you have to provide:

target:
     - patient-status

in your .yaml file. This way igel will recognize that you want to predict the patient-status from your dataset. Hope this was helpful ;)
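
If you are unsure which names are available, one option is to read the header row of the CSV and compare it with the names in the target field. The following is only a minimal sketch using the Python standard library; `missing_targets` is a hypothetical helper written for illustration and is not part of igel. The header string is the one shown in the "dataset attributes" log line above.

```python
import csv
import io


def missing_targets(csv_file, targets):
    """Return the target names that are not column headers in the CSV.

    Hypothetical helper for illustration only; not an igel function.
    """
    header = next(csv.reader(csv_file))           # first row = column names
    columns = {name.strip() for name in header}   # matching is case-sensitive
    return [t for t in targets if t not in columns]


# Header of the Kaggle diabetes.csv, as listed in the log above.
# For a real file, use open("diabetes.csv", newline="") instead.
header_line = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
)

print(missing_targets(io.StringIO(header_line), ["sick"]))     # ['sick']
print(missing_targets(io.StringIO(header_line), ["Outcome"]))  # []
```

An empty list means every name in the target field exists as a column. For the Kaggle file in this issue, that is the case for `Outcome` but not for `sick`.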

nidhaloff commented 3 years ago

@Anenizer does this answer your question? If not, feel free to re-open the issue, or create a new one if you have other questions.

clausnizer-ondics commented 3 years ago

Yes, I now have a working example, this was my goal. Thank you very much!

nidhaloff / igel

Can't get it working like in Quick-Start #54
