nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0

IndexError: indices are out-of-bounds #10

Open mbeyeler opened 4 years ago

mbeyeler commented 4 years ago

Hi Nick,

Great package!

I just ran into an IndexError when the DataFrame index values are not from a RangeIndex. I would imagine this to happen quite often if the user passes in training data from a shuffled train-test split.

Code to reproduce the error:

import pandas as pd
import smogn
housing = pd.read_csv('https://raw.githubusercontent.com/nickkunz/smogn/master/data/housing.csv')
smogn.smoter(housing[housing.index > 10], 'SalePrice')

smogn.smoter(housing[housing.index > 10].reset_index(), 'SalePrice') fixes it, but is not necessarily desirable because I would like (need) to preserve the original index.

Best, Michael

nickkunz commented 4 years ago

@mbeyeler Hello and thank you for raising this issue. It is an important use case, especially when the data has been split into training and test sets. I have made a note to address it in future builds!

sherryxiaa commented 4 years ago

I get the "IndexError: list assignment index out of range" error, but resetting the index did not seem to solve the issue.

BrutishGuy commented 4 years ago

I am having the same issue as @sherryxiaa. It seems to work with the housing dataset provided in the examples, but not on my own dataframes. I have tried calling reset_index(), but this does not fix it. I can attempt to reproduce this using the housing dataset so that you can investigate too.

jesperbruunhansen commented 4 years ago

I was having the same error, but after I did a df.reset_index(drop=True) the error went away. Does this help you?

Ex:

df_smogn = smogn.smoter(data=df.reset_index(drop=True), y="my_y_col")

kevalshah90 commented 2 years ago

I am running into this issue as well. My index is RangeIndex(start=0, stop=1857, step=1)

I tried the approaches in this thread but none of them worked for me.

df_ml_smogn = smogn.smoter(

    ## main arguments
    data = df_ml.reset_index(drop=True),  ## pandas dataframe
    y = 'Revenue_per_sqft_month',         ## string ('header name')
    k = 7,                                ## positive integer (k < n)
    samp_method = 'extreme',              ## string ('balance' or 'extreme')
    drop_na_col = True,
    drop_na_row = True,

    ## phi relevance arguments
    replace = True,                       ## sampling with replacement
    rel_thres = 0.75,                     ## positive real number (0 < R < 1)
    rel_method = 'auto',                  ## string ('auto' or 'manual')
    rel_xtrm_type = 'high',               ## string ('low' or 'both' or 'high')
    rel_coef = 2.25                       ## positive real number (0 < R)

)

@mbeyeler @jesperbruunhansen @BrutishGuy @sherryxiaa were you able to fix this issue?

kevalshah90 commented 2 years ago

@nickkunz any thoughts on this?

jellis-ventiv commented 2 years ago

I have encountered two types of indexing errors: an out-of-bounds error, which I was able to fix using reset_index(), and a second error, "index out of range", as reported in a couple of the posts above. I have not been able to work around the second one. I encounter it here:

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\smogn\smoter.py:240, in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
    234 ## over-sampling
    235 if s_perc[i] > 1:
    236
    237     ## generate synthetic observations in training set
    238     ## considered 'minority'
    239     ## (see 'over_sampling()' function for details)
--> 240     synth_obs = over_sampling(
    241         data = data,
    242         index = list(b_index[i].index),
    243         perc = s_perc[i],
    244         pert = pert,
    245         k = k
    246     )
    248     ## concatenate over-sampling
...
     75 ## distance equals 1 for values that are not equal
     76 else:
     77     dist[i] = 1

IndexError: list assignment index out of range

Any thoughts on how to fix/work around it are appreciated.

knlpscience commented 8 months ago

I also encountered the same issue, and upon checking, it seems that the problem lies in the dist_metrics.py file. In the heom_dist function, the dist list is initialized with [None] * d_num, which causes an IndexError: list assignment index out of range if the dataset has more categorical variables than numerical ones. Therefore, when initializing dist, both d_num and d_nom should be considered. For those in a hurry, there is a fix in pull request #48 that someone has made, which could be useful to check out.
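
The failure mode described above can be reproduced with a small standalone sketch. This is not smogn's actual source and not the contents of the pull request. The function names `heom_like_buggy` and `heom_like_fixed` are hypothetical, and a plain squared difference stands in for the real numeric distance term. It only illustrates why a list sized by the numeric feature count overflows when there are more nominal features, and one way the sizing might be corrected.

```python
def heom_like_buggy(a_num, b_num, a_nom, b_nom):
    """Simplified illustration of the reported failure mode (hypothetical helper)."""
    d_num, d_nom = len(a_num), len(a_nom)

    # The list is sized for the numeric features only.
    dist = [None] * d_num

    # Numeric part: a squared difference stands in for the real distance term.
    for i in range(d_num):
        dist[i] = (a_num[i] - b_num[i]) ** 2

    # Nominal part: reuses positions 0..d_nom-1, so it overwrites numeric
    # terms and raises IndexError as soon as i >= d_num.
    for i in range(d_nom):
        dist[i] = 0 if a_nom[i] == b_nom[i] else 1

    return sum(dist)


def heom_like_fixed(a_num, b_num, a_nom, b_nom):
    """Same sketch with the list sized for both feature types (an assumed fix)."""
    d_num, d_nom = len(a_num), len(a_nom)

    # One slot per numeric feature plus one per nominal feature.
    dist = [None] * (d_num + d_nom)

    for i in range(d_num):
        dist[i] = (a_num[i] - b_num[i]) ** 2

    # Nominal terms are written after the numeric ones instead of over them.
    for i in range(d_nom):
        dist[d_num + i] = 0 if a_nom[i] == b_nom[i] else 1

    return sum(dist)
```

With one numeric and two nominal features, `heom_like_buggy([1.0], [3.0], ["a", "b"], ["a", "c"])` raises `IndexError: list assignment index out of range`, while `heom_like_fixed` with the same arguments returns a value. This would also explain why resetting the DataFrame index does not help with this second error: under this reading, it depends on the mix of column types, not on the index.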