Open mbeyeler opened 4 years ago
@mbeyeler Hello and thank you for raising this issue. It is an important use case, especially in the scenario where the data is train and test set split. I have made a note to address it for future builds!
I have the "IndexError: list assignment index out of range" error, but resetting the index did not seem to solve the issue.
I am having the same issue as @sherryxiaa. It seems to work with the housing dataset provided in the examples, but not on my own dataframes. I have tried a reset_index() operation, but this does not fix it. I can attempt to reproduce this using the housing dataset so that you too can investigate.
I was having the same error, but after I did a df.reset_index(drop=True) the error went away. Does this help you?
Ex:
df_smogn = smogn.smoter(data=df.reset_index(drop=True), y="my_y_col")
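For anyone wondering what that call changes, here is a minimal pandas-only sketch. The toy frame and the column name "x" are made up for illustration, and smogn itself is not called. It only shows that a shuffled subset carries non-contiguous labels and that reset_index(drop=True) replaces them with a RangeIndex without adding an extra column.

```python
import pandas as pd

# Toy frame standing in for a real dataset (column names are hypothetical).
df = pd.DataFrame({"x": range(20), "my_y_col": [v * 1.5 for v in range(20)]})

# A shuffled subset, similar to what a train/test split produces:
# the labels are a scrambled selection of the original 0..19.
train = df.sample(frac=0.5, random_state=0)
print(train.index.tolist())

# drop=True discards the old labels instead of inserting them as a column.
train_reset = train.reset_index(drop=True)
print(train_reset.index)
print(train_reset.columns.tolist())
```

As later comments in this thread show, this only addresses the label-related error; a frame that already has a RangeIndex can still fail for a different reason.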
I am running into this issue as well. My index is RangeIndex(start=0, stop=1857, step=1)
I tried the approaches in this thread but none of them worked for me.
df_ml_smogn = smogn.smoter(
    ## main arguments
    data = df_ml.reset_index(drop=True),  ## pandas dataframe
    y = 'Revenue_per_sqft_month',         ## string ('header name')
    k = 7,                                ## positive integer (k < n)
    samp_method = 'extreme',              ## string ('balance' or 'extreme')
    drop_na_col = True,
    drop_na_row = True,
    ## phi relevance arguments
    replace = True,                       ## sampling with replacement
    rel_thres = 0.75,                     ## positive real number (0 < R < 1)
    rel_method = 'auto',                  ## string ('auto' or 'manual')
    rel_xtrm_type = 'high',               ## string ('low' or 'both' or 'high')
    rel_coef = 2.25                       ## positive real number (0 < R)
)
@mbeyeler @jesperbruunhansen @BrutishGuy @sherryxiaa were you able to fix this issue?
@nickkunz any thoughts on this?
I have encountered two types of indexing errors: an out-of-bounds error, which I was able to fix using reset_index(), and a "list assignment index out of range" error, as reported in a couple of the posts above. I have not been able to work around the second one. I encounter it here:
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\smogn\smoter.py:240, in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
234 ## over-sampling
235 if s_perc[i] > 1:
236
237 ## generate synthetic observations in training set
238 ## considered 'minority'
239 ## (see 'over_sampling()' function for details)
--> 240 synth_obs = over_sampling(
241 data = data,
242 index = list(b_index[i].index),
243 perc = s_perc[i],
244 pert = pert,
245 k = k
246 )
248 ## concatenate over-sampling
...
75 ## distance equals 1 for values that are not equal
76 else:
77 dist[i] = 1
IndexError: list assignment index out of range
Any thoughts on how to fix/work around it are appreciated.
I also encountered the same issue, and upon checking, it seems that the problem lies in the dist_metrics.py file. In the heom_dist function, the dist list is initialized with [None] * d_num, which causes an IndexError: list assignment index out of range if the dataset has more categorical variables than numerical ones. Therefore, when initializing dist, both d_num and d_nom should be considered. For those in a hurry, there is a fix in pull request #48 that someone has made, which could be useful to check out.
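To make the mechanism concrete, here is a small pure-Python sketch. It is not smogn's actual heom_dist code: the function name nominal_distances, the size_for_both flag, and the loop structure are assumptions chosen only to reproduce the symptom described above, where a list sized for d_num entries is written to once per nominal feature.

```python
def nominal_distances(d_num, d_nom, size_for_both):
    """Toy stand-in (hypothetical): write one distance per nominal feature."""
    # Sizing the list by d_num alone mirrors the reported initialization;
    # sizing it by d_num + d_nom mirrors the suggested direction of the fix.
    size = d_num + d_nom if size_for_both else d_num
    dist = [None] * size
    for i in range(d_nom):
        dist[i] = 1  # same kind of assignment as in the traceback
    return dist

# More nominal than numeric features: the undersized list overflows.
try:
    nominal_distances(d_num=1, d_nom=3, size_for_both=False)
except IndexError as err:
    error_message = str(err)
    print(error_message)

# Fewer nominal than numeric features: the undersized list happens to be enough.
print(nominal_distances(d_num=3, d_nom=1, size_for_both=False))

# Sizing for both feature types avoids the overflow in either case.
print(len(nominal_distances(d_num=1, d_nom=3, size_for_both=True)))
```

This would also explain why the housing example works while other dataframes fail: in this sketch the error depends on the mix of column types, not on the index.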
Hi Nick,
Great package!
I just ran into an IndexError when the DataFrame index values are not from a RangeIndex. I would imagine this to happen quite often if the user passes in training data from a shuffled train-test split.
Code to reproduce the error:
smogn.smoter(housing[housing.index > 10].reset_index(), 'SalePrice')
fixes it, but is not necessarily desirable because I would like (need) to preserve the original index.
Best, Michael
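On preserving the original index, a pandas-only sketch of one possible approach follows. The toy housing frame and the LotArea column are stand-ins, not the real dataset, and smogn is not called. It shows that reset_index() without drop=True moves the old labels into a column named "index", and that the labels can instead be stashed separately and reattached to a frame that still has the same rows in the same order.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up).
housing = pd.DataFrame({"LotArea": range(100, 120), "SalePrice": range(200, 220)})
subset = housing[housing.index > 10]

# reset_index() keeps the old labels, but as an ordinary column named "index".
with_column = subset.reset_index()
print(with_column.columns.tolist())

# Alternative: stash the labels, then drop them from the frame.
original_labels = subset.index
clean = subset.reset_index(drop=True)

# Reattach the labels later; only valid while rows and their order are unchanged.
restored = clean.copy()
restored.index = original_labels
print(restored.index.tolist())
```

Note that the extra "index" column in the first variant would be passed along with the other columns, and that reattaching labels only makes sense for the original rows, since resampled output contains synthetic rows with no original label.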