minimaxir / automl-gs

Provide an input CSV and a target field to predict, generate a model + code to run it.
MIT License

bin edges must be unique #23

Open GriffinRidgeback opened 5 years ago

GriffinRidgeback commented 5 years ago

Hello - I am trying to use this package to provide predictions for my Data Science Capstone project. When I run against my training data, I get the following exception/error:

Traceback (most recent call last):
  File "model.py", line 63, in <module>
    model_train(df, encoders, args, model)
  File "C:\Users\deliak\Documents\Jupyter Notebooks\edX\DAT102x -Microsoft Professional Capstone Data Science\automl_train\pipeline.py", line 903, in model_train
    X, y = process_data(df, encoders)
  File "C:\Users\deliak\Documents\Jupyter Notebooks\edX\DAT102x -Microsoft Professional Capstone Data Science\automl_train\pipeline.py", line 758, in process_data
    df['msa_md'].values, encoders['msa_md_bins'], labels=False, include_lowest=True)
  File "C:\Users\deliak\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\tile.py", line 234, in cut
    duplicates=duplicates)
  File "C:\Users\deliak\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\tile.py", line 332, in _bins_to_cuts
    "the 'duplicates' kwarg".format(bins=bins))
ValueError: Bin edges must be unique: array([ -1.,  -1.,  18.,  63., 118., 192., 247., 305., 329., 371., 408.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

Traceback (most recent call last):
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\deliak\AppData\Local\Continuum\anaconda3\Scripts\automl_gs.exe\__main__.py", line 9, in <module>
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\automl_gs\automl_gs.py", line 175, in cmd
    tpu_address=args.tpu_address)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\automl_gs\automl_gs.py", line 87, in automl_grid_search
    "metadata", "results.csv"))
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, kwds)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "c:\users\deliak\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'automl_train\metadata\results.csv' does not exist

tresoldi commented 5 years ago

I am running into the same issue. The edges problem can be solved by instructing pandas to drop duplicates (add the argument duplicates="drop" to the pd.cut call in templates/processors/numeric), but of course that probably means the problem is in the data itself.

Not sure what the developers could do to automate this case -- maybe call sklearn's Imputer, or (in my case) just fill the NAs?
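
Concretely, a minimal sketch of that fix, reusing the duplicate edges from the traceback above (the sample values are illustrative, and this is not the generated pipeline code):

import numpy as np
import pandas as pd

values = np.array([5, 20, 100, 250, 400])
bins = [-1., -1., 18., 63., 118., 192., 247., 305., 329., 371., 408.]

# With the default duplicates="raise", this reproduces the error:
# pd.cut(values, bins=bins, labels=False, include_lowest=True)  # ValueError
codes = pd.cut(values, bins=bins, labels=False, include_lowest=True, duplicates="drop")
# The repeated -1. edge is collapsed, so the result has one fewer bin.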

GriffinRidgeback commented 5 years ago

Thank you! I will give that a try. I did call fillna() on the dataframe before passing the CSV to the tool; I guess that wasn't enough.
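
For what it's worth, a sketch of that kind of pre-fill (the column name comes from the traceback above; the fill strategy is illustrative). One caveat: filling NAs with a constant sentinel such as -1 can itself produce repeated quantile edges once many rows share that value, which would match the duplicated -1. at the start of the bins array in the error.

import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical input file
df["msa_md"] = df["msa_md"].fillna(df["msa_md"].median())  # median rather than a sentinel
df.to_csv("train_filled.csv", index=False)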

GriffinRidgeback commented 5 years ago

Well, that fixed it, but now I get this error:

ValueError: Error when checking input: expected input_loan_type to have shape (1,) but got array with shape (2,)

When I check this attribute, I get this:

train_data.loan_type.unique()

array([3, 1, 2, 4], dtype=int64)

Should I open a separate ticket for this?

And thank you for getting me a little bit further
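
For reference, this Keras error means the array fed to the named input is wider per row than the input layer declares. A hypothetical minimal reproduction (not automl-gs's generated code; the exact error wording varies by Keras version):

import numpy as np
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(1,), name="input_loan_type")  # expects one value per row
out = layers.Dense(1)(inp)
model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")

x = np.array([[3, 1], [1, 2]])  # two values per row -> shape mismatch
y = np.array([0.0, 1.0])
model.fit(x, y)  # ValueError: ... expected input_loan_type to have shape (1,) ...

That would suggest the encoder for loan_type is emitting more columns per row than the generated model's input expects.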

avinregmi commented 5 years ago

I'm having the same issue. Did you solve it?

GriffinRidgeback commented 5 years ago

I did not. I used the xgboost algorithm instead. That ran to completion, but I didn't get the output I expected: I thought I would get 1s and 0s but got probabilities instead, which wasn't acceptable for what I had to submit for my course project.
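
(For anyone who hits the same thing: predicted class probabilities can be reduced to hard 0/1 labels with a threshold; the 0.5 below is just the conventional default, not anything automl-gs prescribes.)

import numpy as np

probs = np.array([0.91, 0.08, 0.55])  # predicted P(class == 1)
labels = (probs >= 0.5).astype(int)   # array([1, 0, 1])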

Good luck!

germanjoey commented 5 years ago

@avinregmi Sounds similar to my problem here: #25.

gagandeep44489 commented 1 month ago

Possible Causes and Solutions

Duplicate Values in Data:

Cause: If the data you're binning contains duplicate values, and these duplicates coincide with the bin edges, it can cause this error.
Solution: Clean your data to remove or handle duplicates before binning. You can use pandas to drop duplicates, or adjust your bin edges slightly to avoid coinciding with duplicate values.

Bin Edges Overlap or Are Too Close:

Cause: If your bin edges are very close to each other, floating-point precision errors might cause them to be treated as non-unique.
Solution: Increase the distance between bin edges or use a smaller number of bins.

Incorrect Bin Edge Calculation:

Cause: If you're manually calculating bin edges and there's a mistake in the logic, it can result in duplicate edges.
Solution: Double-check the logic used to generate bin edges. Use functions like numpy.linspace() to ensure evenly spaced bin edges without duplicates.

Floating-Point Precision Issues:

Cause: When bin edges are calculated using floating-point arithmetic, very small differences might not be distinguishable, leading to apparent duplicates.
Solution: Round your bin edges to a certain number of decimal places, or use integer-based binning if applicable.

Example Solution:
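
A minimal sketch along the lines of the solutions above (the data and the four-bin choice are illustrative):

import numpy as np
import pandas as pd

data = pd.Series([1, 1, 1, 2, 3, 100])

# Quantile edges on skewed data often repeat; let pandas drop the duplicates.
codes = pd.qcut(data, q=4, labels=False, duplicates="drop")

# Or build guaranteed-unique, evenly spaced edges with numpy.linspace().
edges = np.linspace(data.min(), data.max(), num=5)
codes_even = pd.cut(data, bins=edges, labels=False, include_lowest=True)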