microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.
MIT License
3.32k stars 274 forks

[sklearn] OneHotEncoder doesn't work correctly #684

Open faterazer opened 1 year ago

faterazer commented 1 year ago

Hello, I found this project last week. Thank you for all of this work.

I installed hummingbird-ml==0.4.7 with pip, and I would like to know which version of sklearn I should use.

I want to use sklearn's OneHotEncoder to preprocess my categorical features, but the output dimension from sklearn differs from that of the converted PyTorch model. With sklearn, 15 features become 69 dimensions; with the converted PyTorch model, 15 features become 76 dimensions.

After checking, I'm sure the problem is the handle_unknown argument of sklearn's OneHotEncoder. The documentation notes:

Changed in version 1.1: 'infrequent_if_exist' was added to automatically handle unknown categories and infrequent categories.

Is there any way to solve this problem? Thanks for any solution!

ksaur commented 1 year ago

Hi @faterazer, thanks for reaching out! We use whatever the most current version of SKL is, so right now 1.2.1.

Was your model trained on the same version of scikit-learn that you're trying to use Hummingbird with? Just trying to make sure it's not a simple fix. (Lots of times, users have issues if the model is trained with an older version of SKL and then they call Hummingbird on a saved model.)

Can you post a little bit of your code so we can take a look? Maybe we need to add the new field.

faterazer commented 1 year ago

Hi, I really appreciate your suggestions. I read your reply and went back through my steps. Unfortunately, the problem still exists. I think more details will make it easier for you to locate the problem, so I am posting my code and test data, all of which are in test.zip. Here is my processing flow:

1. In test.zip, I constructed some test data. All fifteen columns are categorical features. I saved the data as test/test.csv.
2. For certain reasons, I need to work across two conda environments. First, I use a conda environment that includes Python 3.10 and scikit-learn 1.2.1, and does not include hummingbird-ml. I construct a sklearn OneHotEncoder and fit it on the test data. Finally, I save the encoder/pipeline to a binary file with pickle. You can find the code in test/A.py.
3. Then I use another conda environment, which includes Python 3.8, scikit-learn 1.2.1, and hummingbird-ml 0.4.7. I load my sklearn preprocessor from the binary file with pickle and use hummingbird-ml to convert it. Finally, I compare the outputs from sklearn and hummingbird-ml, and the shapes are different. You can find the code in test/B.py.
4. I found that if I change line 16 of test/A.py from OneHotEncoder(sparse_output=False, handle_unknown="infrequent_if_exist", min_frequency=0.005) to OneHotEncoder(sparse_output=False, handle_unknown="ignore"), everything works. According to the sklearn changelog, this new handle_unknown option has been available since version 1.1. It is the option I would like to use, but it causes the problem.

Could you look over my steps and code? Did I make a mistake somewhere? Or is there a way to fix the problem? I appreciate your time and effort.

Thanks again for all your work on hummingbird-ml. It's an awesome project, and I hope I can keep using it.

Yours sincerely, faterazer



ksaur commented 1 year ago

Hello! I think that the attachment (test.zip) got dropped. If it's easier, you could check them into a fork in github and put a link!

faterazer commented 1 year ago

test.zip

How about this time? I replied directly through GitHub.

ksaur commented 1 year ago

Thank you for your in-depth example with details! I was able to reproduce everything you said.

Yes, it looks like we need to add this feature to the list of supported options (and we should at least be raising an error for the ones we don't support). We'll add that to the queue!
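As a rough illustration of the "raise an error for unsupported options" idea, here is a minimal sketch of a pre-conversion check. Everything in it is hypothetical: the function names, the supported set, and the stand-in encoder object are assumptions for the example and are not Hummingbird's actual code or API. The only facts taken from this thread are that handle_unknown="ignore" converts correctly and that handle_unknown="infrequent_if_exist" with min_frequency does not.

```python
from types import SimpleNamespace

# Assumption: the handle_unknown values a converter supports.
# This is NOT Hummingbird's actual list.
SUPPORTED_HANDLE_UNKNOWN = {"error", "ignore"}


def unsupported_options(encoder):
    """Return a description of each encoder setting outside the supported set."""
    problems = []
    handle_unknown = getattr(encoder, "handle_unknown", "error")
    if handle_unknown not in SUPPORTED_HANDLE_UNKNOWN:
        problems.append(f"handle_unknown={handle_unknown!r}")
    # min_frequency and max_categories both enable infrequent-category grouping.
    for name in ("min_frequency", "max_categories"):
        value = getattr(encoder, name, None)
        if value is not None:
            problems.append(f"{name}={value!r}")
    return problems


def check_supported(encoder):
    """Fail loudly instead of converting to a model with the wrong output width."""
    problems = unsupported_options(encoder)
    if problems:
        raise NotImplementedError(
            "OneHotEncoder options not supported: " + ", ".join(problems)
        )


# Stand-in for a fitted sklearn OneHotEncoder with the settings from the report.
reported = SimpleNamespace(
    handle_unknown="infrequent_if_exist", min_frequency=0.005, max_categories=None
)
working = SimpleNamespace(
    handle_unknown="ignore", min_frequency=None, max_categories=None
)

print(unsupported_options(reported))
print(unsupported_options(working))
```

Because the check only reads attributes, the same functions could be called on a real fitted OneHotEncoder before conversion. Until the option is supported, a user-side check like this would turn a silent shape mismatch into an explicit error.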

faterazer commented 1 year ago

I'm so glad my example helped. I really hope the problem can be solved in the near future. Thanks for your efforts. 🙂

