vzhou842 / profanity-check

A fast, robust Python library to check for offensive language in strings.
https://pypi.org/project/profanity-check
MIT License
612 stars 113 forks

accuracy degraded when using latest sklearn version #9

Open bobbui opened 4 years ago

bobbui commented 4 years ago

I retrained using train.py and exactly the same training data file. Running the accuracy test side by side on the model generated with sklearn 0.22 and the existing one (sklearn 0.20), the 0.22 model performs significantly worse than the 0.20 one. Any idea why this happens? Thanks.

dimitrismistriotis commented 4 years ago

Can confirm that this is what happens: https://github.com/dimitrismistriotis/profanity-check/blob/create_models_from_clean_data/profanity_check/train_models.py

with the following pytest output:

================================================================ test session starts ================================================================
platform linux -- Python 3.8.0, pytest-5.4.1, py-1.8.1, pluggy-0.13.1
rootdir: /home/dimitry/projects/profanity-check
collected 2 items                                                                                                                                   

tests/test_profanity_check.py F.                                                                                                              [100%]

===================================================================== FAILURES ======================================================================
___________________________________________________________________ test_accuracy ___________________________________________________________________

    def test_accuracy():
      texts = [
        'Hello there, how are you',
        'Lorem Ipsum is simply dummy text of the printing and typesetting industry.',
        '!!!! Click this now!!! -> https://example.com',
        'fuck you',
        'fUcK u',
        'GO TO hElL, you dirty scum',
      ]
>     assert list(predict(texts)) == [0, 0, 0, 1, 1, 1]
E     assert [0, 0, 0, 0, 0, 0] == [0, 0, 0, 1, 1, 1]
E       At index 3 diff: 0 != 1
E       Use -v to get the full diff

dimitrismistriotis commented 4 years ago

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

Copy+paste of the default values from the current version (https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html):

class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)

Copy+paste from 0.20.4, the closest version to the one used in this library (https://scikit-learn.org/0.20/modules/generated/sklearn.svm.LinearSVC.html):

class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)

They are the same, unless I am missing something. Perhaps we need to check whether the implementation has changed. Another possible cause is that the code in the blog post is not what was actually used to generate the models.
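To avoid comparing the two signatures by eye, the defaults can be diffed mechanically. The sketch below uses only the standard library and parses the two signature strings pasted above; `parse_defaults` and `diff_defaults` are hypothetical helper names, not part of sklearn or this library.

```python
import ast
from typing import Any, Dict, Tuple

# Signatures copied from the two documentation pages quoted above.
SIG_CURRENT = (
    "LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, "
    "C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, "
    "class_weight=None, verbose=0, random_state=None, max_iter=1000)"
)
SIG_0_20 = (
    "LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, "
    "C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, "
    "class_weight=None, verbose=0, random_state=None, max_iter=1000)"
)

MISSING = "<missing>"  # marker for a parameter present on one side only


def parse_defaults(signature: str) -> Dict[str, Any]:
    """Turn a call-style signature string into a {parameter: default} dict."""
    call = ast.parse(signature, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a call-style signature")
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


def diff_defaults(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return only the parameters whose defaults differ between a and b."""
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k, MISSING), b.get(k, MISSING))
        for k in keys
        if a.get(k, MISSING) != b.get(k, MISSING)
    }


differences = diff_defaults(parse_defaults(SIG_CURRENT), parse_defaults(SIG_0_20))
print(differences or "no differences in documented defaults")
```

With sklearn installed, the same comparison could be made on the dicts returned by `LinearSVC().get_params()` in each environment. An empty diff only rules out changed defaults for this one estimator; it says nothing about changes in the underlying implementation or in the other steps of the training pipeline.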