umcu / negation-detection

Negation detection in Dutch clinical text.
GNU General Public License v3.0

Experimented with adding space at start of sentence #7

Closed. sandertan closed this issue 3 years ago

sandertan commented 3 years ago

Weird behavior with and without a space at the start of the sentence when using ByteLevelBPETokenizer(add_prefix_space=True)


# ByteLevelBPETokenizer()
Epoch: 9 **************************************************  Train
              precision    recall  f1-score   support

           0       0.93      0.82      0.87      1591
           1       0.97      0.99      0.98      9758

    accuracy                           0.97     11349
   macro avg       0.95      0.91      0.93     11349
weighted avg       0.97      0.97      0.97     11349

Epoch: 9 **************************************************  Test
              precision    recall  f1-score   support

           0       0.90      0.78      0.84       182
           1       0.96      0.99      0.97      1080

    accuracy                           0.96      1262
   macro avg       0.93      0.88      0.91      1262
weighted avg       0.96      0.96      0.96      1262

Train Loss: 0.12445199813260893
Test Loss:  0.14788185196812265

# ByteLevelBPETokenizer(add_prefix_space=True)
Epoch: 9 **************************************************  Train
              precision    recall  f1-score   support

           0       0.93      0.81      0.86      1591
           1       0.97      0.99      0.98      9758

    accuracy                           0.96     11349
   macro avg       0.95      0.90      0.92     11349
weighted avg       0.96      0.96      0.96     11349

Epoch: 9 **************************************************  Test
              precision    recall  f1-score   support

           0       0.93      0.81      0.87       182
           1       0.97      0.99      0.98      1080

    accuracy                           0.96      1262
   macro avg       0.95      0.90      0.92      1262
weighted avg       0.96      0.96      0.96      1262

Train Loss: 0.12479806886854726
Test Loss:  0.14569044386735186

myrthemh commented 3 years ago

Interesting, because when I set add_prefix_space=True I get the following: [screenshot, 2021-06-17 09:27:49]

Although the output is incorrect, it seems to handle spaces at the beginning of the sentence correctly.

sandertan commented 3 years ago

You're right, add_prefix_space=True was not set when loading the tokenizer from file. I added functionality to MetaCAT for that in https://github.com/CogStack/MedCAT/pull/75. I get the same results now.
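
For anyone reading along, here is a rough sketch of why the flag matters. This is not the actual `tokenizers` implementation: `toy_pre_tokenize` and its regex are hypothetical, a simplified stand-in for the GPT-2 style split that byte-level BPE uses (contractions left out). It only shows that a leading space becomes part of the word piece, so the first word of a sentence looks different from the same word mid-sentence unless a prefix space is added.

```python
import re
from typing import List

# Simplified GPT-2 style split pattern (assumption: contractions omitted,
# Unicode categories approximated with the stdlib `re` classes).
_SPLIT = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")


def toy_pre_tokenize(text: str, add_prefix_space: bool = False) -> List[str]:
    """Split text into word pieces, marking each space as U+0120 ('Ġ')."""
    # Mimic the prefix-space option: prepend a space only if none is there yet.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    # A space that precedes a word stays attached to that word.
    return [piece.replace(" ", "\u0120") for piece in _SPLIT.findall(text)]
```

With this sketch, `toy_pre_tokenize("geen koorts")` gives `geen` and `Ġkoorts`, while `add_prefix_space=True` gives `Ġgeen` and `Ġkoorts`, the same pieces as for an input that already starts with a space. So if the flag is silently dropped when the tokenizer is loaded from file, sentence-initial words can map to different vocabulary entries than the ones seen during training.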

sandertan commented 3 years ago

The change was merged into MedCAT master. If you pull that, you will be able to run this code.