midas-research / dlkp

A deep learning library for identifying keyphrases from text
MIT License

Model Predictions have Issues with merging sub-words #20

Closed by debanjanbhucs 2 years ago

debanjanbhucs commented 2 years ago

The model prediction seems to have a bug: sub-words are not merged properly in the output.

For example this is the output obtained:

[[' representation', ' documents', ' mask', 'ing strategies', ' transformer language models', ' discrim', 'in', ' gener', 'ative', ' discrim', 'in', 'ative', ' Key', 'phrase Boundary Infilling with Replacement', 'K', 'BI', ' key', 'phrase extraction', ' gener', 'ative', ' BART', ' Key', 'B', 'ART', ' Cat', 'Se', 'q', ' key', ' generation', ' named entity recognition', ' answering', ' relation extraction', ' abstract', 'ive summarization']]

After executing the following code:

from dlkp.models import KeyphraseTagger

tagger = KeyphraseTagger.load(
    model_name_or_path="../../outputs"
)

input_text = "In this work, we explore how to learn task-specific language models aimed towards learning rich " \
             "representation of keyphrases from text documents. We experiment with different masking strategies for " \
             "pre-training transformer language models (LMs) in discriminative as well as generative settings. In the " \
             "discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with " \
             "Replacement (KBIR), showing large gains in performance (upto 9.26 points in F1) over SOTA, when LM " \
             "pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we " \
             "introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the " \
             "input text in the CatSeq format, instead of the denoised original input. This also led to gains in " \
             "performance (upto 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we also " \
             "fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), " \
             "relation extraction (RE), abstractive summarization and achieve comparable performance with that of the " \
             "SOTA, showing that learning rich representation of keyphrases is indeed beneficial for many other " \
             "fundamental NLP tasks."

keyphrases = tagger.predict(input_text)
print(keyphrases)

As can be seen in the output, 'mask' and 'ing strategies' are treated as separate keyphrases. This looks like a bug in how sub-words are put back together when the prediction output is formatted.

debanjanbhucs commented 2 years ago

@ad6398 The model files used for prediction can be obtained here: https://huggingface.co/dmahata/dlkp_test

ad6398 commented 2 years ago

Hey @debanjanbhucs, I printed each token and its tag. As guessed, the issue is not with the decoding algorithm but with the model: it is not trained well enough to identify "masking strategies" as a single keyphrase. The attached file (token_tag.txt) has the tokens and the tags predicted by the model; we can clearly see that 'mask' has a B tag, 'ing' has a B tag, and 'strategies' has an I tag. The same goes for the other keyphrases.
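
For reference, here is a minimal sketch of how BIO tags over sub-word tokens are typically merged into keyphrases (this is not dlkp's actual decoder, just an illustration). With the tags reported above ('mask' → B, 'ing' → B, 'strategies' → I), a B tag always starts a new phrase, so 'mask' and 'ing strategies' come out as two separate keyphrases no matter how the sub-words are joined:

```python
def merge_bio(tokens, tags):
    """Merge (token, tag) pairs into keyphrases; tags are 'B', 'I', or 'O'.
    Sub-word tokens are assumed to already carry their leading spaces
    (RoBERTa-style), so phrases are joined with an empty string."""
    keyphrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                    # B starts a new keyphrase
            if current:
                keyphrases.append("".join(current))
            current = [token]
        elif tag == "I" and current:      # I continues the current keyphrase
            current.append(token)
        else:                             # 'O' (or a dangling 'I') closes it
            if current:
                keyphrases.append("".join(current))
            current = []
    if current:
        keyphrases.append("".join(current))
    return keyphrases

# Tags as predicted by the model for this span (from token_tag.txt):
tokens = [" mask", "ing", " strategies"]
tags = ["B", "B", "I"]
print(merge_bio(tokens, tags))  # [' mask', 'ing strategies'] -- two phrases, as reported
```

So the decoding behaves as expected given the tags; the split comes from the model predicting B on 'ing' rather than I.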