memray / OpenNMT-kpg-release

Keyphrase Generation
MIT License

tokenizing of letters, on prediction #56

Closed ahadda5 closed 1 year ago

ahadda5 commented 2 years ago

I'm running a version of infer.py similar to this

The code is essentially the same apart from my own custom dataset. The tokenizer is memray/bart_wikikp, and the model was trained on my own data. What is perplexing is that when I run predictions I get the output below. It would not be bad if we removed the special tokens (<...>) and joined the letters, e.g. "xss injection malicious" is a pretty good keyphrase. Why is the output letter by letter, and why are special tokens separating the letters? I'm missing something fundamental here.

["<s><s><s>x", "s", "s<category>,", "i", "n", "j", "e", "c", "t", " ", "m", "a", "l", "i<category>c", "i<header>o", "u", "s<infill> ", "c<infill>o", "d", "e<header> ", "i<infill>n", "t<category>o", " <category>l", "o", "n<category>g", "e<category>r", " <infill>s", "u<category>p", "p", "o<category>rt", "e<infill>d", " <seealso>", "s<header>", "s<seealso> ", "f", "i<seealso>", "l<infill>e", " ", "v", "a<infill>", "m<infill>", ""]

["<s><s><s>x", "s", "s<category>,", "c", "r", "o", "s<infill>s", " ", "s<header>i", "t", "e", " <category>s", "c<category>r", "i", "p", "t<header>i<category>n", "g", ",", "i<category>m", "pr", "r<category>o", "p<category>e", "r<infill> ", "u", "s<seealso>e", "d", " <infill>i", "n", "p<infill>u", "t<category> ", "v", "a", "l", "i<infill>d", "a<infill>t", "i<header>", "n<category>", "a<seealso>", "m", "e<category>m<category>", "o<present>"]
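To illustrate the cleanup described above (dropping the <...> tokens and joining the pieces), here is a minimal sketch. The helper name clean_pieces, the regular expression, and treating a comma as the separator between phrases are all assumptions made for illustration; none of them come from OpenNMT-kpg. It only post-processes the output and does not explain why the special tokens appear in the first place.

```python
import re

# Assumption: special tokens look like <name> with no spaces inside,
# e.g. <s>, <category>, <infill>, <header>, <seealso>, <present>.
SPECIAL_TOKEN = re.compile(r"<[^<>\s]+>")


def clean_pieces(pieces):
    """Strip <...> tokens from each piece, join the pieces, split on commas."""
    text = "".join(SPECIAL_TOKEN.sub("", piece) for piece in pieces)
    # Assumption: a comma separates phrases in the joined string.
    return [phrase.strip() for phrase in text.split(",") if phrase.strip()]


# The first few pieces of the first prediction shown above.
sample = ["<s><s><s>x", "s", "s<category>,", "i", "n", "j", "e", "c", "t",
          " ", "m", "a", "l", "i<category>c", "i<header>o", "u", "s<infill> "]

print(clean_pieces(sample))  # ['xss', 'inject malicious']
```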

memray commented 1 year ago

This doesn't look right to me. I manually added some special tokens to the tokenizer (I regret that), so could you try loading the tokenizer files I uploaded here?
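A quick way to sanity-check a tokenizer after loading it might look like the sketch below. It assumes the tokenizer is loadable with Hugging Face `transformers` via `AutoTokenizer.from_pretrained`; the helper names `describe_tokenizer` and `load_and_describe` and the sample text are hypothetical. The idea is to list the special tokens the tokenizer knows about and to confirm that a plain string survives an encode/decode round trip.

```python
def describe_tokenizer(tokenizer, text="cross site scripting"):
    """Report vocabulary size, special tokens, and an encode/decode round trip."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {
        "vocab_size": len(tokenizer),
        "special_tokens": list(tokenizer.all_special_tokens),
        "round_trip_ok": decoded.strip() == text,
    }


def load_and_describe(path_or_name):
    """Load a tokenizer by local directory or hub name, then describe it."""
    # Imported lazily so describe_tokenizer can be used without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path_or_name)
    return describe_tokenizer(tokenizer)
```

Calling `load_and_describe` on the directory holding the downloaded tokenizer files, and again on the tokenizer used for training, would show whether the two disagree on vocabulary size or special tokens.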