sayef / fsner

Few-shot Named Entity Recognition

The trained model does not work very well... #8

Open ScottishFold007 opened 2 years ago

ScottishFold007 commented 2 years ago

Hello! Your open-source project is great and very useful. When I tested it on a Chinese dataset, I trained for a few epochs and the results were not very good. Can you tell me what might be causing this? Training data: [image]

Example prediction:


import json
from fsner import FSNERModel, FSNERTokenizerUtils, pretty_embed

query_texts = [
    "阿贵住在户部巷吗?",
    "我不喜欢看《人鱼传说》",
    "我喜欢李柏林的'天空之城',写的很好"
]

support_texts = {
    "地址": [
        "彭小军认为,国内银行现在走的是[E]台湾[/E]的发卡模式,先通过跑马圈地再在圈的地里面选择客户,",
        "郑阿姨就赶到[E]文汇路[/E]排队拿钱,希望能将缴纳的一万余元学费拿回来,顺便找校方或者教委要个说法。",
        "如今着整个[E]潮白河[/E]区域环境的巨大变化和环首都经济圈的快速推进,夏威夷水岸1号的稀缺价值越来越明显,",
        "如今着整个潮白河区域环境的巨大变化和环首都经济圈的快速推进,[E]夏威夷水岸1号[/E]的稀缺价值越来越明显,",
        "这也让很多业主据此认为,[E]雅清苑[/E]是政府公务员挤对了国家的经适房政策。"
    ],
    "书籍": [
        "除了冠军外有7个名额的入围奖,奖品是[E]《暗黑破坏神》全套小说[/E]、《魔兽争霸》全套小说",
        "除了冠军外有7个名额的入围奖,奖品是《暗黑破坏神》全套小说、[E]《魔兽争霸》全套小说[/E]",
        "本次促销活动赠送的周边产品全部都是限量版啊!值得一提的是[E]《红楼梦》[/E]精美人物主题书签A组、",
        "“去年银监会下发[E]《关于信用卡套现活跃风险提示的通知》[/E]要求:严格禁止将pos机发放在个人名下,",
    ]
}

device = 'cpu'

model_path = '/content/checkpoints/model'
tokenizer = FSNERTokenizerUtils(model_path)
queries = tokenizer.tokenize(query_texts).to(device)
supports = tokenizer.tokenize(list(support_texts.values())).to(device)

model = FSNERModel(model_path)
model.to(device)

p_starts, p_ends = model.predict(queries, supports)

# One can prepare supports once and reuse them multiple times with different queries
# ------------------------------------------------------------------------------
# start_token_embeddings, end_token_embeddings = model.prepare_supports(supports)
# p_starts, p_ends = model.predict(queries, start_token_embeddings=start_token_embeddings,
#                                  end_token_embeddings=end_token_embeddings)

output = tokenizer.extract_entity_from_scores(query_texts, queries, p_starts, p_ends,
                                              entity_keys=list(support_texts.keys()), thresh=0.010)

print(json.dumps(output, indent=2, ensure_ascii=False))

# install displacy for pretty embed
pretty_embed(query_texts, output, list(support_texts.keys()))

image

sayef commented 2 years ago

Thanks for trying it out. Feedback like this is very helpful for fixing bugs and making the library more usable.

Which pretrained model did you use for training? For English, I used bert-base-uncased.

ScottishFold007 commented 2 years ago

Because I am using a Chinese dataset, I trained with Langboat/mengzi-bert-base, which is pretrained on a Chinese corpus.

ScottishFold007 commented 2 years ago

I don't know whether this has anything to do with the language, but East Asian languages such as Chinese, Korean, and Japanese typically require word segmentation.
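
On the segmentation question: BERT-style tokenizers (for example Hugging Face's `BertTokenizer` with its default settings) pad each CJK ideograph with whitespace before applying WordPiece, so Chinese text is usually processed one character at a time and no separate word segmenter is needed. The sketch below imitates only that one step in plain Python. It is a simplified illustration, not the library's code: `is_cjk` and `split_cjk` are hypothetical helper names, only a few Unicode ideograph ranges are checked, and punctuation handling is omitted. Whether the tokenizer shipped with Langboat/mengzi-bert-base behaves the same way is an assumption worth verifying directly.

```python
def is_cjk(ch):
    """Return True if ch falls in one of the main CJK ideograph ranges (simplified list)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0xF900 <= cp <= 0xFAFF     # Compatibility Ideographs
        or 0x20000 <= cp <= 0x2A6DF   # Extension B
    )


def split_cjk(text):
    """Surround every CJK ideograph with spaces, then split on whitespace."""
    padded = []
    for ch in text:
        padded.append(f" {ch} " if is_cjk(ch) else ch)
    return "".join(padded).split()


# Each ideograph becomes its own token; Latin runs stay together.
print(split_cjk("我喜欢BERT模型"))
```

If the real tokenizer does split per character like this, word segmentation is unlikely to be the cause of the poor results.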

ScottishFold007 commented 2 years ago

This is the style of the training corpus: {"address": ["彭小军认为,国内银行现在走的是[E]台湾[/E]的发卡模式,先通过跑马圈地再在圈的地里面选择客户,", "郑阿姨就赶到[E]文汇路[/E]排队拿钱,希望能将缴纳的一万余元学费拿回来,顺便找校方或者教委要个说法。", "如今着整个[E]潮白河[/E]区域环境的巨大变化和环首都经济圈的快速推进,夏威夷水岸1号的稀缺价值越来越明显,", "如今着整个潮白河区域环境的巨大变化和环首都经济圈的快速推进,[E]夏威夷水岸1号[/E]的稀缺价值越来越明显,", "这也让很多业主据此认为,[E]雅清苑[/E]是政府公务员挤对了国家的经适房政策。", "陈艳萍:买[E]西山[/E]的人的购房需求,主要有两种,一种是养老型的需求,很多人认为在西山是能够颐养天年的"], "name": ["[E]彭小军[/E]认为,国内银行现在走的是台湾的发卡模式,先通过跑马圈地再在圈的地里面选择客户,", "[E]温格[/E]的球队终于又踢了一场经典的比赛,2比1战胜曼联之后枪手仍然留在了夺冠集团之内,", "突袭黑暗雅典娜》中[E]Riddick[/E]发现之前抓住他的赏金猎人Johns,", "突袭黑暗雅典娜》中Riddick发现之前抓住他的赏金猎人[E]Johns[/E],", "吴三桂演义》小说的想像,说是为[E]牛金星[/E]所毒杀。……在小说中加插一些历史背景,", "市场仍存在对网络销售形式的需求,网络购彩前景如何?为此此我们采访业内专家[E]程阳[/E]先生。", "本报讯(记者[E]王吉瑛[/E])双色球即将出台新规,一等奖最高奖金可达到1000万元。昨天,中彩中心透露,", "价格高昂的大钻和翡翠消费为何如此火?通灵珠宝总裁[E]沈东军[/E]认为,这与原料稀缺有直接关系。“", "是目前表现最好的锋线组合之一,而[E]沃尔科特[/E]往往能够让对手的整个左边肋疲于防守,以目前枪手的能力和状态,", "[E]Svensson[/E]在接受媒体采访时表示,CAPCOM并没有放弃《街霸》电影系列,将推出新的《", "证券时报记者[E]唐曜华[/E]", "现役DotA明星选手,担任SOLO位的世界第一影魔[E]Pis(卜严骏)[/E],", "[E]郭庆祥[/E]:我们看画廊如果有好的艺术家,好的作品进去,我们是真正想去买好的艺术作品,而不是投资,", "腾讯新闻昨天[E]金庸[/E]逝世江湖再无金大侠订阅号消息昨天【15条】王者荣耀福利抢鲜看!队友的京东京东jd.", "[E]陈艳萍[/E]:买西山的人的购房需求,主要有两种,一种是养老型的需求,很多人认为在西山是能够颐养天年的,"]
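
Before training on data in this format, it may be worth checking the markers mechanically. The sketch below assumes, based only on the examples in this thread, that every example should contain exactly one `[E]...[/E]` pair with a non-empty entity inside. `parse_marked_example` and `check_corpus` are hypothetical helpers written for this illustration and are not part of fsner.

```python
# Marker strings as they appear in the examples in this thread.
START, END = "[E]", "[/E]"


def parse_marked_example(text):
    """Return (plain_text, entity, start, end) for one marked example.

    Raises ValueError if the example does not contain exactly one
    well-ordered [E]...[/E] pair around a non-empty entity.
    """
    if text.count(START) != 1 or text.count(END) != 1:
        raise ValueError(f"expected exactly one {START}...{END} pair: {text!r}")
    s = text.index(START)
    e = text.index(END)
    if e < s:
        raise ValueError(f"{END} appears before {START}: {text!r}")
    entity = text[s + len(START):e]
    if not entity.strip():
        raise ValueError(f"empty entity: {text!r}")
    # Remove the markers; the entity then sits at plain[s:s + len(entity)].
    plain = text[:s] + entity + text[e + len(END):]
    return plain, entity, s, s + len(entity)


def check_corpus(corpus):
    """Return a list of (label, index, message) for every malformed example."""
    problems = []
    for label, examples in corpus.items():
        for i, text in enumerate(examples):
            try:
                parse_marked_example(text)
            except ValueError as err:
                problems.append((label, i, str(err)))
    return problems


corpus = {
    "address": ["郑阿姨就赶到[E]文汇路[/E]排队拿钱"],
    "name": ["证券时报记者[E]唐曜华[/E]", "证券时报记者唐曜华"],  # second one has no markers
}
for problem in check_corpus(corpus):
    print(problem)
```

An empty result from `check_corpus` only rules out marker problems; it says nothing about label quality or how entity types are distributed.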

sayef commented 2 years ago

Dataset preparation and pretrained model selection seem fine. What were the val_loss_epoch and val_acc_epoch values after the first few epochs?

ScottishFold007 commented 2 years ago

image

sayef commented 2 years ago

Great. Please continue training for at least 20 epochs; it should get better. My results were not great in the first few epochs either.

ScottishFold007 commented 2 years ago

OK, I will continue training and report back later. Thank you for your careful reply!

sayef commented 2 years ago

@ScottishFold007 Hi again! Was your training successful?

ScottishFold007 commented 2 years ago

I have tried many approaches, but the results are still poor, and I don't know why. Could I send you the data and ask you to test it?

sayef commented 2 years ago

I can try. Please send a link to your dataset to hello@sayef.tech

ScottishFold007 commented 2 years ago

Hello, I've sent the processed training and test sets to hello@sayef.tech. Thanks for your help!

polodealvarado commented 2 years ago

Hi @ScottishFold007! Were you able to get better results?

ScottishFold007 commented 1 year ago

Still very bad.