naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Can SPLADE adapt to Chinese language ? #44

Closed liulizuel closed 8 months ago

liulizuel commented 11 months ago

Hi, I am interested in your great work. I tried to train a SPLADE model based on the RoBERTa checkpoint from Hugging Face (https://huggingface.co/hfl/chinese-roberta-wwm-ext) for my retrieval task over a Chinese corpus, but the results are not satisfactory. At inference time, my code is as follows:

texts = ['王者荣耀好玩吗', '带你上王者', '如何下载王者荣耀', '鲁班怎么利用普通攻击']
embeds = batch_embed_doc(texts=texts, encoder=encoder, tokenizer=tokenizer, max_len=max_doc_len)
for i in range(len(texts)):
    print(texts[i])
    print(tokenizer.decode(embeds[i].topk(k=40).indices))

Then, I got the result:

王者荣耀好玩吗
700 喺帐 卷喉鲱 st fgo 蠹44 改判 短淇 貂 混华 oil賽 陇 谁 00 邇 呐 ssd 踝 ⒈ 2014 洞 天ᅦ 诰 or 西 乌 京 艷對 鬼 nt
带你上王者
700 呐 nt st 爸淇ᅦ 踝 git 艷鲱 dyson 貂淮44 ( 输 卷 购 53 才 葦 誣鼹 is 揶項θ 佈 cdma 贡 i3 { 马 fgo 邇 搜 以 乌帐
如何下载王者荣耀
700帐 喺喉 fgo 卷判 貂 短44鲱 st 蠹华 改 谁 00 oil淇 陇賽 混 ssd 踝 2014 or ⒈ 天ᅦ 邇 艷 射 璉 京浣 战 載對 跚 呐
鲁班怎么利用普通攻击
職 诰 据 尖閏哄my 20尔x 漏 表 才 剃 32g5s gohappymic 灞首缆 塊 互 山 种 怡 购椎 麒 奈級曇 膏 洛污 唔 find 躁

This leaves me with two questions: 1. Can SPLADE adapt to the Chinese language? 2. What should I do to extend SPLADE to a Chinese corpus?
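When debugging output like the above, it may help to print each top term together with its weight. torch.topk always returns k indices, so if a vector has fewer than 40 non-zero entries, some of the printed tokens carry zero weight, and tokenizer.decode can also merge adjacent subword pieces. The following is a minimal sketch in plain Python. The function name top_terms, the toy vocabulary, and the toy vector are hypothetical and are not part of the SPLADE code base.

```python
from typing import Dict, List, Tuple


def top_terms(
    weights: List[float], id_to_token: Dict[int, str], k: int = 5
) -> List[Tuple[str, float]]:
    """Return up to k (token, weight) pairs, highest weight first."""
    # Keep only dimensions with positive weight, so that zero-weight
    # vocabulary entries are never reported as "top" terms.
    active = [(i, w) for i, w in enumerate(weights) if w > 0.0]
    active.sort(key=lambda pair: pair[1], reverse=True)
    return [(id_to_token.get(i, f"<id {i}>"), w) for i, w in active[:k]]


# Toy data (hypothetical): a 6-term vocabulary and one made-up sparse vector.
toy_vocab = {0: "[PAD]", 1: "王", 2: "者", 3: "荣", 4: "耀", 5: "吗"}
toy_vector = [0.0, 1.7, 1.2, 0.0, 2.3, 0.1]

result = top_terms(toy_vector, toy_vocab, k=3)
print(result)
```

With a real model, the weights would come from one row of the document embeddings and the id-to-token mapping from the tokenizer vocabulary.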

cadurosar commented 10 months ago

So for Chinese we have a few models that work (https://huggingface.co/naver/neuclir22-splade-zh and https://huggingface.co/naver/neuclir22-pretrained-zh), but they are mostly trained from scratch. Unfortunately, there are some problems with using RoBERTa for SPLADE (see Figure 2 of https://user.eng.umd.edu/~oard/pdf/desires22.pdf). We explain a bit about how we trained these models in https://arxiv.org/pdf/2303.11171.pdf and https://arxiv.org/pdf/2301.10444.pdf.

Hope this helps, let me know if you have more questions

cadurosar commented 8 months ago

Without any update, I'm closing this, feel free to reopen if needed

315930399 commented 4 months ago

I tried this model (naver/neuclir22-splade-zh) and found that when the input text is long, it raises this error: indexSelectLargeIndex: block: xxx, thread: xxx Assertion srcIndex < srcSelectDimSize failed

liulizuel commented 4 months ago

I ran into this before. I don't know why, but after I changed the input length from 512 to 511, the error went away.

315930399 commented 4 months ago

Where do I change that? Please give me some guidance.

liulizuel commented 4 months ago

Do a global search for 512 across all of your code files and replace it with 511.

315930399 commented 4 months ago

I only see the parameter "max_position_embeddings": 514 in this file: https://huggingface.co/naver/neuclir22-splade-zh/blob/main/config.json. I could not find any parameter related to 512.
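One possible reading of how 514 relates to 512: in RoBERTa-style models, position ids for real tokens start at padding_idx + 1 (padding_idx is usually 1), so a position table with 514 rows can address at most 512 tokens. The sketch below reproduces that arithmetic in plain Python. It assumes this checkpoint follows the RoBERTa-style scheme, which has not been verified here, and the helper names are hypothetical.

```python
from typing import List

MAX_POSITION_EMBEDDINGS = 514  # value reported from config.json in this thread


def roberta_style_position_ids(input_ids: List[int], padding_idx: int = 1) -> List[int]:
    """Mimic RoBERTa-style position ids (assumed scheme, see lead-in).

    Padding tokens keep position padding_idx; every other token gets
    (its 1-based rank among non-pad tokens) + padding_idx.
    """
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


def fits(num_tokens: int) -> bool:
    """True if every position id is a valid row of the position table."""
    dummy_ids = [5] * num_tokens  # arbitrary non-pad token id
    return max(roberta_style_position_ids(dummy_ids)) < MAX_POSITION_EMBEDDINGS


print(fits(512), fits(513))
```

Under this assumption, a 513-token input would request row 514 of a 514-row table, which is the kind of out-of-range lookup the srcIndex < srcSelectDimSize assertion reports.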

carlos-lassance commented 4 months ago

I would suggest trying to remove the token_type_ids (adding something like return_token_type_ids=False to the tokenization). We had some problems with that before

315930399 commented 4 months ago

Thank you for your reply. I tried the setting 'return_token_type_ids=False', but with long input it gave me another error: RuntimeError: The expanded size of the tensor (xxx) must match the existing size (514) at non-singleton dimension 1. Target sizes: [1, xxx]. Tensor sizes: [1, 514]. I finally solved the problem by passing 'max_length=514' to the tokenizer.

carlos-lassance commented 4 months ago

Oh great. I would recommend limiting it to 512 though; that would be more consistent with the training.
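Putting the two suggestions from this thread together, the tokenization call might look like the sketch below. The wrapper name tokenize_for_splade is hypothetical; the keyword arguments are standard Hugging Face tokenizer call options, but the combination has not been tested against this checkpoint here.

```python
from typing import Any, List


def tokenize_for_splade(tokenizer: Any, texts: List[str], max_len: int = 512) -> Any:
    """Tokenize texts with the settings suggested in this thread.

    `tokenizer` is expected to be a Hugging Face tokenizer, e.g. (assumption):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained("naver/neuclir22-splade-zh")
    """
    return tokenizer(
        texts,
        max_length=max_len,           # 512, as recommended above
        truncation=True,              # cut longer inputs instead of failing
        padding=True,                 # pad to the longest text in the batch
        return_token_type_ids=False,  # drop token_type_ids, as suggested above
        return_tensors="pt",          # PyTorch tensors
    )
```

Note that max_length only takes effect when truncation is enabled; without it, long inputs are passed to the model at full length.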