liulizuel closed this issue 8 months ago
For Chinese, we have a few models that work (https://huggingface.co/naver/neuclir22-splade-zh and https://huggingface.co/naver/neuclir22-pretrained-zh), but they are mostly trained from scratch. Unfortunately, there are some problems with using RoBERTa for SPLADE (see Figure 2 of https://user.eng.umd.edu/~oard/pdf/desires22.pdf). We explain a bit about how we trained our models in https://arxiv.org/pdf/2303.11171.pdf and https://arxiv.org/pdf/2301.10444.pdf.
Hope this helps. Let me know if you have more questions.
Without any update, I'm closing this. Feel free to reopen if needed.
I tried this model (neuclir22-splade-zh) and found that when the input text is long, it gives me this error:
indexSelectLargeIndex: block: xxx, thread: xxx Assertion srcIndex < srcSelectDimSize failed
I ran into this before. I don't know why, but changing the input length from 512 to 511 fixed the error.
Where do I change that? Please advise.
Do a global search for 512 across all of your code files and replace it with 511.
I only see the parameter "max_position_embeddings": 514 in this file: https://huggingface.co/naver/neuclir22-splade-zh/blob/main/config.json. I could not find any parameter related to 512.
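One possible explanation for the gap between 514 and 512, offered as a sketch rather than a confirmed diagnosis: in the Hugging Face transformers implementation of RoBERTa-style models, position ids for real tokens start at padding_idx + 1 instead of 0. Assuming this checkpoint follows that convention with padding_idx = 1 (an assumption, not something stated in this thread), a position table with 514 rows can only serve 512 tokens. The helper names below are hypothetical and written in plain Python so they run without any dependencies.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Mimic RoBERTa-style position ids (assumed convention).

    Padding tokens keep position padding_idx; every other token gets
    (number of non-pad tokens seen so far) + padding_idx.
    """
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


def max_usable_length(max_position_embeddings=514, padding_idx=1):
    """Longest sequence whose largest position id still indexes the table.

    Valid rows are 0 .. max_position_embeddings - 1, and n non-pad tokens
    produce a largest position id of n + padding_idx.
    """
    return max_position_embeddings - 1 - padding_idx


if __name__ == "__main__":
    print(max_usable_length())  # longest safe length under these assumptions
    # The largest position id produced by a 514-token input exceeds the last row (513).
    print(max(roberta_position_ids([5] * 514)))
```

Under these assumptions, a 514-token input asks for a position row that does not exist, which is the kind of out-of-range lookup that the `srcIndex < srcSelectDimSize` assertion reports. This does not explain why 511 worked where 512 did not in the report above.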
I would suggest removing the token_type_ids (by adding something like return_token_type_ids=False to the tokenization). We have had some problems with that before.
Thank you for your reply. I tried the setting 'return_token_type_ids=False', but it gave me another error when the input is long:
RuntimeError: The expanded size of the tensor (xxx) must match the existing size (514) at non-singleton dimension 1. Target sizes: [1, xxx]. Tensor sizes: [1, 514]
I finally solved this problem by passing 'max_length=514' to the tokenization.
Oh, great. I would recommend limiting it to 512 though, since that is more consistent with how the model was trained.
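A minimal sketch of tokenizer settings that combine the suggestions in this thread: truncate to 512 tokens and skip token_type_ids. The helper names `build_tokenizer_kwargs` and `encode` are hypothetical, and loading the checkpoint through `AutoTokenizer` is an assumption about how the model is used. The keyword arguments themselves are standard Hugging Face tokenizer options.

```python
def build_tokenizer_kwargs(max_length=512):
    """Collect the tokenizer call options discussed in the thread."""
    return {
        "truncation": True,              # cut inputs longer than max_length
        "max_length": max_length,        # counts special tokens as well
        "return_token_type_ids": False,  # avoid the token_type_ids issue
        "return_tensors": "pt",
    }


def encode(texts, model_name="naver/neuclir22-splade-zh", max_length=512):
    """Tokenize a list of texts with the settings above.

    transformers is imported lazily so build_tokenizer_kwargs can be
    inspected without the library installed or the model downloaded.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(texts, padding=True, **build_tokenizer_kwargs(max_length))
```

Calling `encode(["some long document ..."])` should return input_ids and attention_mask with at most 512 tokens per text, which can then be passed to the model.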
Hi, I am interested in your great work. I tried to train a SPLADE model based on the RoBERTa checkpoint from Hugging Face (https://huggingface.co/hfl/chinese-roberta-wwm-ext) for my retrieval task over a Chinese corpus, but the results are not satisfactory. At the inference stage, my code is as follows:
Then I got the result:
This leaves me with two questions: 1. Can SPLADE adapt to the Chinese language? 2. What should I do to extend SPLADE to a Chinese corpus?