syuoni / eznlp

Easy Natural Language Processing
Apache License 2.0

Boundary smoothing: what happens when two entities are directly adjacent? #50

Closed terenceau2 closed 1 month ago

terenceau2 commented 4 months ago

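Hello, I would like to ask a theoretical question.

Suppose I have an entity of class A located at the third character of the sentence, so its span position is (3, 3). Right next to it is another entity, of type B, at position (4, 4).

Now I apply boundary smoothing (distance 2, epsilon = 0.2). The probability of entity A at (3, 3) becomes 1 - 0.2 = 0.8, and the neighbouring spans, for example (4, 4), each receive a small share, epsilon / num_of_surrounding_spans. That share collides with entity B at (4, 4). How is this case handled? (Likewise, smoothing entity B spills onto entity A at (3, 3).)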

terenceau2 commented 4 months ago

In other words, in the case above, (4, 4) would carry 0.8 for entity B plus epsilon / num_of_surrounding_spans for entity A. Is that right? This is my understanding based on this code:

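                # Each chunk is (label, start, end); the last token of the span is end-1.
                for label, start, end in self.chunks:
                    label_id = config.label2idx[label]
                    # The gold span keeps 1 - epsilon of the mass for its own label.
                    self.label_ids[start, end-1, label_id] += (1 - config.sb_epsilon)

                    for dist in range(1, config.sb_size+1):
                        # Each distance gets epsilon / sb_size, split over its 4 * dist surrounding positions.
                        eps_per_span = config.sb_epsilon / (config.sb_size * dist * 4)
                        sur_spans = list(_spans_from_surrounding((start, end), dist, self.num_tokens))
                        for sur_start, sur_end in sur_spans:
                            # `+=` accumulates, so a span near several entities collects mass for each label.
                            self.label_ids[sur_start, sur_end-1, label_id] += (eps_per_span*config.sb_adj_factor)
                        # Absorb the probabilities assigned to illegal positions
                        self.label_ids[start, end-1, label_id] += eps_per_span * (dist * 4 - len(sur_spans))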

I ask because I did not fully understand this sentence in your paper: After such entity probability re-allocation, any remaining probability of a span is assigned to be “non-entity”
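The accumulation in the quoted loop can be tried outside the library. The following is a minimal standalone sketch, not eznlp's implementation: `spans_from_surrounding` is a hypothetical stand-in for `_spans_from_surrounding` (its body is not shown in this thread, so the validity check `0 <= start < end <= num_tokens` is an assumption), a nested dict replaces `self.label_ids`, and `sb_adj_factor` is assumed to be 1. Spans use an exclusive end index, as the `end-1` indexing in the quoted code suggests, so the (3, 3) and (4, 4) of the question become (3, 4) and (4, 5).

```python
from collections import defaultdict


def spans_from_surrounding(span, dist, num_tokens):
    """Hypothetical stand-in for `_spans_from_surrounding`: yield the valid
    (start, end) spans, end exclusive, whose two boundaries differ from
    `span` by `dist` positions in total."""
    start, end = span
    for k in range(dist):
        for ds, de in [(-k, -dist + k), (-dist + k, k), (k, dist - k), (dist - k, -k)]:
            s, e = start + ds, end + de
            if 0 <= s < e <= num_tokens:  # assumed validity check
                yield (s, e)


def smooth_boundaries(chunks, num_tokens, sb_epsilon=0.2, sb_size=2, sb_adj_factor=1.0):
    """Return {(start, end): {label: probability}} for entity labels only."""
    probs = defaultdict(lambda: defaultdict(float))
    for label, start, end in chunks:
        probs[(start, end)][label] += 1 - sb_epsilon
        for dist in range(1, sb_size + 1):
            eps_per_span = sb_epsilon / (sb_size * dist * 4)
            sur_spans = list(spans_from_surrounding((start, end), dist, num_tokens))
            for sur in sur_spans:
                probs[sur][label] += eps_per_span * sb_adj_factor
            # Mass meant for invalid positions goes back to the entity span.
            probs[(start, end)][label] += eps_per_span * (dist * 4 - len(sur_spans))
    return probs


# Entity A covers token 3 and entity B covers token 4 (end index exclusive).
probs = smooth_boundaries([("A", 3, 4), ("B", 4, 5)], num_tokens=10)
for span in [(3, 4), (4, 5), (3, 5)]:
    print(span, dict(probs[span]))
```

Under these assumptions, span (3, 4) ends up with mass for both A and B, and the span (3, 5) covering both tokens receives a small share of each label. A single-token entity keeps more than 1 - eps here, because the shares of positions that are not valid spans are added back to it.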

syuoni commented 4 months ago

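Hello,

According to the formula, for span (3, 3):

  • the probability of Entity A is 1 - eps = 1 - 0.2 = 0.8
  • the probability of Entity B is eps / (sb_size * dist * 4) = 0.2 / (2 * 2 * 4) = 0.0125
  • the probability of Non-entity is 1 - 0.8 - 0.0125 = 0.1875

Similarly, for span (4, 4):

  • the probability of Entity B is 0.8
  • the probability of Entity A is 0.0125
  • the probability of Non-entity is 0.1875

As you can see, the ground truth is a probability distribution over all entity classes. Simply fit the predicted distribution to this ground-truth distribution with soft-label cross entropy.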
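As an illustration of the soft-label cross entropy mentioned here, the sketch below computes the loss for a single span in plain Python. It uses the target for span (3, 3) worked out in this thread (0.8 for A, 0.0125 for B, 0.1875 for non-entity). The class order and the logits are made up for the example, and this is not the library's loss code.

```python
import math


def soft_label_cross_entropy(logits, target):
    """Return -sum_c target[c] * log_softmax(logits)[c] for one span."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


# Class order (assumed for this example): [non-entity, A, B].
target_33 = [0.1875, 0.8, 0.0125]  # smoothed ground truth for span (3, 3)
logits = [0.5, 2.0, -1.0]          # made-up model scores for that span

loss = soft_label_cross_entropy(logits, target_33)

# The loss is smallest when the predicted distribution equals the target;
# its value is then the entropy of the target distribution.
matched = soft_label_cross_entropy([math.log(t) for t in target_33], target_33)
print(loss, matched)
```

Because the target is a full distribution rather than a one-hot vector, every class contributes to the loss in proportion to its target probability.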

houyuchao commented 1 month ago

eps_per_span

Hello, which file is this code in? I couldn't find it.

syuoni commented 1 month ago

It's here: https://github.com/syuoni/eznlp/blob/master/eznlp/model/decoder/boundaries.py#L173-L179

houyuchao commented 1 month ago

Thank you very much.

houyuchao commented 1 month ago

Hello, have you managed to reproduce this paper? I can't quite work out how the boundary smoothing method is invoked in the main script entity_recognition.

dpj135 commented 1 month ago

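syuoni wrote:

Hello,

According to the formula, for span (3, 3):

  • the probability of Entity A is 1 - eps = 1 - 0.2 = 0.8
  • the probability of Entity B is eps / (sb_size * dist * 4) = 0.2 / (2 * 2 * 4) = 0.0125
  • the probability of Non-entity is 1 - 0.8 - 0.0125 = 0.1875

Similarly, for span (4, 4):

  • the probability of Entity B is 0.8
  • the probability of Entity A is 0.0125
  • the probability of Non-entity is 0.1875

As you can see, the ground truth is a probability distribution over all entity classes. Simply fit the predicted distribution to this ground-truth distribution with soft-label cross entropy.

Hello, if it is computed this way, won't the probabilities of the surrounding non-entity spans sum to more than 1? Does the “non-entity” probability of those non-entity spans also have to be reduced accordingly?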

syuoni commented 1 month ago

The call to boundary_smoothing in entity_recognition.py is here: https://github.com/syuoni/eznlp/blob/master/scripts/entity_recognition.py#L244-L259

syuoni commented 1 month ago

Yes, the “non-entity” probability of the surrounding non-entity spans is reduced accordingly.
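To make this concrete, here is a small sketch that builds the full target of a single span from the formula discussed in this thread, and gives whatever the entity classes did not claim to “non-entity”. It is an illustration only: `smoothed_target` is a hypothetical helper, spans are inclusive (start, end) pairs as in the question, and edge effects and `sb_adj_factor` are ignored.

```python
def smoothed_target(span, entities, sb_epsilon=0.2, sb_size=2):
    """Target distribution of one span, following the formula in this thread.

    `entities` maps an inclusive (start, end) span to its label.
    """
    target = {}
    for (start, end), label in entities.items():
        # Distance = total shift of the two boundaries.
        dist = abs(span[0] - start) + abs(span[1] - end)
        if dist == 0:
            mass = 1 - sb_epsilon
        elif dist <= sb_size:
            mass = sb_epsilon / (sb_size * dist * 4)
        else:
            continue
        target[label] = target.get(label, 0.0) + mass
    # Whatever the entity classes did not claim is left for "non-entity".
    target["non-entity"] = 1.0 - sum(target.values())
    return target


entities = {(3, 3): "A", (4, 4): "B"}
print(smoothed_target((3, 3), entities))  # the entity A span
print(smoothed_target((3, 4), entities))  # a non-entity span next to both
print(smoothed_target((0, 0), entities))  # a span far from both entities
```

With these inputs, span (3, 3) comes out at roughly 0.8 / 0.0125 / 0.1875, while the neighbouring non-entity span (3, 4) gets 0.025 for each of A and B and keeps about 0.95 for non-entity, so every span's distribution still sums to 1.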