lephong / mulrel-nel

named entity linking with latent relations
Apache License 2.0

Why use two types of word segmentation methods? #22

Closed jianyucai closed 4 years ago

jianyucai commented 4 years ago

Hi, thanks for your code.

I have a question about the word segmentation applied to the contexts of mentions. I noticed that you apply two different methods. In the first method, you split the words from m['context']. https://github.com/lephong/mulrel-nel/blob/db14942450f72c87a4d46349860e96ef2edf353d/nel/ed_ranker.py#L196-L199

In the second method, the words are already segmented in files such as aida_train.txt. https://github.com/lephong/mulrel-nel/blob/db14942450f72c87a4d46349860e96ef2edf353d/nel/ed_ranker.py#L215-L216 https://github.com/lephong/mulrel-nel/blob/db14942450f72c87a4d46349860e96ef2edf353d/nel/ed_ranker.py#L219-L220

So I am wondering: why not apply a single, uniform method?
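
For concreteness, here is a minimal sketch of the two paths as I understand them (all variable names below are hypothetical stand-ins, not the actual ones in ed_ranker.py):

```python
# Method 1: tokenize the raw context strings stored with the mention by
# splitting on whitespace (roughly what the first linked lines do).
mention = {'context': ('Former U.S. president', 'visited Berlin today')}
left_raw, right_raw = mention['context']
left_words = left_raw.strip().split()    # ['Former', 'U.S.', 'president']
right_words = right_raw.strip().split()  # ['visited', 'Berlin', 'today']

# Method 2: reuse the segmentation already given by the dataset file
# (e.g. aida_train.txt), where the text arrives pre-tokenized.
doc_words = ['Obama', 'visited', 'Berlin', 'today']  # already one word each
```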

lephong commented 4 years ago

Sorry I can't recall completely. What I remember is:

The first context is for the local model, which is reused from Ganea and Hofmann 2017. The context they used (so our first context) is cross-sentence, while our second context is not. Because of that, we reused their data preprocessing (and their code) to create the first context, and hence their word segmentation as well. For the second context, we simply used the word segmentation given by the dataset.

I don't think there is any major difference between the two segmentation methods; they just made it easy for us to compute the different contexts.
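
Roughly, the difference between the two contexts can be pictured like this (a hedged sketch, not the actual code; the function and variable names are made up):

```python
def cross_sentence_context(sentences, sent_idx, tok_idx, window=50):
    """First context (Ganea and Hofmann 2017 style): a flat token window
    around the mention that is allowed to cross sentence boundaries."""
    flat = [w for sent in sentences for w in sent]
    offset = sum(len(s) for s in sentences[:sent_idx]) + tok_idx
    return flat[max(0, offset - window):offset] + flat[offset:offset + window]

def sentence_context(sentences, sent_idx):
    """Second context: bounded by the mention's own sentence, using the
    segmentation given by the dataset file."""
    return sentences[sent_idx]
```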

jianyucai commented 4 years ago

Thanks for your reply!

May I ask another question? I noticed that the candidate entities are ranked in decreasing order of p(e|m). However, different mentions in the dataset usually have different numbers of candidates. Some mentions may have only 1 or 2 candidate entities, while others may have as many as 100.

I understand that there is a large number of entities (as many as 200,000), so it is desirable to select a small set of candidates in advance. But I am wondering: why not select a uniform number of candidates, for example, 100 candidates for each mention? Is this also inherited from Ganea and Hofmann 2017?

lephong commented 4 years ago

For a mention, the number of candidates depends on the alias dictionary extracted from Wikipedia. For instance, if the alias is "Obama", there can be 1000 candidates. But for "Barack Obama", the number of candidates should be smaller, maybe only 50.

I remember that in the dataset, some mentions have rather few candidates because of the alias dictionary. That's why you see some mentions with only 2-3 candidates while others have 100.
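
In sketch form (the dictionary contents, probabilities, and names below are invented purely for illustration; the real dictionary is extracted from Wikipedia and is much larger):

```python
# Toy alias dictionary mapping a mention string to (entity, p(e|m)) pairs.
alias_dict = {
    'Obama': [('Barack_Obama', 0.90), ('Michelle_Obama', 0.05),
              ('Obama,_Fukui', 0.01)],          # ... could be ~1000 entries
    'Barack Obama': [('Barack_Obama', 0.99)],   # more specific alias, fewer
}

def candidates(mention, n_cands=30):
    """Rank by the prior p(e|m) and keep at most n_cands; a mention whose
    alias has few entries simply yields fewer candidates."""
    cands = sorted(alias_dict.get(mention, []), key=lambda ep: ep[1],
                   reverse=True)
    return cands[:n_cands]

print(candidates('Obama'))         # three candidates in this toy dictionary
print(candidates('Barack Obama'))  # only one
```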

jianyucai commented 4 years ago

Thanks for your reply!