hanwei2008 / jwpl

Automatically exported from code.google.com/p/jwpl

Title disambiguation text extraction #93

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Search for pages such as Ytterbium(III)_chloride_(data_page) or Willen(Wittmund).
2. JWPL does not return the page.

What is the expected output? What do you see instead?
I expected Ytterbium(III)_chloride_(data_page) and Willen(Wittmund).
I see Ytterbiu_(III)_chloride_(data_page) and Wille_(Wittmund) instead.

What version of the product are you using? On what operating system?
1.0

Please provide any additional information below.
I noticed that the title reformatting in lines 71-85 of Title.java causes 
this issue. The character before the "(" is always replaced with " ". 
In the examples above, however, that character is not " " but a letter ("m" 
and "n"). I suggest changing the code in lines 79-85:
if (matcherNamespace.find()) {
    this.entity = matcherNamespace.group(1);
    this.disambiguationText = matcherNamespace.group(2);

    String relevantTitleParts = this.entity + " (" + this.disambiguationText + ")";
    this.plainTitle = decodeTitleWikistyle(relevantTitleParts);
    this.wikiStyleTitle = encodeTitleWikistyle(relevantTitleParts);

into

int lpIdx = titlePart.lastIndexOf("(");
if (lpIdx != -1 && titlePart.charAt(titlePart.length() - 1) == ')') {
    this.entity = titlePart.substring(0, lpIdx);
    this.disambiguationText = titlePart.substring(lpIdx + 1, titlePart.length() - 1);
    this.plainTitle = decodeTitleWikistyle(titlePart);
    this.wikiStyleTitle = encodeTitleWikistyle(titlePart);

Original issue reported on code.google.com by astronau...@gmail.com on 15 May 2012 at 12:21
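
For readers who want to try the proposed change outside of JWPL, here is a 
minimal, self-contained sketch of the same splitting logic. The class name 
LastParenSplitDemo and the helper split() are hypothetical and not part of 
JWPL, and the third input is a made-up title used only for illustration.

```java
final class LastParenSplitDemo {
    /**
     * Splits a title at its last "(" when the title ends with ")".
     * Returns {entity, disambiguationText}, or null if the title has no
     * trailing parenthetical. Hypothetical helper, not JWPL code.
     */
    static String[] split(String titlePart) {
        int lpIdx = titlePart.lastIndexOf('(');
        if (lpIdx == -1 || !titlePart.endsWith(")")) {
            return null;
        }
        String entity = titlePart.substring(0, lpIdx);
        String disambiguationText =
                titlePart.substring(lpIdx + 1, titlePart.length() - 1);
        return new String[] { entity, disambiguationText };
    }

    public static void main(String[] args) {
        String[] titles = {
            "Willen(Wittmund)",
            "Ytterbium(III)_chloride_(data_page)",
            "Ytterbium(III)" // hypothetical input, not taken from the report
        };
        for (String t : titles) {
            String[] p = split(t);
            System.out.println(t + " -> "
                    + (p == null ? "no split" : "[" + p[0] + "] [" + p[1] + "]"));
        }
    }
}
```

With this sketch, "Willen(Wittmund)" splits into "Willen" and "Wittmund". Note 
that for "Ytterbium(III)_chloride_(data_page)" the entity keeps its trailing 
underscore ("Ytterbium(III)_chloride_"), and that the hypothetical title 
"Ytterbium(III)" is split into "Ytterbium" and "III" even though "III" is 
presumably not a disambiguation text.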

GoogleCodeExporter commented 9 years ago
This has been fixed in Issue81 and should work with the current snapshot.
The fix will be included in the next release.
I have checked the examples you provided. They are all working with the current 
snapshot.

Original comment by oliver.ferschke on 16 May 2012 at 10:07

GoogleCodeExporter commented 9 years ago
The regular expression (.*?)[ _]\\((.+?)\\)$ cannot extract the 
disambiguation text from the second example, "Willen(Wittmund)", because it 
requires a space or underscore before the opening parenthesis.

Original comment by astronau...@gmail.com on 16 May 2012 at 11:37
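
The behavior can be reproduced in isolation with a small sketch that applies 
the regular expression quoted above to both examples from the report. The 
class name DisambiguationRegexDemo and the helper split() are hypothetical; 
only the pattern itself comes from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DisambiguationRegexDemo {
    // Pattern quoted in this thread: a space or underscore must precede "(".
    private static final Pattern DISAMBIGUATION =
            Pattern.compile("(.*?)[ _]\\((.+?)\\)$");

    /** Returns {entity, disambiguationText}, or null if the pattern does not match. */
    static String[] split(String title) {
        Matcher m = DISAMBIGUATION.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] titles = {
            "Ytterbium(III)_chloride_(data_page)",
            "Willen(Wittmund)"
        };
        for (String t : titles) {
            String[] p = split(t);
            System.out.println(t + " -> "
                    + (p == null ? "no match" : p[0] + " | " + p[1]));
        }
    }
}
```

The first title yields "Ytterbium(III)_chloride | data_page", while 
"Willen(Wittmund)" yields "no match", since it contains neither a space nor 
an underscore.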

GoogleCodeExporter commented 9 years ago
You're right. I'm reopening the issue.
I don't have a solution right now.
Your suggested fix breaks several other test cases, so we cannot use that 
directly.

Original comment by oliver.ferschke on 16 May 2012 at 2:20

GoogleCodeExporter commented 9 years ago
Are there actual articles whose disambiguation part follows directly, 
without a whitespace? As far as I can see, the example above is only a 
redirect.

Original comment by SamyAt...@googlemail.com on 18 May 2012 at 10:24

GoogleCodeExporter commented 9 years ago
Some other examples:

Best_Host_in_a_Variety_Programme(Golden_Bell_Awards)
Chhatrapati_Shivaji_Institute_of_Technology(CSIT)
Bite_the_bullet(album)
Dariyabad(Vidhan_Sabha_constituency)
Majhighariani_Institute_of_Technology_and_Science(MITS)
Midwest_Herald(Nigeria)
NILE(National_Institute_for_Lifelong_Education,_Korea)
NM_Institute_Of_Engineering_and_Technology(NMIET)
Narsimha(1991_film)
Nojom_Ajdabiya(Ajdabiya)
Second_Division_Men(Icelandic_Basketball)
Shrirampur(Rural)
Srimad_BhagavadGeeta_Tatparya(Jeevan_Dharma_Yoga)

Original comment by astronau...@gmail.com on 18 May 2012 at 11:42

GoogleCodeExporter commented 9 years ago
I put quite some thought into this issue over the weekend and had a talk with 
another developer.
Unfortunately, we could not find a good solution to the problem, as there are 
parenthetical expressions both with and without leading whitespace, and both 
with and without a disambiguation purpose. As far as I know, there is no 
bulletproof way to tell whether an expression should be treated as a 
disambiguation part or not.

I am closing this issue as a "WontFix", but I am still open to suggestions. 
Since we do not have a perfect solution and a change would probably just 
result in another imperfect solution, we might as well leave it unchanged.

So, again, good suggestions are welcome. But for now, we do not have a way to 
fix this.

Original comment by oliver.ferschke on 21 May 2012 at 8:41