hanwei2008 / jwpl

Automatically exported from code.google.com/p/jwpl

Add Method that returns all Outlink Anchors #85

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Currently there is a method in the de.tudarmstadt.ukp.wikipedia.api.Page.java 
class called "getOutlinkAnchors()" that "only returns the anchors that are not 
equal to the title of the page they are pointing to".

There should be another method that returns all outlink anchors, including the 
ones that are equal to the title of the page they are pointing to.

Why?
There are word-sense-disambiguation applications that need to know how often an 
anchor is used for a certain page. They use this probability as a feature for 
disambiguation algorithms and also as a baseline disambiguation, by choosing 
the most frequent sense for a word.
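
A minimal sketch of that feature, assuming (anchor, target) pairs have already been collected from the links of many pages. The class and method names here are hypothetical and not part of JWPL. It estimates how often an anchor points to a given page as a relative frequency and picks the most frequent target as the baseline sense:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JWPL: counts how often each anchor text
// is used for each target page.
class AnchorCommonness {
    // anchor text -> (target title -> number of links using that anchor for that target)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one link occurrence with the given anchor text and target title. */
    void addLink(String anchor, String target) {
        counts.computeIfAbsent(anchor, k -> new HashMap<>()).merge(target, 1, Integer::sum);
    }

    /** Relative frequency of target among all links with this anchor; 0.0 if the anchor is unseen. */
    double commonness(String anchor, String target) {
        Map<String, Integer> targets = counts.get(anchor);
        if (targets == null) {
            return 0.0;
        }
        int total = 0;
        for (int c : targets.values()) {
            total += c;
        }
        return targets.getOrDefault(target, 0) / (double) total;
    }

    /** Most-frequent-sense baseline: the target most often linked with this anchor, or null if unseen. */
    String mostFrequentSense(String anchor) {
        Map<String, Integer> targets = counts.get(anchor);
        if (targets == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : targets.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        AnchorCommonness stats = new AnchorCommonness();
        stats.addLink("Java", "Java (programming language)");
        stats.addLink("Java", "Java (programming language)");
        stats.addLink("Java", "Java"); // anchor equals the target title
        System.out.println(stats.commonness("Java", "Java"));
        System.out.println(stats.mostFrequentSense("Java"));
    }
}
```

The third link in the usage example has an anchor equal to its target title. If such links are dropped before counting, the estimate for that target falls to zero, which is why the counts need all anchors.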

example work:
http://www.cs.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningToLinkWithWikipedia.pdf
http://www.di.unipi.it/~ferragin/cikm2010.pdf
http://cogcomp.cs.illinois.edu/papers/RatinovDoRo.pdf

proposed method:

    public Map<String, Set<String>> getAllOutlinkAnchors()
        throws WikiTitleParsingException
    {
        // Maps each target page title to the set of anchor texts linking to it.
        Map<String, Set<String>> outAnchors = new HashMap<String, Set<String>>();
        ParsedPage pp = getParsedPage();
        if (pp == null) {
            return outAnchors;
        }
        for (Link l : pp.getLinks()) {
            // Skip links without a target.
            if (l.getTarget().length() == 0) {
                continue;
            }

            String targetTitle = new Title(l.getTarget()).getPlainTitle();
            if (!l.getType().equals(Link.type.EXTERNAL) && !l.getType().equals(Link.type.IMAGE)
                    && !l.getType().equals(Link.type.AUDIO) && !l.getType().equals(Link.type.VIDEO)
                    && !targetTitle.contains(":")) // skip titles containing a colon, which usually
                                                    // marks a category or other namespace prefix
            {
                // Unlike getOutlinkAnchors(), keep anchors that equal the target title.
                String anchorText = l.getText();
                Set<String> anchors = outAnchors.get(targetTitle);
                if (anchors == null) {
                    anchors = new HashSet<String>();
                    outAnchors.put(targetTitle, anchors);
                }
                anchors.add(anchorText);
            }
        }
        return outAnchors;
    }

Original issue reported on code.google.com by SamyAt...@googlemail.com on 5 Apr 2012 at 6:31

GoogleCodeExporter commented 9 years ago
Sorry for the delay. I will look into this shortly. Currently, I'm out of 
capacity and have to postpone the work on JWPL for 2 or 3 weeks.

Original comment by oliver.ferschke on 12 Apr 2012 at 7:09

GoogleCodeExporter commented 9 years ago
As we will not be developing the JWPL Parser any more, it has been moved into 
its own module. JWPL will now be using the Sweble parser (www.sweble.org). I am 
currently migrating the API methods that need Wiki markup parsing (like the 
anchor extractors) to the new parser. I will change the semantics of the anchor 
extraction methods so that they will return all anchors.

The old anchor extraction methods have been moved to 
de.tudarmstadt.ukp.wikipedia.parser.LinkAnchorExtractor in the parser module.
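
For callers that still want the old result on top of a method that returns all anchors, the following is a rough sketch of a post-filter. The class name is hypothetical and not part of JWPL, and it assumes "equal to the title" means exact string equality; the thread does not say how the old method compares anchors and titles (case, underscores), so treat that as an assumption:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of JWPL.
class AnchorFilter {
    /**
     * Returns a copy of allAnchors (target title -> anchor texts) without the
     * anchors that are exactly equal to their target title. Targets left with
     * no anchors are omitted. The input map is not modified.
     */
    static Map<String, Set<String>> withoutTitleAnchors(Map<String, Set<String>> allAnchors) {
        Map<String, Set<String>> filtered = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : allAnchors.entrySet()) {
            Set<String> kept = new HashSet<>(e.getValue());
            kept.remove(e.getKey()); // assumption: exact string equality
            if (!kept.isEmpty()) {
                filtered.put(e.getKey(), kept);
            }
        }
        return filtered;
    }
}
```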

Original comment by oliver.ferschke on 29 May 2012 at 10:21

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 29 May 2012 at 10:22