piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Add interlinks to segment_wiki #1712

Closed menshikh-iv closed 6 years ago

menshikh-iv commented 6 years ago

Idea

Users have asked for this feature; it is really useful to have interlinks in the dump, e.g. to construct a graph of articles or to use the relations between articles in other ways.

What needs to be implemented

Add a field "section_interlinks" (list of str) containing the titles of the articles referenced by this section.

napsternxg commented 6 years ago

Thanks for creating this issue. The suggested step is great, but to make it more consistent with the overall structure of the output, we should include not only the link text but also the title of the Wikipedia page it points to.

Another suggestion would be to include the span of the matched text, i.e. its begin and end offsets. This would give us a segmented corpus for free: the offsets can later be used to tokenize the section text with each link text treated as a single unit.

So the final format may look like:

"section_interlinks":  [
("link string", "link wikipage title", offset, end)
]
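
For concreteness, a full article record might then look something like this (a sketch only: the fields besides `section_interlinks` follow segment_wiki's existing JSON output, while the titles and offsets are invented for the example):

```python
# Illustrative article record under the proposed format (offsets index into
# the corresponding section text).
article = {
    "title": "Alan Turing",
    "section_titles": ["Introduction"],
    "section_texts": ["Alan Turing was a pioneer of computer science ..."],
    "section_interlinks": [
        [("computer science", "Computer science", 29, 45)],
    ],
}
```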
piskvorky commented 6 years ago

The corpora and code included with gensim are restricted to topic modelling and unsupervised text processing. We're not aiming to be "everything for everybody".

Including other types of information (supervised labels, graph structure) is possible but needs to be clearly motivated.

@napsternxg how would you use this extra information? What is the intended application?

napsternxg commented 6 years ago

@piskvorky I understand the requirement for gensim to stay focused on topic modelling and unsupervised text processing.

The major application area is making use of the multi-word units in Wikipedia, which are usually linked to other wiki pages, as components of topic models and other text processing. For example, simple tokenization will split phrases like "Barack Obama" or "Natural Language Processing" into separate words. Although the Phrases module supports extracting n-grams, a more principled approach when processing wiki pages is to identify these phrases as single concepts, which is very easy to do for Wikipedia. Mapping the link text to the wiki page it points to would also allow normalizing these phrases to a common concept: for example, "LDA" can mean either "Latent Dirichlet allocation" or "Linear Discriminant Analysis". This will help in reducing the vocabulary size.
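
To make that idea concrete, here is a minimal sketch of the normalization step, assuming we have the proposed (link text, target title) pairs for a section (the function and input shape below are illustrative, not part of segment_wiki):

```python
def normalize_interlinks(text, interlinks):
    """Replace each linked phrase with a single token derived from the
    target page title, e.g. "Barack Obama" -> "barack_obama"."""
    for link_text, page_title in interlinks:
        token = page_title.lower().replace(" ", "_")
        text = text.replace(link_text, token)
    return text

# normalize_interlinks("Barack Obama studied law", [("Barack Obama", "Barack Obama")])
# -> "barack_obama studied law"
```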

Finally, the motivation for including the offset and end values in the JSON data was to help override tokenization flaws, especially for biomedical and chemical names.
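
A rough sketch of how those spans could be used, assuming they arrive as non-overlapping, sorted (begin, end) character offsets into the section text (again, the helper below is hypothetical):

```python
def tokenize_with_spans(text, spans):
    """Whitespace-tokenize `text`, but keep each (begin, end) span as a
    single token so linked phrases survive tokenization intact."""
    tokens, cursor = [], 0
    for begin, end in spans:
        tokens.extend(text[cursor:begin].split())
        tokens.append(text[begin:end])
        cursor = end
    tokens.extend(text[cursor:].split())
    return tokens
```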

These were the use cases I had in mind. I would be happy to see this feature, since I have been quite impressed with the processing speed of the algorithms in gensim, and the Wikipedia dump parser appears to be very fast.

An alternative would be to use the segment_all_articles generator and add this feature as a post-processing step on the article_sections variable. However, this would require that article_sections contain the original markup rather than the filter_wiki plain text. https://github.com/RaRe-Technologies/gensim/blob/07c3130283a7512f74293a18eff4344cdbe85f94/gensim/scripts/segment_wiki.py#L83
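
For illustration, such a post-processing step could pull the interlinks out of the raw markup with something like the following (a sketch assuming access to the unfiltered section text, relying only on the standard [[Target|displayed text]] wiki link syntax):

```python
import re

# Wiki interlinks look like [[Target page]] or [[Target page|displayed text]].
INTERLINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_interlinks(raw_markup):
    """Return (link text, target title) pairs found in raw wiki markup."""
    links = []
    for match in INTERLINK_RE.finditer(raw_markup):
        target = match.group(1).strip()
        text = (match.group(2) or target).strip()
        links.append((text, target))
    return links
```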

menshikh-iv commented 6 years ago

Thank you @napsternxg! Maybe you could try implementing this feature yourself? That would be great!

napsternxg commented 6 years ago

I can have a look at it after December 15th. Will send a PR then.

steremma commented 6 years ago

Hey @napsternxg, I have been working on adding this feature; you can check the PR. At the moment the JSON output contains a list of all interlinks found in the article, rather than presenting the interlinks per section. Is there any reason why you would want to know which section an interlink came from? If yes, we can make the change (it won't be a huge modification); otherwise we can merge into develop.

@piskvorky any opinions?

piskvorky commented 6 years ago

Thank you for the explanation, that makes sense.

I don't think identifying an interlink's location down to the section level is critical. But the voice of people who actually use this feature is more important than mine -- do you think the section is important? What are the pros/cons?

napsternxg commented 6 years ago

@steremma this is great, thanks for adding it. My use case was being able to identify a multi-word unit in the text along with the wiki page it points to. I don't think the current approach can take care of this, as it removes that information and only retains the link target. If we also had the interlink text, paired with the wiki page it points to, that would help in training multi-word word vectors more effectively. That said, the current approach is also quite useful, as we can simply include the interwiki links as document tags and train document embeddings with that information.
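
As a sketch of that last idea, assuming each article arrives with its list of interlink target titles (the input shape below is hypothetical), the targets could be fed to gensim's Doc2Vec as extra tags:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def tagged_articles(articles):
    """`articles` yields (title, text, interlink_targets) triples."""
    for title, text, interlink_targets in articles:
        # Tag each document with its own title plus the titles it links to.
        yield TaggedDocument(words=text.split(), tags=[title] + list(interlink_targets))

# model = Doc2Vec(list(tagged_articles(corpus)), vector_size=100, epochs=10)
```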

menshikh-iv commented 6 years ago

@steremma is it possible to do what @napsternxg suggested?

steremma commented 6 years ago

I am manually checking sample wiki pages in our test set, and it appears that in most cases the link text is exactly the same as the title it points to. There are a few cases where the text is altered a little.

So adding this mapping would mostly produce output like "computer science": "computer science", "mathematics": "mathematics", ..., with occasional differences like "Android": "Android (operating system)".

Doing this would make a difference to my implementation, because I am currently using the filtered text to find the interlinks, and as @napsternxg mentioned, the exact article title is lost there. We would instead need to duplicate the filter_wiki logic, with a small change in one of the regular expressions used.

EDIT: It can be done easily by adding another boolean argument to filter_wiki. It will have a default value that keeps existing calls returning the same results, but when called with False it will leave the interlinks unmodified.
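
As a toy illustration of that flag (not the actual filter_wiki code, whose regular expressions are more involved), the idea is roughly:

```python
import re

# [[Target|text]] or [[Target]]; group 2 is the displayed text.
LINK_RE = re.compile(r"\[\[(?:([^\]|]+)\|)?([^\]|]+)\]\]")

def strip_links(text, simplify_links=True):
    """With simplify_links=True (matching current behaviour), collapse
    [[Target|text]] to just "text". With False, leave the link markup
    intact so the interlink targets can still be extracted later."""
    if not simplify_links:
        return text
    return LINK_RE.sub(lambda m: m.group(2), text)
```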

steremma commented 6 years ago

Done, please check the updated PR.