olavurmortensen opened this issue 8 years ago
I would like to second this request. I strongly prefer a different pre-processing library (spacy), and being forced to rely on the functions in gensim (which were mostly implemented by the original paper authors as part of their proof of concept) is slow and produces poor results when compared with better pre-processing.
Back when the paper came out and the module was first added to gensim I was able to use summarize_corpus with my own tokens without too much modification but this isn't as easy with the keywords function.
Working specifically with the keywords function, because that's what I'm interested in at this point:
I looked into the code a bit (not a lot of comments or docstrings in there, btw) and it looks like _build_graph just takes a list of strings, and I don't think that _set_graph_edges or any of the things it calls actually need the tokens dictionary. If the text is already pre-processed, there is no reason that there should be discrepancies between the tokens dictionary and the (already) split_text. If this is the case, there is no reason for _set_graph_edge to check tokens.
Ok, next step: extracted tokens/lemmas. First, why do the variable name and the function name denote two different things? In any case, from what I can tell, everything in keywords.py from lines 216-226 spends a lot of time creating a dictionary that maps word forms to the pagerank score of that word form. If that's right, then there doesn't seem to be much need for that either if the user does their own pre-processing. I'm not entirely sure this is what's going on because the code is a bit confusing (maybe it's just me?), but I'm pretty sure that's what's happening.
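If that reading is right, the indirection is roughly the following (my own reconstruction, not the actual code in keywords.py; all the names here are mine):

```python
# Rough reconstruction of the idea, not gensim's actual code.
# pagerank_scores: {lemma: score}, the output of pagerank over the lemma graph.
# lemma_to_forms: {lemma: [surface forms seen in the text]}.

def scores_for_word_forms(pagerank_scores, lemma_to_forms):
    """Give every surface form the pagerank score of its lemma."""
    form_scores = {}
    for lemma, forms in lemma_to_forms.items():
        for form in forms:
            form_scores[form] = pagerank_scores.get(lemma, 0.0)
    return form_scores

# If the user pre-processes the text themselves so that token == lemma,
# this collapses to the identity mapping and pagerank_scores can be used as-is.
```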
Now we're almost at the finish line, but get_combined_keywords also does some text processing (not sure why, as everything it's doing should already have been done earlier); it calls _strip_word. The rest of the functions seem to be centered around dealing with having multiple word forms for various tokens, which again isn't a problem if the user does their own pre-processing (with the understanding that they may lose some nuance).
To summarize, it may be possible to use your own pre-processing if you refactor everything in the call path of _set_graph_edges and then deal with the pagerank results yourself.
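To make that concrete, here is roughly the path I have in mind, as a self-contained sketch using networkx instead of gensim's internal graph class (none of these names are gensim's, and the windowed co-occurrence is just the standard TextRank construction):

```python
# A self-contained sketch of the keyword path with user-supplied tokens,
# using networkx instead of gensim's graph class. Names are mine, not gensim's.
import itertools

import networkx as nx


def keyword_scores(tokens, window_size=2, top_n=10):
    """tokens: an ordered list of pre-processed token strings."""
    graph = nx.Graph()
    graph.add_nodes_from(set(tokens))

    # Add an edge between every pair of tokens that co-occur within the window,
    # which is roughly what the edge-setting step does over the split text.
    for i in range(len(tokens) - window_size + 1):
        window = tokens[i:i + window_size]
        for a, b in itertools.combinations(window, 2):
            if a != b:
                graph.add_edge(a, b)

    # "Deal with the pagerank results yourself": scores is just {token: score}.
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Example with tokens produced by whatever pre-processing you prefer (spacy, etc.):
# keyword_scores(["graph", "rank", "keyword", "graph", "score", "rank"])
```

The point is that nothing in there needs the tokens dictionary; the scores come back keyed by exactly the tokens you passed in.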
Now that I've gone through all of this I might as well try it and see what happens.
-- Eric
@olavurmortensen I got the keywords function to work on an array of strings: https://gist.github.com/ELind77/c7c47d68ac7aa05f1493bc492de7dd5c The refactor isn't very pretty but it works. Hope this helps even though it's only half of what you were looking for.
@tmylk Any thoughts on what a real refactor should look like in terms of the code path and return values?
@ELind77 Actually just Spacy integration would be great and is easier than an abstract refactor. That is the best preprocessing library around and we should direct people to use it.
@tmylk I completely agree that spacy is amazing. But the gensim code still needs to be refactored to accept user pre-processed text, right?
Yes.
Or just preprocessing it with spacy inside gensim seems easier and more convenient.
As much as I do like spacy, in my opinion it wouldn't be addressing the key issue at hand, which is user choice. Maybe someone wants to use Stanford, or ClearNLP, or something none of us have ever heard of. If that's the case, I think gensim should allow it in the same way that it does for all of the other models. I think it also raises separation-of-concerns issues, with gensim relying on external libraries that are not core to the algorithm being implemented. At its core, what TextRank is doing is building a graph and running PageRank on it. The only input fundamentally required is an ordered list of tokens (they don't even need to be strings, they could be integer ids). In my opinion (for whatever it's worth), it's better to favor flexibility.
I think a good approach would be to keep the option of using the provided pre-processing methods for users who aren't interested in doing it themselves, while letting "power users" provide their own tokens, sentences, etc.
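To sketch what I mean by that dual interface (everything here is hypothetical; the parameter names and helper functions below don't exist in gensim, they are just placeholders):

```python
# Hypothetical interface sketch only -- none of these names exist in gensim today.
from collections import Counter


def _default_preprocess(text):
    # Stand-in for gensim's current pipeline (tokenize, lowercase, stem, ...).
    return text.lower().split()


def _rank_tokens(tokens, ratio):
    # Stand-in for the graph + pagerank step; here: just the most frequent tokens.
    n = max(1, int(len(set(tokens)) * ratio))
    return [word for word, _ in Counter(tokens).most_common(n)]


def keywords(text=None, tokens=None, ratio=0.2):
    """Extract keywords from raw text *or* from user-supplied tokens."""
    if (text is None) == (tokens is None):
        raise ValueError("pass exactly one of `text` or `tokens`")
    if tokens is None:
        tokens = _default_preprocess(text)    # default path, unchanged for casual users
    return _rank_tokens(tokens, ratio=ratio)  # power users skip straight to ranking
```

The default path stays exactly as it is today, while `tokens=` bypasses pre-processing entirely.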
Agree. It would be a welcome PR.
Cool!
I'd like to volunteer. I've never contributed to gensim before, though. I've never seen a contributors' guide anywhere, or instructions on running (and writing) tests; is there a resource for that? I seem to remember I used doctests to verify things when I modified the summarize_corpus function for my own use a while back, but I don't really remember any unit tests or anything.
Also, is there a project preference in terms of how the interface for a function that has this kind of dual functionality should look?
Great, thank you! I'll take a look at the guidelines and clone the project tonight and see what I can do. Not sure how long it will take, but I may have a little time over the Thanksgiving holiday to get some coding done.
I'll post here with any followup questions.
Nice, thanks for picking this up @ELind77!
Following this thread, would be nice to have the pre-processing modularised separately!
While spacy is a great library, I found the sentence segmentation code used by the summarizer to be superior overall for my use case: summarizing SEC filings. SEC filings are legal documents so the language is formal, but they also contain a lot of tables, forms, and lists which don't follow normal grammar rules. I have been amazed at how well sentences are segmented by gensim.
I guess what I am saying is that I agree with making the API more modular, but I would hate to see gensim's sentence segmentation implementation thrown away.
@ELind77 you still taking a stab at this? If not, I'd like to - I've some free time on my hands.
@bhargavvader I just started a new job and I need approval before I can commit to open-source projects. I'm not sure how long that will take. If you want to take this, please post progress here. Otherwise I will have to come back to it when I have time.
Cool, will open a PR soon then :)
My 2 cents: we definitely don't want to add a hard dependency on spaCy. It's compiled, slow to load, requires lots of memory and, like @ELind77 and @rpedela say, users may prefer other tools/approaches, for whatever reason.
Apologies for the poor code quality and design of this module (API, comments, documentation etc). It was a student contribution and perhaps our checks weren't as thorough as they should have been. @tmylk is being more careful with new code now!
@piskvorky I've settled down at my new job and gotten permission to start contributing again. If there is still a need for this I'll give it a shot next month. I should have something before the end of September.
@bhargavvader If you have a fork with this implemented please post here to let me know.
Awesome, welcome back :) CC @menshikh-iv
Sorry @ELind77 , been too busy to get back to this, but this is really a feature which would be great to have implemented some time soon...
Another similar request: it would be cool if we could plug in our own distance metrics in modules which use any kind of similarity/distance function, like the Similarity class. (ping @piskvorky, @menshikh-iv)
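For example, something along these lines (a purely hypothetical sketch, not gensim's actual Similarity API; the class and parameter names are made up) would cover it:

```python
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def manhattan_similarity(a, b):
    return 1.0 / (1.0 + float(np.abs(a - b).sum()))


class PluggableSimilarity:
    """Hypothetical: rank corpus vectors against a query with any metric."""

    def __init__(self, vectors, metric=cosine):
        self.vectors = vectors  # list of dense numpy vectors
        self.metric = metric    # any callable (vec, vec) -> float

    def most_similar(self, query, topn=5):
        scores = [self.metric(query, v) for v in self.vectors]
        order = np.argsort(scores)[::-1][:topn]
        return [(int(i), scores[i]) for i in order]
```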
I have managed to customize the summarization and keywords methods for German by writing my own pre-processing with spaCy and changing only a few lines of code in the summarizer and keywords methods. The results I have achieved are really satisfying.
If you want to do something similar, pay attention to the following: the summarizer method expects a list of SyntacticUnits (gensim.summarization.syntactic_unit), where each SyntacticUnit stands for a sentence. The keywords method expects a dictionary of SyntacticUnits, where each SyntacticUnit stands for a word of the text; the key of each SyntacticUnit is the corresponding pre-processed token of that word.
Each SyntacticUnit expects as input the original sentence or word and the pre-processed token set or token.
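To make that a bit more concrete, here is a rough sketch of the kind of pre-processing I mean (the spaCy model name, stop-word filtering, and lemma choices are my own; how the resulting units are fed into the modified summarizer/keywords methods depends on your local changes):

```python
import spacy
from gensim.summarization.syntactic_unit import SyntacticUnit

# Any German spaCy pipeline with sentence segmentation should do.
nlp = spacy.load("de_core_news_sm")


def preprocess(text):
    doc = nlp(text)

    # For the summarizer: one SyntacticUnit per sentence,
    # original sentence first, pre-processed token string second.
    sentences = [
        SyntacticUnit(
            sent.text,
            " ".join(t.lemma_.lower() for t in sent if t.is_alpha and not t.is_stop),
        )
        for sent in doc.sents
    ]

    # For keywords: a dict keyed by the pre-processed token,
    # each value a SyntacticUnit(original word, pre-processed token).
    tokens = {
        t.lemma_.lower(): SyntacticUnit(t.text, t.lemma_.lower())
        for t in doc
        if t.is_alpha and not t.is_stop
    }
    return sentences, tokens
```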
Hi bastij, I would be really interested in your solution. Would you be willing to share your code? Thanks in advance!
It is not possible, at the moment, to pre-process text yourself before applying the summarization module to it. Among other things, this means that you cannot summarize non-English text, even though the model itself is language independent.
As far as I can see, it is only this line that is language dependent; similarly, this line in the keywords module. I'm not familiar with the codebase, so it is well possible that it is more complicated than that. I suggest that someone (could be me) make this pre-processing optional, so you could either let the module pre-process the text as it does now, or pass in text that you have already pre-processed yourself (e.g. the output of _clean_text_by_word(text)).
This was brought to my attention on the blog, where someone wanted to summarize a Portuguese dataset.