Open vprelovac opened 4 years ago
Hi Vladimir, I think you know the code better than I do, because TextRank was not contributed by me; at least not the current implementation. But I will try to check the code and respond to your questions.
import re

WORDS = re.compile(r"[\w'-]+")
words = WORDS.findall(sentence)
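For reference, the proposed regex keeps hyphenated compounds and contractions intact (a standalone sketch; the example sentence is mine):

```python
import re

# The proposed replacement tokenizer: a plain regex, no NLTK.
WORDS = re.compile(r"[\w'-]+")

sentence = "Sugar-free data-mining isn't slow."
words = WORDS.findall(sentence)
print(words)  # ['Sugar-free', 'data-mining', "isn't", 'slow']
```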
(Snowball vs Porter stemmer) To be honest, I don't remember the decision. It was years ago. I don't even know if I tried both and picked the better one, or simply used the first one I saw in the documentation.
I barely remember; it's a mix of NLTK, wiki frequency words, and stopwords from other projects I was involved in. Sumy was my experiment in the early days, and I used what gave me better results. I started to make it more generic when more people "joined" the project on GitHub.
Yep, the sentences are separated by the correct end mark, not the newline, if that is what you mean.
note: Your tweaked version would leave lonely dashes floating.
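To illustrate that note (standalone, my own example string): a free-standing dash matches `[\w'-]+` on its own and comes through as a "word":

```python
import re

WORDS = re.compile(r"[\w'-]+")

# A dash set off by spaces matches the character class by itself,
# so it survives tokenization as a lonely token.
print(WORDS.findall("pros - and cons"))  # ['pros', '-', 'and', 'cons']
```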
Fair enough
Sharing my current stopWords=frozenset(['front', 'wednesday', 'whole', 'thin', "you're", 'appear', 'could', 'further', 'q', 'fri', 'willing', 'years', 'saturday', 'be', 'is', 's', 'various', 'example', 'your', "i'd", 'specifying', 'entirely', 'follows', 'therefore', 'asking', "we're", 'otherwise', 'newsinfo', 'doesn', 'becomes', 'ie', 't', 'inner', 'friday', 'ltd', 'however', 'different', 'herein', 'got', 'mightn', 'lately', "that'll", 'been', 'sometime', 'wherein', 'i', 'inquirer', 'no', 'along', '1', 'ever', 'hereupon', 'mean', 'value', 'described', 'via', '2', 'move', 'shouldn', 'december', 'five', 'anyone', "that's", 'sincere', 'toward', 'useful', 'had', 'normally', 'seems', 'am', 'allows', 'sent', 'april', 'instead', '5', 'yourselves', 'fifth', 'top', 'all', 'hasnt', 'inward', 'say', 'thickv', 'll', 'soon', 'weren', 'while', 'a', '10', 'might', 'sixty', 'anyways', "we've", 'please', 'little', 'least', 'definitely', 'eg', 'her', 'accordingly', 'hereafter', 'home', 'sun', 'y', 'seriously', 'whose', 'clearly', 'the', 'said', 'came', 'herself', 'stories', 'wouldn', 'ain', 'z', 'on', 'doing', 'until', 'except', 'anyhow', 'former', 'concerning', 'same', 'whereby', 'possible', 'going', 'still', "it's", "he's", 'keep', 'see', 'done', 'find', "c's", 'thus', 'indicates', 'ours', 'itself', 'thank', 'inc', 'lest', 'beyond', "wouldn't", 'currently', "we'd", 'himself', 'just', 'thu', 'although', 'consider', 'between', 'far', 'percent', 'o', 'will', 'looking', 'tries', "they're", 'okay', 'cannot', 'put', 'hundred', 'thereafter', 'mainly', 'ex', 'look', 'ten', 'allow', 'thanks', 'getting', 'much', "i've", 'gotten', 'my', 'plus', 'w', 'become', 'why', 'wants', 'after', 'zero', "when's", 'certain', 'unlikely', "how's", '0', 'photo', 'necessary', 'more', 'says', 'ma', 'whereas', 'so', 'whether', 'self', 'afterwards', 'rappler', 'yet', 'especially', 'wonder', "don't", '6', 'in', 'hopefully', 'having', "she'd", 'others', 'myself', 'often', 'tried', 'may', 'awfully', 'whoever', 
'does', 'own', 'anything', 'besides', 'gives', "shouldn't", 'c', 'reasonably', 'again', 'associated', 'best', 'tends', 'amount', "aren't", 'ye', 'pm', 'anyway', 'would', 'sorry', 'mine', 'reuters', 'everywhere', 'found', 'of', 'specify', "i'm", 'looks', "hadn't", 're', 'yung', 'able', 'last', "you've", 'few', 'something', 'tue', 'this', "you'd", 'empty', "isn't", 'must', 'either', 'considering', 'whereafter', "we'll", 'eleven', 'usually', 'time', "hasn't", 'our', 'greetings', 'since', 'you', 'thursday', 'particularly', 'gone', 'don', 'above', 'new', 'amongst', 'seen', 'up', 'consequently', 'many', 'needs', 'behind', 'has', 'couldn', 'contain', 'tell', 'under', 'twenty', 'use', 'well', 'following', 'sports', 'later', 'go', 'every', 'but', 'it', 'indeed', 'namely', 'not', "weren't", 'once', 'each', 'first', 'beside', 'hardly', 'did', 'thence', 'liked', 'sub', 'used', 'b', 'hi', 'think', 'maybe', "should've", 'ako', 'rather', 'eight', 'against', "haven't", 'hers', 'too', 'was', 'beforehand', 'rapplercom', 'right', 'vs', 'seem', 'unto', 'sat', 'seemed', 'then', 'welcome', 'when', 'part', 'serious', 'can', 'sup', 'here', 'wherever', 'saying', 'ang', 'second', 'alone', 'another', 'with', 'co', 'according', 'ask', 'nowhere', 'wed', 'despite', 'particular', 'by', 'nothing', 'year', 'qv', 'regarding', 'nd', 'his', 'january', 'side', 'section', 'tuesday', 'never', 'both', 'indicated', "here's", 'quite', 'k', 'full', "couldn't", 'february', 'aren', 'somewhere', 'available', 'yes', 'into', 'per', 'g', "they've", 'thats', 'n', 'than', 'sometimes', 'uucp', 'always', 'back', 'get', 'merely', 'nobody', 'october', 'yourself', 'followed', 'specified', 'even', 'for', 'nor', 'shall', 'rd', 'whence', 'somebody', 'howbeit', 'f', 'news', 'down', 'july', "let's", 'third', 'yours', 'fifteen', 'hadn', 'seeming', '3', 'bottom', 'v', 'saw', 'contains', 'immediate', 'now', 'trying', 'though', 'march', 'story', 'certainly', 'mon', "why's", 'tweet', 'placed', 'latterly', 'monday', 'try', 
'haven', 'made', 'changes', 'those', 'latter', 'enough', 'noone', 'together', 'viz', 'someone', 'september', "where's", 'onto', 'make', 'were', 'elsewhere', 'do', 'thorough', 'overall', "he'd", 'thereupon', 'non', 'gets', 'containing', 'he', 'most', 'downwards', 'kept', 'everybody', "shan't", 'towards', 'happens', 'cant', 'already', 'how', 'un', 'using', 'sure', 'nine', 'meanwhile', "didn't", 'great', 'selves', 've', 'because', 'outside', 'some', 'there', 'four', 'amoungst', 'from', 'take', 'way', 'detail', 'throughout', 'moreover', 'anywhere', "i'll", 'among', 'oh', 'actually', 'isn', 'l', 'comes', 'six', 'wasn', 'an', 'ourselves', 'them', 'over', 'wish', "what's", 'only', 'keeps', 'being', 'upon', 'regardless', 'm', 'didn', 'd', 'several', 'else', "they'll", 'describe', 'novel', 'e', 'better', 'that', 'exactly', 'who', 'people', 'want', 'none', 'course', 'june', 'without', 'me', 'sensible', 'sa', 'nevertheless', 'very', 'unless', 'presumably', 'needn', 'about', 'let', 'somewhat', 'whenever', 'indicate', 'such', 'mill', 'shan', 'before', '2012', 'ok', 'during', 'yun', 'us', 'due', 'come', 'que', 'appreciate', 'fire', 'themselves', 'within', 'insofar', 'name', 'everyone', 'are', 'forth', 'at', 'ones', 'believe', 'brief', 'secondly', 'th', 'everything', 'also', 'thanx', 'next', 'if', 'away', 'somehow', 'furthermore', 'seven', 'mostly', 'help', "it'll", "doesn't", 'took', 'perhaps', 'neither', 'what', "there's", "t's", 'less', 'apart', 'hereby', 'as', 'they', 'thereby', "needn't", 'should', 'other', 'near', 'went', 'hither', 'inasmuch', 'provides', 'cause', 'forty', 'de', "he'll", "wasn't", 'and', 'p', 'x', '9', 'anybody', "it'd", 'yahoo', 'corresponding', 'around', 'one', 'truly', 'hasn', 'formerly', 'out', 'hello', "mightn't", 'off', 'three', 'twelve', 'ought', 'she', 'which', 'theres', 'won', 'thoroughly', 'two', 'whither', 'causes', '8', 'became', 'call', 'u', 'mustn', 'any', 'h', 'need', 'becoming', 'homepage', 'fifty', "a's", 'almost', 'or', 'known', 'really', 
'taken', 'edu', 'likely', 'where', 'we', 'have', "mustn't", 'given', 'ignored', 'nearly', 'uses', 'show', "she'll", 'ko', 'hence', "can't", 'unfortunately', 'november', 'respectively', 'j', 'r', "ain't", 'relatively', 'probably', 'et', 'theirs', "she's", 'fill', 'august', "won't", 'these', "c'mon", 'sunday', 'through', 'him', 'etc', 'regards', "who's", 'whom', 'thru', 'com', 'appropriate', 'knows', 'know', 'seeing', 'goes', 'below', "they'd", 'whereupon', 'na', 'con', "you'll", 'aside', 'old', '4', 'twice', 'across', 'give', 'obviously', 'its', '2013', 'therein', '7', 'ng', 'whatever', 'like', 'to', 'their'])
Yes, but that is wrong, as these are clearly four sentences.
Thanks!
1 - It's not completely true. Sumy uses nltk.word_tokenize, and the regex is used only to filter some words out. You are right that it probably should not filter out words with - or ', but your version removes NLTK completely and relies only on the regex, and I am not sure that is OK for me. Especially when it's not hard to implement and use a custom tokenizer with Sumy. Anyway, thanks for explaining why you decided to go with the more complicated regex :)
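For what it's worth, a minimal duck-typed tokenizer could look like the sketch below. It assumes Sumy's parsers only ever call to_sentences() and to_words() on the tokenizer object (based on the Tokenizer interface; verify against your Sumy version), and the regexes are illustrative, not Sumy's:

```python
import re

class RegexTokenizer:
    """Sketch of a custom tokenizer. Any object providing to_sentences()
    and to_words() can be passed where Sumy expects a Tokenizer
    (assumption about the interface; check your Sumy version)."""

    _SENTENCES = re.compile(r"(?<=[.!?])\s+")  # split after end marks
    _WORDS = re.compile(r"[\w'-]+")            # keep hyphens/apostrophes

    def to_sentences(self, paragraph):
        return tuple(s for s in self._SENTENCES.split(paragraph.strip()) if s)

    def to_words(self, sentence):
        return tuple(self._WORDS.findall(sentence))

# Hypothetical wiring, e.g.:
# parser = PlaintextParser.from_string(text, RegexTokenizer())
```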
3 - Yep, you can use these or any others. That's why I left Sumy open for custom components.
4 - Yes, it is, as far as I can see. Unfortunately, NLTK couldn't detect it. If you have a better implementation of a Python sentence tokenizer, I will be happy to test it and replace NLTK in Sumy with it 👍
Hey Mišo
I spent a lot of time on TextRank, and while digging deeper into Sumy I want to ask you a few clarifying questions about some of the choices you made. This is all for the English language.
1)
_WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)
Used with word_tokenize() to filter out "non-word" tokens. The problem is it "kills" words like "data-mining" or "sugar-free". Also, word_tokenize is very slow. Here is an alternative (the WORDS regex shared above) to consider as a replacement for both.
2) What made you choose the Snowball vs the Porter stemmer?
Snowball: DVDs -> dvds
Porter: DVDs -> dvd
I don't have a particular opinion, just wondering how you made the decision.
3) How did you come up with your stopwords (for English)? It is very different than the NLTK defaults, for example.
4) The heuristics in the plaintext parser are interesting.
In this example of text extracted from https://www.karoly.io/amazon-lightsail-review-2018/
This ends up as two sentences instead of four.