Open marco-c opened 5 years ago
@marco-c and @suhaibmujahid I'm picking this up. How do we decide which words should or shouldn't be part of the stopwords?
Feel free to research the way that you see suitable. The following are just suggestions:
Hi @suhaibmujahid, after spending some time calculating the reverted IDF, I found some domain-specific words that can be added to the stop words. Below are the 200 most frequent words, which include both domain-specific and non-domain-specific words, with IDF ranging from 1.323716 (lowest) to 3.794723.
['js', 'profile', 'layout', 'don', 'try', '22', 'then', 'work', 'src', 'click', '17', 'remove', 'crash', 'enabled', 'hex', 'application', 'failed', 'base', 'ok', 'x86', 'does', 'make', 'same', '36', '2021', 'issue', 'here', '18', 'could', 'upstream', 'win64', 'finished', 'result', 'need', 'would', 'other', 'update', 'http', '13', 'messages', 'time', '20', 'found', '16', 'any', 'line', 'x64', 'thread', 'name', 'worker', 'wpt', '14', 'took', 'pull', 'warning', 'data', 'pr', 'you', 'tab', 'set', 'show', 'true', 'message', 'follow', 'do', 'linux', 'nt', 'complete', 'dom', 'more', 'builds', 'pass', 'url', 'sync', '64', 'also', '12', 'details', 'main', 'up', 'get', '15', 'number', '11', 'autoland', 'logviewer', 'run', 'using', 'window', 'closed', 'some', 'which', 'add', 'there', 'one', 'out', 'after', 'has', '20100101', 'page', 'process', 'chrome', 'fail', 'attachment', 'content', 'like', 'open', 'about', 'github', 'code', 'see', 'platform', 'so', 'into', 'all', 'use', 'was', 'tc', 'rv', 'only', 'will', 'unexpected', 'parsed', 'build', 'web', 'filed', 'new', 'ci', 'backing', 'start', 'error', 'created', 'queue', 'v1', 'artifacts', 'logs', 'live', 'services', 'runs', 'intermittent', 'public', 'an', 'full', 'windows', 'agent', 'job', 'info', 'api', 'no', 'task', 'browser', 'repo', 'log', 'treeherder', 'if', 'but', 'or', 'have', 'are', 'can', 'results', 'tests', 'user', 'actual', 'id', 'central', 'str', 'we', '10', 'as', 'test', 'gecko', 'expected', 'when', 'at', 'should', 'by', 'that', 'it', 'from', 'bug', 'be', 'not', 'with', 'reference', 'com', 'org', 'file', 'on', 'firefox', 'of', 'for', 'and', 'this', 'https', 'is', 'in', 'mozilla', 'to', 'the']
The following 200 words appear least frequently, with IDF ranging from 13.163092 (highest) to 12.469945:
['b', 'h', 't', '3', 'e', 'i', 'v', '5', 'u', 'l', 'z', 'k', 'g', 'o', 'w', 'y', '1', 'x', 'd', '0', '8', 'c', 'j', 's', 'q', '7', '9', '4', 'a', 'm', 'r', '6', 'f', 'n', 'p', '2', '1647451970689', 'disksizegb', '0y9sk1q1wfj5d06y0nea', 'disktype', 'disktypes', 'machinetype', 'networkinterfaces', 'cascadelake', 'accessconfigs', 'automaticrestart', 'onhostmaintenance', '7fd68c6f9600', 'devicemanagement', 'mincpuplatform', 'l317', 'accumulatesamples', '0df1eb2b6e985b623152060139f4ebb701dfd021', '371282010', 'tmpbym6cpgdpidlog', 'riu', '1647452067259', '1647452062779', '9268052', 'cabos', '7fd689faab00', 'palace', '9268056', 'tmpi4547b1x', 'microwaves', '1wmroxtrfgpv5vzjrbiwg4oxhdgbrihle', 'saples', '7fd689face00', 'hostsharedmemory', '27prompt', 'afteridleseconds', '9268080', '1647451990785', 'tmpvv3u0s0e', 'a6b5b3dfefb10fb23e5c9a9a1340582572b3ac6a', '390752', '0d20df2b', 'ae9a90220315', '56128', '1750623', '371278818', 'lytmovgessw9a8vgsndshw', 'integrators', 'mcookie', 'onnotificationclick', 'fuzzlesoft', '9268113', '371280640', 'gk8lcv', 'utpsickamu0j5tw', 'cf9488f918cc', 'accumulatepageloadtelemetry', '18db', 'tmpwf49ouft', '7fd691df8e00', '1647452082482', 'alloow', '9268086', '9268089', 'testfactoryimpl', 'goodsurrogatepair', '1647452062500', '36356', '1647451966575', '371271074', 'tmpdt4nmm2s', '1647451959871', 'du9', 'zrvtrrmlqyvmxfb1tq', '164744508432201', 'nkom2sdltdynf7oqlzs7fg', '7b401dab', '180946ms', '371238201', '4fd3', 'ea3703f63c4c', '371218636', 'adkxzbq5rdiriknoielatw', '6641mb', '12533ms', '178156ms', '600000023841858', '371216015', 'zrpgvls3soybwe2pagxe1q', '1757431', '371218855', 't8yeaqgzqu6msadcvrxzeq', '371216458', 'l1vf6z1bqgu', 'caqfkfkbnq', '1647407986', '716168', '002583', '008009', '7d905163', '6p6stqs8', '1917ms', 'zq83r5wnquqsxdwzhm40lq', '1647395176452', '1647395176454', '1647395176463', '1647395176949', '1647395176953', '1647395176955', '1647395176957', '1647395176960', '1647395176', '005426', '1647395176961', '6nhfcjjtbolo', 
'1647395176965', '1712170', 'd116863', '371210870', 'fwyz3glws7svhyrqpu43xw', '20220316041325', 'a2742e46efb56e290f294f13ab8aaa1a5fe666c4', '371212009', '7pz91hcrlqhl', '055703', '047694', '269519', 'kolejowe', 'opowiesci', 'koko', 'slamazara', 'odc', '58403203', 'chuggington', 's06e01', 'dubbing', 'stacyjkowo', 'gemius', 'aed0e880', 'f307', '2d604bfa5e95', '1753949', 'd138263', 'ufwhr5g3qfg', 'zmp0bi9rs0ss', 'uifcj2n', 'c1edbbf2', 'adocean', '4654c3f89c99', 'yctzjbf2q0y4nreyc2t8ig', 'wjb8tsxkiq3pfhzryrw', '71b74640', '95dd', '61c07f58be6e', 'tmp27mol3f1', '0692f741', '1d88', '861735783b6e', 'rtx3090', 'bc104f15a48be709dca542a7e1d4b9df6f054527e802eaa92d595444258afe71', '371235817', '1759817', '20220315091352', '9267999', 'es2022']
Some of them have been filtered out in the similarity script logic either because they fall into the NLTK stopwords or they have one character.
N.B.: This analysis is based on the first comments only. I can extend it to all comments if needed.
Attached is a file containing each word with its IDF and reverted IDF: bugs_idf.zip
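For anyone reproducing the ranking above, here is a minimal sketch of computing per-word IDF over a set of first comments. The thread does not say which tokenizer or IDF formula was used, so both are assumptions: `tokenize` is a hypothetical helper, and the smoothed formula `ln((1 + n) / (1 + df)) + 1` is the one scikit-learn applies by default. The sample comments are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def compute_idf(documents):
    """Return {token: idf} using a smoothed IDF (an assumed formula)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1
        for token, df in doc_freq.items()
    }


# Made-up first comments, for illustration only.
first_comments = [
    "Firefox crashes when opening the profile page",
    "The layout test fails on Firefox for Linux",
    "Intermittent failure in the WebRender test",
]

idf = compute_idf(first_comments)
# Lowest IDF first: the most widespread words are the stopword candidates.
ranked = sorted(idf, key=idf.get)
```

Words at the front of `ranked` appear in the most documents, and words at the end appear in the fewest.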
Thank you @Amotul-raheem ! Very nice work 👍
> Some of them have been filtered out in the similarity script logic either because they fall into the NLTK stopwords or they have one character.
Could you please remove these cases?
> N.B: This analysis is based on the first comments only. I can extend to all comments if needed
This should be enough for now. You can try if you want and see if the results show a different perspective.
The next step will be to select the best candidates for these lists. Also, we want to check if there is a need to drop some NLTK stopwords.
@suhaibmujahid Thanks, I'll remove the words that already exist in the NLTK stopwords, as well as the single characters. Also, I forgot to respond about dropping some of the NLTK stopwords. After looking at the words (179 in total), I think they are fine; I can't see anything that needs to be removed. That being said, I'll continue working on selecting the best candidates for the stopwords.
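The cleanup described here could look roughly like the sketch below. `filter_candidates` is a hypothetical helper, and `existing_stopwords` is a small placeholder set rather than the real NLTK list; in practice that set would presumably come from `nltk.corpus.stopwords.words("english")`, which needs the NLTK stopwords corpus to be downloaded first.

```python
def filter_candidates(candidates, existing_stopwords):
    """Drop single-character tokens, words already in the stopword
    list, and duplicates, while preserving the input order."""
    existing = {w.lower() for w in existing_stopwords}
    seen = set()
    kept = []
    for word in candidates:
        w = word.lower()
        if len(w) <= 1 or w in existing or w in seen:
            continue
        seen.add(w)
        kept.append(w)
    return kept


# Placeholder subset standing in for the NLTK English stopwords (assumption).
existing_stopwords = {"the", "to", "in", "is", "and", "don", "then"}

# A few tokens taken from the lists in this thread.
candidates = ["js", "profile", "don", "then", "b", "firefox", "the", "js"]

filtered = filter_candidates(candidates, existing_stopwords)
```

Keeping the input order means the result stays sorted by IDF if the candidates were.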
Thanks!
@suhaibmujahid
After narrowing down the top 1000 most frequent words (i.e., those with low IDF), these are some words that I think would be good candidates for the stopwords:
'rsi', 'rdx', 'amd64', 'rdi', 'rcx', 'r13', 'r15', 'rax', 'r14', 'rbx', 'rbp', 'r10', 'rsp', 'r12', 'r9', 'var', 'webrtc', 'r11', 'nsithread', 'r8', 'iframe', 'wiki', 'plugin', 'login', 'opt', 'e10s', 'links', 'libxul', 'tree', 'node', 'exe', 'dmp', 'async', 'mach', 'blobber', 'docshell', 'xre', 'gre', 'init', 'dist', 'crashreporter', 'pushloghtml', 'mochikit', 'though', 'appdata', 'every', 'es', 'geckoview', 'temp', 'messageloop', 'etc', 'recv', 'python', 'args', 'addon', 'much', 'website', 'bin', 'self', 'mozrunner', 'around', 'many', 'mochitest', 'char', 'core', 'mochitests', 'cgi', 'nsthread', 'obj', 'addons', 'either', 'string', 'gpu', 'maybe', 'bool', 'simpletest', 'io', 'taskcluster', 'int', 'patch', 'xul', 'macintosh', 'webrender', 'however', 'let', 'gfx', 'ui', 'lib', 'residentfast', 'vsize', 'bugzilla', 'bit', 'might', 'void', 'tmp', 'const', 'bugs', 'ubuntu', 'reftest', 'google', 'macos', 'net', 'applewebkit', 'khtml', 'devtools', 'cc', '0a1', 'xhtml', 'xpcom', 'dll', 'en', 'marionette', 'css', 'toolkit', 'mac', 'android', 'instead', 'safari', 'www', 'javascript', 'moz', 'x11', 'ipc', 'searchfox', 'ns', 'html', 'pid', 'os', 'console', 'hg', 'chromium', 'js', 'hex', 'application', 'x86', 'issue', 'could', 'win64', 'would', 'http', 'x64', 'wpt', 'pr', 'linux', 'nt', 'dom', 'builds', 'url', 'sync', 'also', 'autoland', 'logviewer', 'window', 'chrome', 'github', 'code', 'tc', 'rv', 'web', 'ci', 'v1', 'logs', 'windows', 'agent', 'api', 'browser', 'repo', 'log', 'treeherder', 'id', 'str', 'gecko', 'bug', 'com', 'org', 'firefox', 'https', 'mozilla'
Also attached below is the file containing the 1000 least frequent words (i.e., those with high IDF).
Please let me know what you think.
Thanks
We should check what stopwords NLTK is using, see if some of them are actually meaningful for us, and add new ones (e.g. "Firefox" could probably be considered as a stopword for us, since it's everywhere :P).