mozilla / bugbug

Platform for Machine Learning projects on Software Engineering
Mozilla Public License 2.0

Improve stopword removal in the similarity script #710

Open marco-c opened 5 years ago

marco-c commented 5 years ago

We should check what stopwords nltk is using, see if some of them are actually meaningful for us, and add new ones (e.g. "Firefox" could probably be considered as a stopword for us, since it's everywhere :P).
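To illustrate why an everywhere-word such as "Firefox" can distort similarity, here is a minimal sketch. It assumes a plain Jaccard overlap of token sets, which is not necessarily what the similarity script uses, and `BASE` and `DOMAIN` are small hypothetical stand-ins for the NLTK list and a custom list.

```python
import re

# Hypothetical stand-ins: BASE mimics a generic stopword list,
# DOMAIN holds project-specific additions such as "firefox".
BASE = {"the", "in", "on", "when", "a"}
DOMAIN = {"firefox"}


def tokens(text, stopwords):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords}


def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


bug_a = "Firefox crashes when printing"
bug_b = "Firefox freezes on startup"

# With only the generic list, the shared word "firefox" makes two
# unrelated bugs look somewhat similar.
before = jaccard(tokens(bug_a, BASE), tokens(bug_b, BASE))

# Treating "firefox" as a stopword removes that spurious overlap.
after = jaccard(tokens(bug_a, BASE | DOMAIN), tokens(bug_b, BASE | DOMAIN))
```

In this toy case the overlap comes entirely from the product name, so it disappears once the name is treated as a stopword.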

Amotul-raheem commented 2 years ago

@marco-c and @suhaibmujahid I'm picking this up. How do we decide which words should or shouldn't be part of the stopwords?

suhaibmujahid commented 2 years ago

Feel free to approach the research in whatever way you see fit. The following are just suggestions:

Amotul-raheem commented 2 years ago

Hi @suhaibmujahid, after spending some time calculating the reverted IDF, I found some domain-specific words that could be added to the stopwords. Below are the 200 most frequent words, both domain-specific and general, with IDF ranging from 1.323716 (lowest) to 3.794723.

['js', 'profile', 'layout', 'don', 'try', '22', 'then', 'work', 'src', 'click', '17', 'remove', 'crash', 'enabled', 'hex', 'application', 'failed', 'base', 'ok', 'x86', 'does', 'make', 'same', '36', '2021', 'issue', 'here', '18', 'could', 'upstream', 'win64', 'finished', 'result', 'need', 'would', 'other', 'update', 'http', '13', 'messages', 'time', '20', 'found', '16', 'any', 'line', 'x64', 'thread', 'name', 'worker', 'wpt', '14', 'took', 'pull', 'warning', 'data', 'pr', 'you', 'tab', 'set', 'show', 'true', 'message', 'follow', 'do', 'linux', 'nt', 'complete', 'dom', 'more', 'builds', 'pass', 'url', 'sync', '64', 'also', '12', 'details', 'main', 'up', 'get', '15', 'number', '11', 'autoland', 'logviewer', 'run', 'using', 'window', 'closed', 'some', 'which', 'add', 'there', 'one', 'out', 'after', 'has', '20100101', 'page', 'process', 'chrome', 'fail', 'attachment', 'content', 'like', 'open', 'about', 'github', 'code', 'see', 'platform', 'so', 'into', 'all', 'use', 'was', 'tc', 'rv', 'only', 'will', 'unexpected', 'parsed', 'build', 'web', 'filed', 'new', 'ci', 'backing', 'start', 'error', 'created', 'queue', 'v1', 'artifacts', 'logs', 'live', 'services', 'runs', 'intermittent', 'public', 'an', 'full', 'windows', 'agent', 'job', 'info', 'api', 'no', 'task', 'browser', 'repo', 'log', 'treeherder', 'if', 'but', 'or', 'have', 'are', 'can', 'results', 'tests', 'user', 'actual', 'id', 'central', 'str', 'we', '10', 'as', 'test', 'gecko', 'expected', 'when', 'at', 'should', 'by', 'that', 'it', 'from', 'bug', 'be', 'not', 'with', 'reference', 'com', 'org', 'file', 'on', 'firefox', 'of', 'for', 'and', 'this', 'https', 'is', 'in', 'mozilla', 'to', 'the']

The following 200 words appear least frequently, with IDF ranging from 13.163092 (highest) down to 12.469945.

['b', 'h', 't', '3', 'e', 'i', 'v', '5', 'u', 'l', 'z', 'k', 'g', 'o', 'w', 'y', '1', 'x', 'd', '0', '8', 'c', 'j', 's', 'q', '7', '9', '4', 'a', 'm', 'r', '6', 'f', 'n', 'p', '2', '1647451970689', 'disksizegb', '0y9sk1q1wfj5d06y0nea', 'disktype', 'disktypes', 'machinetype', 'networkinterfaces', 'cascadelake', 'accessconfigs', 'automaticrestart', 'onhostmaintenance', '7fd68c6f9600', 'devicemanagement', 'mincpuplatform', 'l317', 'accumulatesamples', '0df1eb2b6e985b623152060139f4ebb701dfd021', '371282010', 'tmpbym6cpgdpidlog', 'riu', '1647452067259', '1647452062779', '9268052', 'cabos', '7fd689faab00', 'palace', '9268056', 'tmpi4547b1x', 'microwaves', '1wmroxtrfgpv5vzjrbiwg4oxhdgbrihle', 'saples', '7fd689face00', 'hostsharedmemory', '27prompt', 'afteridleseconds', '9268080', '1647451990785', 'tmpvv3u0s0e', 'a6b5b3dfefb10fb23e5c9a9a1340582572b3ac6a', '390752', '0d20df2b', 'ae9a90220315', '56128', '1750623', '371278818', 'lytmovgessw9a8vgsndshw', 'integrators', 'mcookie', 'onnotificationclick', 'fuzzlesoft', '9268113', '371280640', 'gk8lcv', 'utpsickamu0j5tw', 'cf9488f918cc', 'accumulatepageloadtelemetry', '18db', 'tmpwf49ouft', '7fd691df8e00', '1647452082482', 'alloow', '9268086', '9268089', 'testfactoryimpl', 'goodsurrogatepair', '1647452062500', '36356', '1647451966575', '371271074', 'tmpdt4nmm2s', '1647451959871', 'du9', 'zrvtrrmlqyvmxfb1tq', '164744508432201', 'nkom2sdltdynf7oqlzs7fg', '7b401dab', '180946ms', '371238201', '4fd3', 'ea3703f63c4c', '371218636', 'adkxzbq5rdiriknoielatw', '6641mb', '12533ms', '178156ms', '600000023841858', '371216015', 'zrpgvls3soybwe2pagxe1q', '1757431', '371218855', 't8yeaqgzqu6msadcvrxzeq', '371216458', 'l1vf6z1bqgu', 'caqfkfkbnq', '1647407986', '716168', '002583', '008009', '7d905163', '6p6stqs8', '1917ms', 'zq83r5wnquqsxdwzhm40lq', '1647395176452', '1647395176454', '1647395176463', '1647395176949', '1647395176953', '1647395176955', '1647395176957', '1647395176960', '1647395176', '005426', '1647395176961', '6nhfcjjtbolo', 
'1647395176965', '1712170', 'd116863', '371210870', 'fwyz3glws7svhyrqpu43xw', '20220316041325', 'a2742e46efb56e290f294f13ab8aaa1a5fe666c4', '371212009', '7pz91hcrlqhl', '055703', '047694', '269519', 'kolejowe', 'opowiesci', 'koko', 'slamazara', 'odc', '58403203', 'chuggington', 's06e01', 'dubbing', 'stacyjkowo', 'gemius', 'aed0e880', 'f307', '2d604bfa5e95', '1753949', 'd138263', 'ufwhr5g3qfg', 'zmp0bi9rs0ss', 'uifcj2n', 'c1edbbf2', 'adocean', '4654c3f89c99', 'yctzjbf2q0y4nreyc2t8ig', 'wjb8tsxkiq3pfhzryrw', '71b74640', '95dd', '61c07f58be6e', 'tmp27mol3f1', '0692f741', '1d88', '861735783b6e', 'rtx3090', 'bc104f15a48be709dca542a7e1d4b9df6f054527e802eaa92d595444258afe71', '371235817', '1759817', '20220315091352', '9267999', 'es2022']

Some of them have been filtered out in the similarity script logic either because they fall into the NLTK stopwords or they have one character.

N.B: This analysis is based on the first comments only. I can extend to all comments if needed

Attached is a file containing each word with its IDF and reverted IDF: bugs_idf.zip
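For readers who want to reproduce this kind of ranking, below is a minimal sketch of a document-frequency-based IDF over a list of texts. The exact formula used in the analysis above is not stated in the thread; this sketch assumes the smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and the tokenizer and sample texts are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return TOKEN_RE.findall(text.lower())


def compute_idf(documents):
    """Return {term: idf} using the smoothed formula ln((1+n)/(1+df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def rank_by_idf(idf):
    """Terms sorted from lowest IDF (most widespread) to highest."""
    return sorted(idf, key=lambda term: (idf[term], term))


# Hypothetical first comments, standing in for real bug data.
docs = [
    "Firefox crashes when opening a tab",
    "Firefox fails to sync bookmarks",
    "Intermittent test failure in Firefox layout",
]
idf = compute_idf(docs)
ranked = rank_by_idf(idf)
```

Terms at the start of `ranked` occur in the most documents and are the natural stopword candidates; terms at the end are mostly one-off identifiers, as in the second list above.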

suhaibmujahid commented 2 years ago

Thank you @Amotul-raheem ! Very nice work 👍

> Some of them have been filtered out in the similarity script logic either because they fall into the NLTK stopwords or they have one character.

Could you please remove these cases?

> N.B: This analysis is based on the first comments only. I can extend to all comments if needed

This should be enough for now. You can try it if you want and see whether the results offer a different perspective.

The next step will be to select the best candidates for these lists. Also, we want to check if there is a need to drop some NLTK stopwords.

Amotul-raheem commented 2 years ago

@suhaibmujahid Thanks, I'll remove the words that already exist in the NLTK stopwords and the single characters. Also, I forgot to respond about dropping some of the NLTK stopwords. After looking at the list (179 words), I think they are fine; I can't see anything that needs to be removed. That being said, I'll continue working on selecting the best candidates for the stopwords.

Thanks!
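The cleanup described here (dropping single characters and words already covered by the existing stopword list) could be sketched as follows. The function name `filter_candidates` and the `EXISTING` set are hypothetical; `EXISTING` is a tiny stand-in for the real NLTK list, which would be loaded separately.

```python
def filter_candidates(candidates, existing_stopwords, min_length=2):
    """Keep candidates that are new, long enough, and not duplicated.

    Order of first appearance is preserved so an IDF ranking survives.
    """
    seen = set()
    kept = []
    for word in candidates:
        w = word.lower()
        if len(w) < min_length:
            continue  # single characters are already filtered by the script
        if w in existing_stopwords or w in seen:
            continue  # already a stopword, or already kept
        seen.add(w)
        kept.append(w)
    return kept


# Stand-in subset of a generic stopword list (assumption, not the NLTK data).
EXISTING = {"the", "to", "in", "is", "and", "then"}

candidates = ["js", "profile", "then", "the", "b", "firefox", "js"]
filtered = filter_candidates(candidates, EXISTING)
```

Here `filtered` keeps only `js`, `profile`, and `firefox`, in their original order.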

Amotul-raheem commented 2 years ago

@suhaibmujahid

After narrowing down the top 1000 most frequent words (i.e., those with low IDF), these are the words that I think would be good candidates for the stopwords:

['rsi', 'rdx', 'amd64', 'rdi', 'rcx', 'r13', 'r15', 'rax', 'r14', 'rbx', 'rbp', 'r10', 'rsp', 'r12', 'r9', 'var', 'webrtc', 'r11', 'nsithread', 'r8', 'iframe', 'wiki', 'plugin', 'login', 'opt', 'e10s', 'links', 'libxul', 'tree', 'node', 'exe', 'dmp', 'async', 'mach', 'blobber', 'docshell', 'xre', 'gre', 'init', 'dist', 'crashreporter', 'pushloghtml', 'mochikit', 'though', 'appdata', 'every', 'es', 'geckoview', 'temp', 'messageloop', 'etc', 'recv', 'python', 'args', 'addon', 'much', 'website', 'bin', 'self', 'mozrunner', 'around', 'many', 'mochitest', 'char', 'core', 'mochitests', 'cgi', 'nsthread', 'obj', 'addons', 'either', 'string', 'gpu', 'maybe', 'bool', 'simpletest', 'io', 'taskcluster', 'int', 'patch', 'xul', 'macintosh', 'webrender', 'however', 'let', 'gfx', 'ui', 'lib', 'residentfast', 'vsize', 'bugzilla', 'bit', 'might', 'void', 'tmp', 'const', 'bugs', 'ubuntu', 'reftest', 'google', 'macos', 'net', 'applewebkit', 'khtml', 'devtools', 'cc', '0a1', 'xhtml', 'xpcom', 'dll', 'en', 'marionette', 'css', 'toolkit', 'mac', 'android', 'instead', 'safari', 'www', 'javascript', 'moz', 'x11', 'ipc', 'searchfox', 'ns', 'html', 'pid', 'os', 'console', 'hg', 'chromium', 'js', 'hex', 'application', 'x86', 'issue', 'could', 'win64', 'would', 'http', 'x64', 'wpt', 'pr', 'linux', 'nt', 'dom', 'builds', 'url', 'sync', 'also', 'autoland', 'logviewer', 'window', 'chrome', 'github', 'code', 'tc', 'rv', 'web', 'ci', 'v1', 'logs', 'windows', 'agent', 'api', 'browser', 'repo', 'log', 'treeherder', 'id', 'str', 'gecko', 'bug', 'com', 'org', 'firefox', 'https', 'mozilla']

Also attached is a file containing the 1000 least frequent words (i.e., those with high IDF):

high_idf_words.txt

Please let me know what you think.

Thanks