mozilla / bugbug

Platform for Machine Learning projects on Software Engineering
Mozilla Public License 2.0

Train spambug model on components from all products #2786

Closed marco-c closed 2 weeks ago

marco-c commented 2 years ago

We also apply the model to bugs from other products, such as Thunderbird, but we only train on bugs from Firefox, Toolkit, etc. This means Thunderbird bugs are more likely to be wrongly marked as spam.

To fix this, we have to change our bug mining script (https://github.com/mozilla/bugbug/blob/18d270df3cdaa639093a37bde96bed639fb7470b/bugbug/bugzilla.py#L53) to also get Thunderbird (and a few other products: "Calendar", "Chat Core", "MailNews Core") bugs, but then we have to exclude these products from all other models and only use them in the spambug model.
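A minimal sketch of that split, assuming hypothetical names (`BASE_PRODUCTS`, `SPAM_ONLY_PRODUCTS`, `products_to_retrieve`, `bugs_for_model`) that are not taken from bugbug, and assuming each bug is a dict with a `product` key:

```python
# Hypothetical constants for illustration only; the real product list
# lives in bugbug/bugzilla.py and is much longer.
BASE_PRODUCTS = {"Core", "Firefox", "Toolkit"}
SPAM_ONLY_PRODUCTS = {"Thunderbird", "Calendar", "Chat Core", "MailNews Core"}


def products_to_retrieve():
    """Products the mining script would fetch: the base list plus the spam-only ones."""
    return BASE_PRODUCTS | SPAM_ONLY_PRODUCTS


def bugs_for_model(bugs, model_name):
    """Yield the bugs a given model should see.

    Only the spambug model keeps bugs from the spam-only products;
    every other model skips them.
    """
    for bug in bugs:
        if model_name != "spambug" and bug["product"] in SPAM_ONLY_PRODUCTS:
            continue
        yield bug
```

With something like this, the retriever fetches the larger set once, and each model other than spambug filters the extra products out when it loads its training data.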

Amotul-raheem commented 1 year ago

@marco-c and @suhaibmujahid I started looking into this, and I think spambug is trained across all products in bugs.json. While debugging I found 114 unique products, including the following: 'Product Innovation', 'Data Platform and Tools Graveyard', 'Firefox for iOS', 'Penelope Graveyard', 'addons.mozilla.org', 'Hello (Loop)', 'NSS', 'SeaMonkey', 'Pocket', 'B2GDroid Graveyard', 'Core', 'User Research', 'Developer Documentation Graveyard', 'UX Systems', 'Other Applications', 'Remote Protocol', 'Mozilla Labs', 'bugzilla.mozilla.org', 'Firefox', 'Firefox Health Report Graveyard', 'NSPR', 'Documentation Graveyard', 'Web Compatibility', 'Websites Graveyard', 'Chat Core', 'External Software Affecting Firefox', 'Developer Services', 'Developer Engagement', 'DevTools', 'developer.mozilla.org', 'Taskcluster', 'Localization Infrastructure and Tools', 'Tracking', 'Mozilla QA Graveyard', 'addons.mozilla.org Graveyard', 'Marketplace Graveyard', 'Input Graveyard', 'Firefox for Android Graveyard', 'Thunderbird', 'www.mozilla.org', 'Webtools', 'Conduit', 'Tree Management Graveyard', 'Data & BI Services Team', 'developer.mozilla.org Graveyard', 'DevTools Graveyard', 'Add-on SDK Graveyard', 'Webmaker Graveyard', 'Tech Evangelism Graveyard', 'Mozilla Metrics', 'Other Applications Graveyard', 'MailNews Core Graveyard', 'Webtools Graveyard', 'Developer Ecosystem', 'Toolkit Graveyard', 'Infrastructure & Operations Graveyard', 'Plugins Graveyard', 'bugzilla.mozilla.org Graveyard', 'Community Building', 'Firefox Affiliates Graveyard', 'mozilla.org', 'Core Graveyard', 'Minimo Graveyard', 'Location', 'Instantbird', 'MailNews Core', 'Mozilla Localizations', 'Websites', 'User Experience Design', 'Release Engineering Graveyard', 'Testing', 'Participation Infrastructure', 'Mozilla Foundation Communications', 'Mozilla Reps Graveyard', 'MozillaClassic Graveyard', 'Firefox Graveyard', 'Data Platform and Tools', 'Firefox Build System', 'Infrastructure & Operations', 'Invalid Bugs', 'Socorro', 'support.mozilla.org', 'Directory', 'Security Assurance', 'Developer Infrastructure', 'Testing Graveyard', 'Lockwise Graveyard', 'Cloud Services', 'Firefox for Metro Graveyard', 'Air Mozilla', 'Mozilla Labs Graveyard', 'L20n', 'External Software Affecting Firefox Graveyard', 'Camino Graveyard', 'Release Engineering', 'Snippets', 'WebExtensions', 'Content Services Graveyard', 'Calendar', 'Cloud Services Graveyard'...

Please correct me if I'm missing something here.

marco-c commented 1 year ago

@Amotul-raheem I think there are only a few bugs from Thunderbird because we are not retrieving them as part of https://github.com/mozilla/bugbug/blob/7a3a81ec4ba6b279fd37ec1f94847f6f74957a0f/bugbug/bugzilla.py#L28.

Normally, the Bugzilla retriever script retrieves the bugs we have manual labels for (https://github.com/mozilla/bugbug/tree/master/bugbug/labels) plus the bugs from that list of products. You probably see some Thunderbird bugs because we have manual labels for some of them, but they are too few for the spambug model.

Amotul-raheem commented 1 year ago

@marco-c I wrote a quick function to get the bug frequency for each product, and this is what I found. Thunderbird bugs appeared 1249 times:

{ 'Core': 23868, 'Firefox': 6694, 'Invalid Bugs': 3995, 'Toolkit': 2883, 'DevTools': 2521, 'Firefox Build System': 1924, 'Testing': 1565, 'Thunderbird': 1249, 'NSS': 801, 'GeckoView': 677, 'WebExtensions': 620, 'Cloud Services': 507, 'Remote Protocol': 408, 'SeaMonkey': 400, 'Data Platform and Tools': 386, 'Developer Infrastructure': 330, 'Firefox for Android Graveyard': 325, 'bugzilla.mozilla.org': 304, 'MailNews Core': 302, 'Release Engineering': 290, 'Core Graveyard': 259, 'Web Compatibility': 244, 'Firefox OS Graveyard': 196, 'Taskcluster': 140, 'Tree Management': 113, 'Calendar': 112, 'Infrastructure & Operations': 84, 'Fenix': 58, 'External Software Affecting Firefox': 51, 'Webtools': 38, 'DevTools Graveyard': 33, 'Socorro': 33, 'mozilla.org': 32, 'Mozilla Localizations': 30, 'Chat Core': 29, 'Firefox Graveyard': 26, 'Conduit': 25, 'Localization Infrastructure and Tools': 24, 'Bugzilla': 18, 'Firefox for iOS': 18, 'Data Science': 18, 'NSPR': 16, 'Add-on SDK Graveyard': 15, 'Mozilla QA Graveyard': 14, 'Developer Services': 13, 'www.mozilla.org': 13, 'support.mozilla.org': 12, 'Infrastructure & Operations Graveyard': 12, 'Testing Graveyard': 11, 'Websites': 9, 'Toolkit Graveyard': 9, 'Webtools Graveyard': 8, 'Firefox for Metro Graveyard': 8, 'Shield': 7, 'developer.mozilla.org Graveyard': 7, 'Other Applications': 7, 'Developer Documentation Graveyard': 6, 'Camino Graveyard': 5, 'Hello (Loop)': 4, 'Tech Evangelism Graveyard': 4, 'Mozilla Reps Graveyard': 3, 'Tree Management Graveyard': 3, 'addons.mozilla.org Graveyard': 3, 'Marketplace Graveyard': 3, 'Participation Infrastructure': 2, 'Data Platform and Tools Graveyard': 2, 'Firefox Health Report Graveyard': 2, 'Tracking Graveyard': 2, 'Other Applications Graveyard': 2, 'Tecken': 1, 'Penelope Graveyard': 1, 'Rhino Graveyard': 1, 'Firefox Affiliates Graveyard': 0, 'Firefox for FireTV Graveyard': 0, 'User Research': 0, 'Data & BI Services Team': 0, 'Location': 0, 'Mozilla Foundation Communications': 0, 'Mozilla Metrics': 0, 'Release Engineering Graveyard': 0, 'Tracking': 0, 'Cloud Services Graveyard': 0, 'Mozilla Labs': 0, 'Web Apps Graveyard': 0, 'Minimo Graveyard': 0, 'Mozilla Labs Graveyard': 0, 'B2GDroid Graveyard': 0, 'Directory': 0 }
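The function itself is not shown in the thread. A minimal sketch of such a count, assuming each bug is a dict with a `product` key; the name `count_bugs_per_product` and the `known_products` parameter are hypothetical, the latter added only because the zero entries above suggest the count started from a known list of products:

```python
from collections import Counter


def count_bugs_per_product(bugs, known_products=()):
    """Return {product: number of bugs}, sorted by count in descending order.

    Products listed in `known_products` that have no bugs are kept
    with a count of 0.
    """
    # Seed the counter with zeros so that empty products still appear.
    counts = Counter({product: 0 for product in known_products})
    counts.update(bug["product"] for bug in bugs)
    # most_common() orders by count, highest first, and keeps zero entries.
    return dict(counts.most_common())
```

For example, `count_bugs_per_product(bugs, known_products=["Directory"])` keeps a `'Directory': 0` entry when no bug in `bugs` belongs to that product.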

Also, I spent some time looking into the training process for spambug and this is what I found:

I can't seem to find where we check against the PRODUCTS in the file referenced above for the spambug model; the labels seem to be used mostly in the defect, defect_enhancement_task, etc. models.

suhaibmujahid commented 1 year ago

@Amotul-raheem Thank you for doing this!

> Thunderbird bugs appeared 1249 times

The Thunderbird product has more than 62K bugs, of which 9.4K were created in the last 30 months alone.

> I can't seem to find where we check against the PRODUCTS in the file referenced above for the spambug model; the labels seem to be used mostly in the defect, defect_enhancement_task, etc. models.

We check against the defined products here: https://github.com/mozilla/bugbug/blob/19f43fc1abb40faf2c273d4ad59d2e633523775d/bugbug/bugzilla.py#L161-L167

Then we combine these with the labelled bugs: https://github.com/mozilla/bugbug/blob/19f43fc1abb40faf2c273d4ad59d2e633523775d/scripts/bug_retriever.py#L59-L70

The retriever process generates the data/bugs.json file, which is used in all models.
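Roughly, the two steps amount to a set union of bug IDs. A simplified sketch, not the actual code at the links above; `select_bug_ids`, `candidate_bugs`, and `labelled_ids` are hypothetical names, and each bug is assumed to be a dict with `id` and `product` keys:

```python
def select_bug_ids(candidate_bugs, products, labelled_ids):
    """Return the IDs of the bugs that would end up in the dataset.

    A bug is selected if it belongs to one of the configured products
    OR if it has a manual label, whatever its product is.
    """
    from_products = {
        bug["id"] for bug in candidate_bugs if bug["product"] in products
    }
    return from_products | set(labelled_ids)
```

Under these assumptions, a Thunderbird bug only gets in through `labelled_ids`, which would explain why some Thunderbird bugs show up in the data even though the product is not in the list.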

Amotul-raheem commented 1 year ago

@suhaibmujahid Got it, makes sense. Thank you! I'll spend some more time looking into it.

Amotul-raheem commented 1 year ago

I spent some time looking into the code base and realized I was initially looking in the wrong place. I was running trainer.py locally with spambug as the argument, so I was just downloading the bugs.json file. After checking the data pipeline and looking through the model's dependencies, I figured out how the data is retrieved. Thanks for this!

marco-c commented 2 weeks ago

@jpangas looking again at your PR, there's something missing. You added additional products here https://github.com/mozilla/bugbug/blob/e1b8ced9c19e51c88501c1ada8e4e4ec2c5cdae4/bugbug/models/spambug.py#L138 but not here https://github.com/mozilla/bugbug/blob/e1b8ced9c19e51c88501c1ada8e4e4ec2c5cdae4/bugbug/models/spambug.py#L96.

jpangas commented 2 weeks ago

Thanks @marco-c, I made a PR (#4432) including the change.