mantono / DuplicateSearcher

Identification of Duplicate Tickets in Issue Tracking Systems for Software Development

Filter out URLs from issues and comments #24

Closed mantono closed 8 years ago

mantono commented 8 years ago

URLs do not offer any additional value, especially since they are broken up into smaller tokens that do not preserve the context or intent of posting the URL. Certain parts of a URL are actually detrimental to the duplicate identification algorithms, since almost every issue containing a URL will contain either http or https as a token while possibly having nothing else in common. All URLs should therefore be filtered out. However, it is important that any mention of http or https is kept when it is not part of a URL.
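
A rough sketch of what such a filter could look like, written in Python for illustration only. The names `URL_PATTERN`, `strip_urls`, and `tokenize` are hypothetical and not taken from this project, and the pattern assumes a URL is simply `http://` or `https://` followed by a run of non-whitespace characters, so punctuation attached directly to a URL is removed along with it.

```python
import re
from typing import List

# Assumption: a URL is "http://" or "https://" followed by non-whitespace.
# A bare "http" or "https" (no "://") does not match and is therefore kept.
URL_PATTERN = re.compile(r"\bhttps?://\S+", re.IGNORECASE)


def strip_urls(text: str) -> str:
    """Remove URLs from text, leaving bare mentions of http/https intact."""
    without_urls = URL_PATTERN.sub(" ", text)
    # Collapse the extra whitespace left behind by removed URLs.
    return " ".join(without_urls.split())


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: lowercase alphanumeric runs, after URL removal."""
    return re.findall(r"[a-z0-9]+", strip_urls(text).lower())


if __name__ == "__main__":
    comment = "Redirect from http to https fails, see http://example.com/x"
    print(tokenize(comment))
```

With this approach, the words http and https in "Redirect from http to https fails" survive as tokens, while the URL at the end of the comment is dropped before tokenization.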