mantono / DuplicateSearcher

Identification of Duplicate Tickets in Issue Tracking Systems for Software Development

Filter "Low Quality" issues based on average size of issues #66

Open mantono opened 7 years ago

mantono commented 7 years ago

Considering an issue to be of low quality (maybe "low quantity" is more fitting) only because its content length is below a fixed word threshold is probably not the best option. More and more projects use issue templates, and the template alone may contain as many as 50 words, which would make filtering on a fixed size (especially such a low one) more or less useless. But if issues that contain little descriptive information are not removed, issues like this one get a cosine similarity of about 0.50 to other (proper) issues, which is not satisfactory at all.

A possible solution is to count how many words (or unique terms, undecided :question:) each issue contains and compute the average over all issues. Any issue whose count is less than 25% of that average is to be considered of low quality/quantity. The number 25 can of course be discussed; a lower or higher (most likely lower) number may be realistic as well.
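
To make the proposal concrete, here is a rough sketch in Python (not the project's actual code). The function names, the `ratio` parameter, and the whitespace-based word count are assumptions made for illustration only; swapping `word_count` for a unique-term count would cover the other variant mentioned above.

```python
from statistics import mean


def word_count(text: str) -> int:
    """Count whitespace-separated words (assumed tokenization, not the project's)."""
    return len(text.split())


def split_by_relative_length(issues: dict, ratio: float = 0.25):
    """Split issues into (kept, dropped) relative to the average length.

    `issues` maps an issue id to its body text. An issue is dropped when its
    word count is strictly below `ratio` times the average word count.
    The default of 0.25 mirrors the 25% suggested above and is open to tuning.
    """
    if not issues:
        return [], []

    counts = {issue_id: word_count(body) for issue_id, body in issues.items()}
    threshold = ratio * mean(counts.values())

    kept = [issue_id for issue_id, n in counts.items() if n >= threshold]
    dropped = [issue_id for issue_id, n in counts.items() if n < threshold]
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical sample data: two normal issues and one very short one.
    sample = {
        1: "word " * 100,
        2: "word " * 80,
        3: "word " * 10,
    }
    kept, dropped = split_by_relative_length(sample)
    print("kept:", kept)
    print("dropped:", dropped)
```

Because the threshold follows the average, a repository where every issue starts from a 50-word template would simply get a higher cutoff. One caveat of this sketch is that a few very long issues can pull the average up, so the median might be worth trying as well.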