mantono / DuplicateSearcher

Identification of Duplicate Tickets in Issue Tracking Systems for Software Development

Filter out pull requests from issues #16

Closed mantono closed 8 years ago

mantono commented 8 years ago

From https://developer.github.com/v3/issues/#list-issues-for-a-repository

Note: In the past, pull requests and issues were more closely aligned than they are now. As far as the API is concerned, every pull request is an issue, but not every issue is a pull request.

This endpoint may also return pull requests in the response. If an issue is a pull request, the object will include a pull_request key.

We will have to check the pull_request key and remove pull requests from our issue collection, as they will not contribute anything to our artefact.
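
A minimal sketch of that check, assuming each issue object from the API response has already been decoded into a Map<String, Object>. The class and method names are hypothetical and not part of DuplicateSearcher, and the JSON decoding step is left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class RawIssueFilter {
    private RawIssueFilter() {}

    // Per the API note quoted above, an issue object carries a
    // "pull_request" key only when it is a pull request.
    static boolean isPullRequest(Map<String, Object> issue) {
        return issue.containsKey("pull_request");
    }

    // Returns a new list holding only the regular issues, in their original order.
    static List<Map<String, Object>> removePullRequests(List<Map<String, Object>> issues) {
        List<Map<String, Object>> regular = new ArrayList<>();
        for (Map<String, Object> issue : issues) {
            if (!isPullRequest(issue)) {
                regular.add(issue);
            }
        }
        return regular;
    }
}
```

Returning a new list rather than mutating the input keeps the downloaded data intact, which might help when comparing the issue count before and after filtering.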

mantono commented 8 years ago

After doing some testing, I have come to the conclusion that pull requests are, in fact, not included when fetching regular issues from the API. I still have no idea why we download more issues than currently exist in the repository, but it seems to have nothing to do with pull requests.

mantono commented 8 years ago

It seems like I was wrong :( This issue was in the tdesktop data set and caused the program to crash, because its empty body resulted in a NullPointerException. That bug has now been fixed, but the offending issue is a pull request and not a regular issue, which means that pull requests are not being filtered out.

mantono commented 8 years ago

After some investigation, it has now been found that pull requests can be detected with if(entry.getKey().getPullRequest().getHtmlUrl() != null). If the URL is null, then the entry is not a pull request. Unfortunately, all previously downloaded data was downloaded without this check, and all issues without any comments were discarded as well (#22). We will therefore have to download all data sets again (#1).
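
As a rough sketch of how that check could be applied to a whole collection: the nested Issue and PullRequest classes below are hypothetical stand-ins that model only the two getters used in the condition above, not the client library's real types. The extra null check on getPullRequest() itself is a defensive assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; names are not taken from DuplicateSearcher.
final class PullRequestFilter {
    private PullRequestFilter() {}

    // Stand-in for the library's pull request object.
    static final class PullRequest {
        private final String htmlUrl;
        PullRequest(String htmlUrl) { this.htmlUrl = htmlUrl; }
        String getHtmlUrl() { return htmlUrl; }
    }

    // Stand-in for the library's issue object.
    static final class Issue {
        private final int number;
        private final PullRequest pullRequest;
        Issue(int number, PullRequest pullRequest) {
            this.number = number;
            this.pullRequest = pullRequest;
        }
        int getNumber() { return number; }
        PullRequest getPullRequest() { return pullRequest; }
    }

    // An issue counts as a pull request when its pull request URL is non-null.
    static boolean isPullRequest(Issue issue) {
        PullRequest pr = issue.getPullRequest();
        return pr != null && pr.getHtmlUrl() != null;
    }

    // Keeps only the regular issues, preserving their order.
    static List<Issue> regularIssuesOnly(List<Issue> issues) {
        return issues.stream()
                .filter(issue -> !isPullRequest(issue))
                .collect(Collectors.toList());
    }
}
```

Applying the filter right after download, before anything is written to disk, would avoid another full re-download if the stored format changes later.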