Open scottgonzalez opened 9 years ago
Since https://github.com/scottgonzalez/github-export is in progress, do you think https://www.npmjs.com/package/github-request will satisfy all our needs on that matter, or shall we have to tweak it to suit our needs?
github-export is just a tool built on top of github-request to do the exact type of queries you'll want. You won't have to tweak github-request regardless of whether you use github-export.
@scottgonzalez: I could not find any team information on GitHub. Where can I get this info? I would like to know more about this feature and what your expectations for it are.
@g31pranjal Please see my earlier comment.
There are teams in jQuery and they look after certain repositories. Agreed,
but I'm not sure what @agcolom had in mind.
I wanted to know what is required of this feature and how I can incorporate it in my proposal. @scottgonzalez: anything you feel could be relevant to this? @agcolom: please help out with this.
Sorry, I'm not in front of a computer right now, but my views are totally in line with those of @scottgonzalez.
@scottgonzalez , @agcolom and @arschmitz : Should I calculate average age of issues/PRs for individual repositories or for all at once or both?
@gauravparmar As a minimum, it should be per repository so that we can measure and compare the health of each repo. In addition, altogether would be good also for comparison purposes.
Let's not focus on anything that spans every project until the features are done for the individual repositories. There's a lot of disjoint across projects and that will only increase over time.
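For illustration, a minimal sketch of the per-repository average age of open issues. It assumes the issues have already been stored locally as plain objects; the `repo` field and the function name are hypothetical, while `state` and `created_at` follow the field names GitHub returns for issues.

```javascript
// Average age (in milliseconds) of open issues, grouped per repository.
// `now` is passed in so the result is reproducible.
function averageOpenAgeByRepo(issues, now) {
  const totals = new Map();
  for (const issue of issues) {
    if (issue.state !== "open") continue; // closed items do not age
    const age = now - Date.parse(issue.created_at);
    const entry = totals.get(issue.repo) || { sum: 0, count: 0 };
    entry.sum += age;
    entry.count += 1;
    totals.set(issue.repo, entry);
  }
  const averages = {};
  for (const [repo, { sum, count }] of totals) {
    averages[repo] = sum / count;
  }
  return averages;
}
```

Keeping the grouping key per repository means a cross-project figure can be added later without changing how the individual numbers are computed.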
@agcolom , @scottgonzalez and @arschmitz :
As per the above discussion, I understand that this project is about showing overall statistics for a quick view, right? Hence showing the ages of individual issues/PRs will not be required at this moment, will it? Once the core features are completed, we can dive deeper into the details. What do you all think?
What time unit should be used to show age, average age etc.?
That depends on the value. Use the most appropriate unit.
Is there a need to show the age of individual issues/PRs as well?
No.
Should I separate the count of open PRs from issues while showing "Number of Open Issues" and the count of closed PRs at the time of displaying "Number of Closed Issues" for individual repositories?
Yes, we should treat them as completely separate entities.
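A small sketch of keeping the two counts apart. GitHub's issue listing also returns pull requests, and those items carry a `pull_request` key, so that key can be used to split them. The sample data and the function name below are made up for illustration.

```javascript
// Hypothetical sample shaped like items from GitHub's "list issues" endpoint.
const items = [
  { number: 1, state: "open" },
  { number: 2, state: "open", pull_request: {} },
  { number: 3, state: "closed" },
  { number: 4, state: "closed", pull_request: {} },
  { number: 5, state: "open" },
];

// Count open/closed items separately for issues and pull requests.
function countByTypeAndState(items) {
  const counts = {
    issues: { open: 0, closed: 0 },
    pullRequests: { open: 0, closed: 0 },
  };
  for (const item of items) {
    // Items with a `pull_request` key are PRs; everything else is an issue.
    const bucket = item.pull_request ? counts.pullRequests : counts.issues;
    bucket[item.state] += 1;
  }
  return counts;
}

console.log(countByTypeAndState(items));
```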
@scottgonzalez :
That depends on the value. Use the most appropriate unit.
You mean to say that I shall have to use multiple units depending on the value. For example, if the time is very short (seconds or minutes), I should use small units, and if the time is longer (hours or days), I should use a bigger unit, because milliseconds or seconds would produce a very large number in that case. Is this what you meant?
I shall have to decide range to switch among the time units as per the value.
@scottgonzalez Would that be ok with you?
You mean to say that I shall have to use multiple units depending on the value.
Yes.
I shall have to decide range to switch among the time units as per the value.
Just use an existing library like moment.js.
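To make the "most appropriate unit" idea concrete, here is a dependency-free sketch of what such formatting does. It is not moment.js's implementation; the function names are hypothetical, and a year is assumed to be 365 days for simplicity.

```javascript
// Unit sizes in milliseconds, largest first (a year is assumed to be 365 days).
const UNITS = [
  ["year", 365 * 24 * 60 * 60 * 1000],
  ["day", 24 * 60 * 60 * 1000],
  ["hour", 60 * 60 * 1000],
  ["minute", 60 * 1000],
  ["second", 1000],
];

function plural(count, name) {
  return `${count} ${name}${count === 1 ? "" : "s"}`;
}

// Format a duration with the largest unit that fits, plus the next
// smaller unit when that remainder is non-zero.
function formatAge(ms) {
  for (let i = 0; i < UNITS.length; i++) {
    const [name, size] = UNITS[i];
    if (ms >= size) {
      const parts = [plural(Math.floor(ms / size), name)];
      const next = UNITS[i + 1];
      if (next) {
        const rest = Math.floor((ms % size) / next[1]);
        if (rest > 0) parts.push(plural(rest, next[0]));
      }
      return parts.join(", ");
    }
  }
  return "0 seconds"; // anything shorter than one second
}

console.log(formatAge(2 * 60 * 1000));
```

A library handles the calendar details (leap years, months) that this sketch ignores, which is the main reason to prefer one in the real tool.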
I'd go with days, as this is what is going to make sense when reviewing health of repos.
Would you rather see something like "6 years, 50 days" or "2,050 days"? I'm ok if you really want to limit the max unit to days, but youngest could be something like "2 minutes" and I don't think you want to see "0 days" or even more strangely "0.0014 days."
I think it'd be good to show the age of the youngest and oldest in each repo graphically with the average also, to easily spot repos with high and low activity.
I don't think youngest and oldest will tell you anything about actual activity. You'll just see massively polarizing values in mature projects.
I'd go with days, as this is what is going to make sense when reviewing health of repos.
Would you rather see something like "6 years, 50 days" or "2,050 days"? I'm ok if you really want to limit the max unit to days, but youngest could be something like "2 minutes" and I don't think you want to see "0 days" or even more strangely "0.0014 days."
sure, yes, seen that way your suggestion is the best one.
I think it'd be good to show the age of the youngest and oldest in each repo graphically with the average also, to easily spot repos with high and low activity.
I don't think youngest and oldest will tell you anything about actual activity. You'll just see massively polarizing values in mature projects.
What I was thinking is that if your max age is 14 months and min age is 12 months, then nothing has been happening for a year and the repo is looking stale. However, you could have another repo with a similar average where min age could be 7 days and max age could be 3 years. That tells us there are still old issues to be resolved but activity is still taking place.
Technically all that tells you is someone recently filed an issue.
If your youngest issue is 12 months old, there are a few likely scenarios:
1) Your project is dead. Nobody uses it anymore and therefore nobody files issues anymore.
2) Your project is officially complete and stable. People still use the project, but by its nature it no longer requires updates and no new bugs are being discovered.
3) Your project is still active, every new bug gets fixed relatively quickly, but there are some old, nasty bugs that you just can't seem to tackle.
You actually cannot differentiate between the first two cases purely by looking at issue activity. You can differentiate the third case by looking at other issue metrics.
Yes. Now I also think that we cannot stick to one particular time unit, because it may result in an awkward number (like 0.003472222 days in the case of 5 minutes).
@scottgonzalez so would it be a good idea to also have an option to view other things such as latest PR/issue (whether closed or not)?
I'm not sure what value that would provide. You'll get much more insight into the project activity with a chart like http://bugs.jqueryui.com/ticketgraph.
@scottgonzalez @agcolom: Can we display the recent activity on a particular repository in a given time frame (say, 1 year, or rather, 1 month), apart from displaying overall activity? This can serve two purposes:
Can this solve the problem of comparison as well as health of a repository?
@g31pranjal it should be possible to select a timeframe. I think it'd be good to let the user do that either by selecting using a menu or graphically. Once we have all the data, this should not be a problem and charting libraries often offer this as default behavior.
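As a rough sketch of what selecting a time frame means on the data side, assuming the items are already stored locally: count what was opened and what was closed inside a window. `created_at` and `closed_at` are the timestamp fields GitHub reports for issues; the function name and the half-open window convention are assumptions.

```javascript
// Count items opened and closed within [from, to), given ISO 8601 timestamps.
function activityInWindow(items, from, to) {
  const start = Date.parse(from);
  const end = Date.parse(to);
  const inWindow = (timestamp) => {
    if (!timestamp) return false; // e.g. closed_at is null for open items
    const t = Date.parse(timestamp);
    return t >= start && t < end;
  };
  return {
    opened: items.filter((item) => inWindow(item.created_at)).length,
    closed: items.filter((item) => inWindow(item.closed_at)).length,
  };
}
```

A charting library's range selector would then only need to call this with the chosen bounds.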
@scottgonzalez, @agcolom, @arthurvr, @g31pranjal I also want to discuss a bit more how to measure the health of a repository. There are many factors which should be considered: issue age range and average age, commit frequency, youngest issues and their count, number of watchers, stars and forks made recently, etc. (mentors, please suggest more relevant factors).
Example: if your issue frequency is decreasing over time and you have gained a lot of stars and forks recently, then your code quality is improving, becoming less buggy and more useful.
Similarly for the examples given by @scottgonzalez above
But my question is how to measure it, and what weight should be given to each factor. Should we measure health relative to other repos in the org, add some default manual values for each factor, or combine both? There are also a number of ways to analyse the same thing.
@moizpalitanawala one way to measure health could also be that the number of open issues is under a specific number (to be defined) and that we're closing more issues than we open. We could also define various levels of 'health'.
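Purely as an illustration of that suggestion, and not a settled definition: a sketch that combines the two conditions into levels. The threshold `maxOpen`, the level names, and the input field names are all hypothetical.

```javascript
// Classify a repository from two signals: whether the open count is under a
// chosen limit, and whether more issues were closed than opened in a period.
function healthLevel({ open, openedInPeriod, closedInPeriod }, maxOpen) {
  const underLimit = open < maxOpen;
  const shrinking = closedInPeriod > openedInPeriod;
  if (underLimit && shrinking) return "good";
  if (underLimit || shrinking) return "fair";
  return "poor";
}

console.log(healthLevel({ open: 40, openedInPeriod: 9, closedInPeriod: 12 }, 50));
```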
@agcolom, @scottgonzalez We are also planning to provide this as an open-source tool for other organisations, so we will need a very flexible health measuring system. Stats which are great for one organisation might be below normal for others. The alternative is to keep only the jQuery org's needs in mind, release the tool, and let others modify it according to their requirements. However, that idea isn't appealing. The tool should just work if someone is using it!
@agcolom @moizpalitanawala Rather than defining 'health', wouldn't it be better to calculate and provide valuable insights about the repository based on the data we have in hand, as was the initial focus, and draw conclusions from each particular statistic? Web analytics, for example, does not tell you the state of your website; it gives you the parameters on which the site is measured and what can be inferred from them.
We can have a 'health' feature, but since it is relative, it cannot be used to determine the current state of a repository (even for repositories of the jQuery org). E.g. take repo jquery-ui: issues 0, Pull Requests 1500+ (none opened in the last week), and jquery: issues 350+, Pull Requests 1700+ (2 issues and 7 PRs opened last week). Both have nearly equal activity, but there is no correlation between their stats.
In this case it will be difficult to devise a common method to calculate overall health.
Please stick with the scope that was already defined.
A question that keeps coming to my mind is this:
"Why should we use a database when we have to update it by checking GitHub each time a user visits the Issue/PR Tracking System?"
I mean, each time a user visits the Issue/PR Tracking System, we shall have to compare the information in our database with the information on GitHub so that the user sees updated information. I understand that some fields will never change, but we would still have to make a comparison to verify that. So, to show updated information, we would be making many comparisons between GitHub and our database on every visit. Isn't that overhead? If we have to fetch the information from GitHub each time anyway, why use a database? Can't we directly use the information fetched from GitHub, since some of it becomes obsolete after a while?
I mean, each time a user visits the Issue/PR Tracking System, we shall have to compare the information in our database with the information on GitHub so that the user sees updated information.
Absolutely not. We only care about what's in the database, which is updated on a regular basis.
Can't we directly use the information fetched from GitHub, since some of it becomes obsolete after a while?
Without a database, you would never be able to do any kind of in-depth analysis in a reasonable amount of time, and you would probably hit the request limit.
@scottgonzalez: So the information presented to the user would come from the database, which might have been updated some time before, and hence it will not be a real time tracking as you said above in answering someone else's question?
Now I am clear about the database concept again.
I totally agree that analysis done on the database data will be faster than analysis done on data requested directly from GitHub, since the latter would involve many requests and take a lot of time to fetch the data before analyzing it.
it will not be a real time tracking as you said above in answering someone else's question?
Correct. See https://github.com/jquery/content/issues/4#issuecomment-83783181
Even if we had a full day delay between updates, the difference in actual stats would be negligible. Of course, we'll want to have more frequent updates than that, but the types of analysis mentioned above don't need real time tracking.
I have tried storing the data from 20 of the 44 jQuery repositories hosted on GitHub, and it is sufficient to crawl and update the data at max once per day. It takes barely 20 requests on average to the GitHub API to do this.
However, I was thinking of some kind of mechanism that can give preference to a more active repository (like jquery/jquery, jquery/jquery-mobile) to update more frequently than less active repositories. This can be done by finding the mean activity rate (MAR) of a particular repository using the data we have; the update frequency of a repository would then be proportional to its MAR.
I was planning on this method. Is it the right way to do this? @scottgonzalez @agcolom
it is sufficient to crawl and update the data at max once per day
Sufficient for what? I'd say that's a minimum, not a maximum.
However, I was thinking of some kind of mechanism that can give preference to a more active repository
Why? The time to find out that a repository has had no activity is less than a second and only costs one API call, right? I don't see how adding complexity helps here.
@scottgonzalez
it is sufficient to crawl and update the data at max once per day
Sufficient for what? I'd say that's a minimum, not a maximum.
Yeah, it's the minimum. I agree.
Why? The time to find out that a repository has had no activity is less than a second and only costs one API call, right? I don't see how adding complexity helps here.
- I wanted to save on computation and repeated calls to the API. The "updated_at" for a repository gets updated even if there is a comment, change in description, wiki pages etc. Hence, there will be more frequent "triggers" to update than we actually require (since we are interested in collecting only issues and pull requests).
- If there are updates in around 20 of the 44 repositories, fetching updates for all 20 together may reach the request limit (since I will be tracking all the open issues and PRs across all repos).
Now the question is: can we update our entire data set at once without reaching request limits (considering there may be more repos in the future)? If yes, then there is no need to add complexity!
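A back-of-the-envelope sketch for that question, under stated assumptions: each repository costs at least one request per update run, and results are paged at 100 items per request (the maximum page size of the GitHub API). The function name and the sample figures are hypothetical.

```javascript
const PER_PAGE = 100; // maximum page size accepted by the GitHub API

// Estimate the number of API requests for one update run, given how many
// changed items each repository has. A repo with no changes still costs
// one request to find that out.
function estimateRequests(changedCounts) {
  return changedCounts.reduce(
    (total, n) => total + Math.max(1, Math.ceil(n / PER_PAGE)),
    0
  );
}

// 44 repositories with a handful of changes each.
console.log(estimateRequests(new Array(44).fill(3)));
```

With only changed items being fetched, a full run over 44 repositories stays in the tens of requests, which is small next to the hourly limit GitHub grants authenticated clients.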
The "updated_at" for a repository gets updated even if there is a comment, change in description, wiki pages etc.
What "updated_at"? Why are you not just asking for all issues/PRs created after the last update time? See the `since` parameter in https://developer.github.com/v3/issues/#list-issues-for-a-repository.
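A minimal sketch of an incremental update built on that parameter. `since`, `state`, `per_page` and `page` are parameters of the linked endpoint; note that `since` selects items updated at or after the given time, so edited older issues come back too. The function names and the in-memory `Map` standing in for the database are assumptions, and no request is actually sent here.

```javascript
// Build the URL for one page of issues/PRs changed since the last sync.
function buildIssuesUrl(owner, repo, lastSync, page = 1) {
  const url = new URL(`https://api.github.com/repos/${owner}/${repo}/issues`);
  url.searchParams.set("state", "all"); // include closed items as well
  url.searchParams.set("per_page", "100");
  url.searchParams.set("page", String(page));
  if (lastSync) url.searchParams.set("since", lastSync); // ISO 8601 timestamp
  return url.toString();
}

// Upsert fetched items into the local store, keyed by issue number, so a
// re-fetched item replaces its stale copy instead of being duplicated.
function mergeIntoStore(store, fetched) {
  for (const item of fetched) store.set(item.number, item);
  return store;
}

console.log(buildIssuesUrl("jquery", "jquery", "2015-03-20T00:00:00Z"));
```

A repository with no activity returns an empty first page, so finding that out costs a single request.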
@scottgonzalez
Using `since` is an easy way to keep the issues updated.
And then updating all the repos at once makes sense. I was actually looking for a way to avoid updating all the open issues every time. Thanks for the advice.
Another thing: we also need to keep track of the repositories currently on GitHub (so as to learn about any new repo being opened). Does it make sense to do this automatically, or will there be some way to add a particular repository to track?
To avoid multiple people building the same thing with different ideas, I'm just going to stop answering implementation questions until there's a single proper implementation to be discussing.
@clarkbox FYI
@agcolom requested that I open a single issue with all requests.