oqtane / DNF.Projects

Sample Oqtane module demonstrating a scheduled job and JSInterop visualizations using Chart.js. This module powers the trend analysis on https://dnfprojects.org.
MIT License

Reporting for .NET nanoFramework activity should be improved #6

Closed - josesimoes closed this 3 months ago

josesimoes commented 1 year ago

Hi,

I'm with the Core Team of the project and I was looking at the project activity report following the Foundation newsletter. It turns out that the activity report only covers the Home repo of the GitHub organization. We use the Home repo only for tracking issues. All the development effort goes into the other repos and, because of this, it goes unnoticed. Please take a look at: https://github.com/orgs/nanoframework/insights?period=year.

How can this be improved? I'm OK with helping with this just need your guidance.

sbwalker commented 1 year ago

You are correct that the service collecting the metrics on dnfprojects.com currently only looks at the primary repository for each .NET Foundation project. There are a number of reasons why it takes this approach:

  1. Some projects are part of a larger organization - but that does not mean that every project within an organization was contributed to the .NET Foundation. In some cases only a single project was contributed. Therefore the metrics cannot be collected based on organization - they need to be collected by project. A good example: I have no idea whether all 96 repositories which are part of the nanoFramework organization have been officially contributed to the .NET Foundation.
  2. Some organizations are umbrellas for many unrelated projects, i.e. the "dotnet" organization on GitHub hosts 225 projects - some part of the .NET Foundation, and some not... some maintained by Microsoft, some not... some grouped with other projects, some stand-alone, etc...
  3. There is currently no way within the service to group projects together - it is based only on individual projects. So the service would need to be upgraded to track this information and aggregate the data.
  4. If the service were to be expanded to group multiple projects, there would need to be outreach to every project to validate and ensure all applicable repos are tracked, or else the results would be skewed (in the opposite direction from where they are now). This will be time consuming, but the new version could not be rolled out until this was completed.
  5. If the service were to be expanded to group multiple projects, the results which have been collected since 2020 would have no correlation to the new results (as the criteria would have been significantly altered) - so the trend analysis (which is the primary purpose) would be severely impacted. I am not sure if/how to resolve this issue.
  6. If the service were to be expanded to group multiple projects, the total number of repos being tracked would likely increase by 4-5X. Due to GitHub API throttling restrictions the service already takes ~30 minutes to run each day (to collect all of the metrics for ~100 projects)... therefore the resource consumption would increase significantly (the .NET Foundation does not currently pay for this consumption).

The modifications described above would only account for projects with multiple repos that need to be grouped - they would not deal with the many other anomalies which are a result of unique project behaviors (i.e. some projects use PRs exclusively, others use Commits without PRs, some have user docs in GitHub, others do not, some use automated bots to create PRs and Issues, others do not). At the end of the day, there will never be a system which everyone agrees provides a fair and perfect representation of metrics. However, having an imperfect system is much better than having no system at all.

Based on the above, my suggestion would be to update your URL in the service to the repo which tends to have the highest activity on a regular basis. This is essentially what other projects are doing, and although it is not 100% accurate, it is at least consistent.

josesimoes commented 1 year ago

Thank you for your very detailed explanation. I understand that tracking projects that span multiple repositories can be challenging, and that every project has its own specifics.

Regarding .NET nanoFramework: not all repos are "development" ones. Some of them are forks of other repos that we depend on, for which we need tweaks and/or specific changes; we found it was better to simply maintain a fork and keep those changes on a separate branch. In round numbers, 80 of those are "development" projects, ranging from firmware to C# libraries, CLI tools, VS and VS Code extensions, pipeline tools and GitHub Actions. Plus we keep all our issues under Home.

One of the visible outputs of these is the 200+ NuGet packages that we are responsible for (https://www.nuget.org/profiles/nanoframework). Adding to these, there are the firmware images that are hosted over at Cloudsmith: https://cloudsmith.io/~net-nanoframework/repos/

With the overall complexity of .NET nanoFramework it would be unmanageable to keep all this under a single repository!

Because of this diversity of repositories (and their purpose), it's hard to pick a repository that truly reflects the development pace and activity of the project... If I had to choose one, it would be https://github.com/nanoframework/nf-interpreter

Can you please let me know where to submit the PR to have this updated?

sbwalker commented 1 year ago

@josesimoes the application which captures and displays the activity metrics is open source and is maintained in this repo. The data identifying the .NET Foundation projects, as well as the metrics, are stored in a database running on Azure. I will update the URL for your project with the repo provided above.

josesimoes commented 1 year ago

Perfect. Thank you!

sbwalker commented 1 year ago

@josesimoes changing the repository URL significantly altered the criteria, resulting in what appears to be a massive spike (i.e. 2000+ PRs). Essentially I ran into item 5 I described earlier in this thread. This is not accurate and negatively impacts the validity of the entire dashboard.

My preference is to create a new project entry with the new URL (and retain the old project entry with the old URL) to keep the data separated and prevent this anomaly. I hope this makes sense.

I would have suggested that we could update the historic metrics for the new URL, but unfortunately GitHub does not have a way to access historic daily totals - which is the reason why this dnfprojects.com service was created.

josesimoes commented 1 year ago

I don't know the details on how data is collected and processed, so I have no way of offering advice.

It doesn't seem surprising that, considering the previous stats have always been based on a repo that has no activity (only issues), switching to a repo that's active can cause this spike...

Since the ultimate goal is to reflect (even if partially) the project's true activity, please go ahead and make the necessary adjustments to make this happen.

sbwalker commented 1 year ago

The details of how data is collected are fairly straightforward - on a daily basis it relies on the APIs exposed by GitHub to get a variety of metrics (multiple API calls are required to get all of the data, so the tricky part is dealing with the throttling restrictions). The code which collects the data is here: https://github.com/oqtane/DNF.Projects/blob/master/Server/Jobs/GithubJob.cs
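For illustration only, here is a minimal sketch of that approach (the real logic lives in GithubJob.cs linked above; the class name, method name, and backoff policy below are assumptions, not the actual implementation). It reads a cumulative PR total for a single repo from GitHub's search API and inspects the rate-limit headers that are the source of the throttling restrictions:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Minimal sketch, NOT the actual GithubJob implementation: fetch the cumulative
// PR count for one repo via the GitHub search API, with crude rate-limit handling.
public static class GithubMetricsSketch
{
    private static readonly HttpClient Http = CreateClient();

    private static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // GitHub rejects API requests that do not send a User-Agent header
        client.DefaultRequestHeaders.UserAgent.ParseAdd("DNF.Projects-sketch");
        return client;
    }

    // repo is in "owner/name" form, e.g. "nanoframework/nf-interpreter"
    public static async Task<int> GetPullRequestTotalAsync(string repo)
    {
        var url = $"https://api.github.com/search/issues?q=repo:{repo}+type:pr";
        using var response = await Http.GetAsync(url);

        // GitHub reports throttling state in response headers; a real daily job
        // would wait until the X-RateLimit-Reset time before the next request.
        if (response.Headers.TryGetValues("X-RateLimit-Remaining", out var remaining)
            && int.Parse(remaining.First()) == 0)
        {
            await Task.Delay(TimeSpan.FromMinutes(1)); // crude backoff
        }

        response.EnsureSuccessStatusCode();
        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        // total_count is the cumulative number of PRs matching the query
        return doc.RootElement.GetProperty("total_count").GetInt32();
    }
}
```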

I have created a new entry in the dnfprojects.com service for the nf-interpreter repo and also retained the entry for the home repo.

nf-interpreter - https://www.dnfprojects.com/*/27/View?id=116&from=Nov-09-2022&to=Dec-09-2022&metric=pr
home - https://www.dnfprojects.com/*/27/View?id=97&from=Nov-09-2022&to=Dec-09-2022&metric=pr

You can see that metrics are now being collected for both entries. It will take some time to get enough of a dataset for the nf-interpreter entry for it to be fully represented in the dashboard. This is because the dashboard is focused on growth trends - i.e. it compares the metric total for the first day in the specified period with the metric total for the last day in the period. So obviously the longer the period, the greater the difference (and hence the higher the growth) - but that assumes you have a full population of data for the entire period.
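Expressed as code, the growth calculation described above is just a difference of cumulative totals; this is a sketch under assumed type names, not the dashboard's actual code:

```csharp
using System;
using System.Collections.Generic;

// Assumed shape: one cumulative total per day, for one metric and one entry
public record DailyTotal(DateOnly Date, int Total);

public static class TrendSketch
{
    // Growth over a period = last day's cumulative total minus the first day's.
    // This is why a partially populated period (like a brand new entry) skews
    // the trend: the "first day" total is missing or artificially low.
    public static int Growth(IReadOnlyList<DailyTotal> period)
        => period[period.Count - 1].Total - period[0].Total;
}
```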

sbwalker commented 1 year ago

Note that I will be making some small enhancements to the service to include a friendly title for each entry, a description (which can be used in the search), tags (for classification), and assorted usability improvements related to other issues logged in this repo.

josesimoes commented 1 year ago

Got you! Thanks.

I'm cc'ing @ellerbach who has been taking care of the tool for analysis on nanoFramework NuGet downloads. Maybe he has something to add to this conversation.

sbwalker commented 1 year ago

Note that we currently only rely on GitHub metrics - not NuGet.

Ellerbach commented 1 year ago

@sbwalker as we have multiple repos, would it make sense to adjust the logic a bit and have a "group" view, i.e. the same logic but cumulated, or something equivalent?

sbwalker commented 1 year ago

@Ellerbach please review my detailed response higher up in this thread - it explains the challenges of the grouping approach.

Ellerbach commented 1 year ago

> @Ellerbach please review my detailed response higher up in this thread - it explains the challenges of the grouping approach.

What if we provide an endpoint already aggregating all of this every day for nanoFramework? The service would run before your service calls it, so we would be responsible for maintaining our 80+ repo list and would just provide you the needed data.
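Purely as an illustration of this proposal (the endpoint was never built, and every name below is a hypothetical assumption), the pre-aggregated daily payload might look something like:

```csharp
using System;

// Hypothetical shape for a daily snapshot aggregated across all tracked repos
public record AggregatedMetrics(
    string Organization,   // e.g. "nanoframework"
    DateOnly Date,         // the day the snapshot was produced
    int Repositories,      // how many repos were aggregated (the 80+ list)
    int PullRequests,      // summed cumulative PR total across those repos
    int Issues,            // summed cumulative issue total
    int Commits);          // summed cumulative commit total
```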

sbwalker commented 1 year ago

I would prefer to avoid implementing custom data aggregation for each project - it would not be scalable, would have many points of failure, could not be easily validated, etc...

Ellerbach commented 1 year ago

> it would not be scalable

I don't agree here; it moves the responsibility for having the metrics available to nanoFramework, using the exact same mechanism and even the exact same code as the system you have in place, but aggregating the results.

> would have many points of failure

Only one additional: our metric collector has to be running. It's the same as if, for some reason, the collection for the repo you are checking for us were to fail.

> could not be easily validated

The idea is to use the same code you're using for collecting the metrics in one of our repositories, aggregate the results, and have a file checked in automatically by the metric collector. So it can easily be audited, if that's the concern.

sbwalker commented 1 year ago

@Ellerbach my apologies, I should have provided more context. The reason I said that the suggestion is not scalable, has many points of failure, and is hard to validate is because I am looking at it from a holistic perspective of how to serve the needs of the .NET Foundation with its 100+ Member Projects. This service is used by the .NET Foundation Project Committee so the way it collects metrics needs to be consistent for all projects. Adding a custom mechanism for the nanoFramework does not achieve that goal. And creating a general abstraction which any project could use to feed in daily metrics would not be scalable, have many points of failure, and would be hard to validate. The main focus must be on simplicity, automation, consistency in data collection for ALL projects, and publicly accessible metrics which can be validated by anyone (the .NET Foundation needs to be able to verify their authenticity). I hope this helps explain my earlier comment.

Ellerbach commented 1 year ago

> Adding a custom mechanism for the nanoFramework does not achieve that goal

We can design it to run in a transparent and scalable way for others in the same situation.

> And creating a general abstraction which any project could use to feed in daily metrics would not be scalable, have many points of failure, and would be hard to validate

Can you give examples of those? I don't really see an issue here!

> publicly accessible metrics which can be validated by anyone

That is the reason I mentioned it should be run by code that can be verified, with the data committed by the pipeline into the repo running it.

sbwalker commented 1 year ago

I already explained in my original response that out of the 100+ projects in the .NET Foundation, there are many with multiple repos which are part of their organization. But that does not mean that every project within an organization was contributed to the .NET Foundation - usually it was just a single repo, and the .NET Foundation cannot assume that the IP in other repos should be treated in the same way. For example... I do not believe that all 96 repositories which are part of the nanoFramework organization have been officially contributed to the .NET Foundation. Some organizations are umbrellas for many unrelated projects, i.e. the "dotnet" organization on GitHub hosts 225 projects - some are part of the .NET Foundation, and some are not... some are maintained by Microsoft, some are not... some are grouped with other projects, some are stand-alone, etc... So rolling a change out to all 100+ projects in a consistent manner would be a challenge - I stand by my comments above. At this point I am not sure it's useful to overhaul a service which has already been serving the needs of the Project Committee for the past 2 years.

Ellerbach commented 1 year ago

OK, so what I read from your answer is that we're not the only ones in this situation. I do understand the challenge of having something that works for everyone, and from your description I understand that the current system works for projects with a single repo, which in most cases is not the situation. So I'm happy to help work on a solution that will work for everyone. For the repositories, there are ways to tag them automatically, for example by crawling them and finding a specific file - things like this would make it work for everyone. In our case about 80+ repos are related to the .NET Foundation. So if you're interested in brainstorming how to solve this for everyone in an inclusive way, let me know.

sbwalker commented 1 year ago

You are correct that other projects have multiple repos - in fact, even my own Oqtane project has numerous repos on GitHub which are not currently tracked by the service.

And I believe I already described the solution in my earlier post in this thread...

The current automated service is already capable of collecting the required metrics on a daily basis for each repo - so this would not require any changes whatsoever.

The dashboard application would need to be enhanced with the concept of a higher level "organization" where multiple repos could be associated to it. Then the query which collates the data for the dashboard would need to group the data by organization. This would be the easy part.
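As a minimal sketch of that enhancement (the entity and property names below are assumptions; the actual schema lives in the Azure database and is not shown in this thread):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity: each tracked entry optionally belongs to an organization
public class ProjectEntry
{
    public int Id { get; set; }
    public string Url { get; set; }          // the single tracked repo URL
    public int? OrganizationId { get; set; } // null = stand-alone project
}

public static class DashboardSketch
{
    // Collate per-entry metric totals into one total per organization,
    // i.e. the "group the data by organization" step described above.
    public static Dictionary<int, int> TotalsByOrganization(
        IEnumerable<ProjectEntry> entries,
        IReadOnlyDictionary<int, int> totalsByEntryId)
    {
        return entries
            .Where(e => e.OrganizationId.HasValue)
            .GroupBy(e => e.OrganizationId.Value)
            .ToDictionary(g => g.Key, g => g.Sum(e => totalsByEntryId[e.Id]));
    }
}
```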

Then there would need to be outreach to all 100+ .NET Foundation projects to identify all applicable repos which should be tracked. This will be time consuming, as maintainers are busy and do not always respond immediately. In addition, the information would need to be cross-referenced with the legal contribution agreements to validate that the repos were indeed contributed. If the repos were not contributed then there may need to be additional agreements put in place. This will be a time consuming process as the .NET Foundation has no full time resources - only volunteers. I expect it will be especially time consuming to sort through the Microsoft-owned projects in the .NET Foundation.

It is only after the outreach is completed that the new dashboard could be put into production - however this will result in some very strange results for a period of time, as the criteria for collecting data will have changed drastically. It will result in massive spikes as it compares the older data set to the newer one. This will make the results fairly useless when you try to do trend analysis over a longer period of time. I am not sure how to solve this problem - it is a classic big data problem related to changing the way data is collected mid-stream.

Once the new system is in production it would increase the processing time for collecting the daily metrics by 3-4X. This is fine and I can appeal to the .NET Foundation for cost relief for the extra Azure consumption.

The benefit to the approach above is that the metrics are still collected directly from GitHub daily - so the service has minimal dependencies, and the metrics require no additional validation.

The question is whether this enhancement to the service is high enough priority in comparison to the other volunteer responsibilities of the Project Committee at this time. I understand that you are saying the community can help with the enhancements - but ultimately that is only the technical enhancements, not the outreach or production rollout, which is where the bulk of the work resides. I will raise this question at our next monthly Project Committee meeting.

Ellerbach commented 1 year ago

Great. Thanks a lot for the detailed explanation. Happy to contribute here. And I do now understand the production challenge and the verification challenge much better.

One additional question, to understand what can be automated: is there an automatic way to check whether a specific repo is affiliated with the Foundation? If yes, then except for the production part, it seems that all the rest can be automated, which is good news. It's clearly a lot of work, but work that can then be done in an automated way.

So I'm looking forward to the answer from your next committee meeting!