nkiru-ede / MavenNetworkStudy

Create Release by Year Stats #17

Open jensdietrich opened 1 month ago

jensdietrich commented 1 month ago

For each GA and Year, count the number of GAVs for this GA released in this year, and for this year create a boxplot from those numbers.

Then do the same only for the top GAs, i.e. the GAs with the most dependents (incoming dependencies) in this year.

Put both series of boxplots in one chart to compare. This will enable us to explore correlations between popularity and maintenance/activity.

image

nkiru-ede commented 1 month ago

@jensdietrich please review the chart below:

image

jensdietrich commented 1 month ago

Thanks @nkiru-ede -- I think you misunderstood this: the y-axis should not be "number of dependencies" but "number of releases (GAVs)" for this year.

E.g. foo:foo released 1.0.0, 1.0.1 and 1.1.0 in 2010. Then the datapoint for this GA that would go into the analysis is "2010,3".
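
A minimal sketch of this counting rule, assuming each release is available as a (group, artifact, version, year) tuple. That record layout, the function name `releases_per_ga_year`, and the extra 2011 release in the sample are assumptions for illustration, not taken from the project's scripts.

```python
from collections import Counter


def releases_per_ga_year(gavs):
    """Count released GAVs per (GA, year).

    `gavs` is an iterable of (group_id, artifact_id, version, year) tuples;
    this layout is an assumption made for the example.
    """
    counts = Counter()
    for group_id, artifact_id, _version, year in gavs:
        counts[(f"{group_id}:{artifact_id}", year)] += 1
    return counts


sample = [
    ("foo", "foo", "1.0.0", 2010),
    ("foo", "foo", "1.0.1", 2010),
    ("foo", "foo", "1.1.0", 2010),
    ("foo", "foo", "2.0.0", 2011),  # hypothetical extra release in another year
]
counts = releases_per_ga_year(sample)
print(counts[("foo:foo", 2010)])  # 3, i.e. the datapoint "2010,3"
```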

cc @ulizue

nkiru-ede commented 1 month ago

@jensdietrich see the attached. This is a plot of the artifact releases (count of GAV for each GA plotted across the years, top 100 GAs, and the median GA for the 2)

image

nkiru-ede commented 1 month ago

@jensdietrich I mean median points*

image

jensdietrich commented 1 month ago

@nkiru-ede thanks, this looks more like what I expected. Could you please still write some tests for this, or sample the data for consistency? Also, how do you decide what is in the top hundred?

A follow-up question: what explains the dip in 2017 -- since this is only about 100 components, we should be able to explain this.

nkiru-ede commented 1 month ago

@jensdietrich In the chart above, the top 100 GAs are the ones with the highest number of GAV releases. The dip in 2017 reflects the dataset itself: fewer GAVs were released in 2017 (459374) than in 2016 (830559).

I have also included another chart here where all GAs are still plotted according to the number of GAV releases, but the top 100 GAs are determined as those with the highest number of dependencies.

image

jensdietrich commented 1 month ago

@nkiru-ede thanks -- the last chart is confusing, let's talk about this on Friday. Re the 2017 dip -- how do you then explain the increase for the top 100 from 2017 to 2018? There must be something else going on here.
cc @ulizue

jensdietrich commented 4 weeks ago

@nkiru-ede from the discussion in the meeting: we need to select the top 100 by dependent count (in-degree), but the values plotted are releases (i.e. GAVs for a GA in a given year).

Also, if there are datapoints with very high y-values, please sample them.
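
The selection rule above (rank by in-degree, plot release counts) could be sketched as follows. The two dictionaries, their key layout, and the name `top_n_release_counts` are assumptions for illustration; the sample numbers are invented.

```python
def top_n_release_counts(dependents, releases, year, n=100):
    """Release counts in `year` for the n GAs with the most dependents in `year`.

    dependents: dict mapping (ga, year) -> number of dependents (in-degree)
    releases:   dict mapping (ga, year) -> number of GAVs released
    Both layouts are assumptions made for the example.
    """
    in_year = [(ga, deg) for (ga, y), deg in dependents.items() if y == year]
    # Rank by in-degree, highest first; break ties by GA name for determinism.
    ranked = sorted(in_year, key=lambda item: (-item[1], item[0]))
    top = [ga for ga, _deg in ranked[:n]]
    # The value that gets plotted is the release count, not the in-degree.
    return {ga: releases.get((ga, year), 0) for ga in top}


# Invented sample data: a:a releases often but has few dependents.
dependents = {("a:a", 2016): 50, ("b:b", 2016): 900, ("c:c", 2016): 300, ("b:b", 2017): 10}
releases = {("a:a", 2016): 40, ("b:b", 2016): 2, ("c:c", 2016): 5}
top2 = top_n_release_counts(dependents, releases, 2016, n=2)
print(top2)
```

With these inputs, a:a is left out of the top 2 even though it has the most releases, which is the distinction the meeting note draws.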

nkiru-ede commented 4 weeks ago

@jensdietrich the GA with the highest release datapoint is com.amazonaws:aws-java-sdk-handwritten-samples. I have also included a file containing the top 100 GAs for 2016. From my investigation, this data tallies with release_all and also with the releases on Maven Central. top_100_components_2016.csv

jensdietrich commented 4 weeks ago

@nkiru-ede thanks, interesting !

I can confirm that:

  1. grep -c 'com.amazonaws:aws-java-sdk-handwritten-samples' release_all.csv returns 829!
  2. grep -c 'com.amazonaws:aws-java-sdk-handwritten-samples.*"2016' release_all.csv returns 827!
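
For anyone repeating this check from Python rather than the shell, here is a rough equivalent of `grep -c`. The sample rows are invented, since the column layout of release_all.csv is not shown in this thread, and `count_matching_lines` is a hypothetical helper. Unlike the grep patterns above, the sketch escapes the GA so that its dots match literally.

```python
import re


def count_matching_lines(lines, pattern):
    """Count the lines that contain a match for `pattern`, like `grep -c`."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


ga = "com.amazonaws:aws-java-sdk-handwritten-samples"
# Invented rows; the real layout of release_all.csv is an assumption here.
rows = [
    f'"{ga}:1.10.1","2016-01-05"',
    f'"{ga}:1.10.2","2016-01-12"',
    f'"{ga}:1.9.0","2015-11-30"',
    '"org.example:other:1.0","2016-02-01"',
]
total = count_matching_lines(rows, re.escape(ga))
in_2016 = count_matching_lines(rows, re.escape(ga) + r'.*"2016')
print(total, in_2016)
```

To run it against the real file, pass an open file object as `lines`.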

Looking at the top 100 by release count, I would expect that none of those is in the top 100 by dependents -- so those would be outliers in the general dataset. But looking at the boxplot above, it seems that there are also outliers in the top 100 by in-degree (dependents). What are those?

I think if we just look at the two median curves, even with those outliers, we will still see what we expected: that popular components release more often. The outliers are good for the paper as they add some colour to the story.

Please still generate the graph as discussed today (https://github.com/nkiru-ede/MavenNetworkStudy/issues/17#issuecomment-2306079616) .

cc @ulizue

nkiru-ede commented 4 weeks ago

@jensdietrich The chart below shows the top 100 GAs by dependent count.

In the case of amazonaws, it doesn't appear true that popular components release more; in the case of wso2, however, it does. I have again included sample data for 2016: GAs in the top 100 by GAV releases and GAs in the top 100 by dependent count.

image

top_100_components(by GAV release count)_2016.csv top_100_components(by dependent-count)_2016.csv

below are also those of 2017: top_100_components(by GAV release count)_2017.csv top_100_components(by dependent-count)_2017.csv

and those of 2018: top_100_components(by GAV release count)_2018.csv top_100_components(by dependent-count)_2018.csv

jensdietrich commented 4 weeks ago

@nkiru-ede the boxplots don't show us much, could you just plot the medians? Then we should start seeing some differences as the scale would be very different.

nkiru-ede commented 3 weeks ago

@jensdietrich the medians image

jensdietrich commented 3 weeks ago

@nkiru-ede thanks -- this needs some QA - my gut feeling is that the 2 values for "all" are too high. Could you please sample and confirm this ? We could also report averages (means) in the same chart.
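
A small sketch of how the mean and median per year could be computed side by side, assuming release counts are keyed by (GA, year). The key layout, the function name, and the sample values are assumptions for illustration. The 2010 sample includes one GA with many releases to show how a single outlier pulls the mean well above the median.

```python
from collections import defaultdict
from statistics import mean, median


def mean_median_per_year(release_counts):
    """Return {year: (mean, median)} of release counts.

    release_counts: dict mapping (ga, year) -> number of GAVs released.
    The layout is an assumption made for the example.
    """
    by_year = defaultdict(list)
    for (_ga, year), n in release_counts.items():
        by_year[year].append(n)
    return {year: (mean(v), median(v)) for year, v in sorted(by_year.items())}


# Invented sample: c:c is an outlier in 2010.
sample = {
    ("a:a", 2009): 1, ("b:b", 2009): 3,
    ("a:a", 2010): 2, ("b:b", 2010): 4, ("c:c", 2010): 30,
}
stats = mean_median_per_year(sample)
print(stats[2010])  # mean 12, median 4
```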

nkiru-ede commented 3 weeks ago

@jensdietrich Here is a plot of the means and medians in the same chart. I have also sampled the data for 2010 and 2009, which correspond with what is in the dataset. Is there any particular year you would like me to sample?

image

top_100_components(by dependent-count)_2010.csv top_100_components(by dependent-count)_2009.csv

jensdietrich commented 3 weeks ago

@nkiru-ede I just noticed that the last chart still uses dependent-count as the y-axis, we want number of releases here! See comment https://github.com/nkiru-ede/MavenNetworkStudy/issues/17#issuecomment-2306079616 from a previous meeting when we discussed this.

nkiru-ede commented 3 weeks ago

@jensdietrich yes, the y-axis is the number of releases. I had forgotten to modify the label.

image

jensdietrich commented 3 weeks ago

@nkiru-ede the values look too high. If a GA has 0 releases in a given year, is this counted and part of the computation? Did you test for this?

jensdietrich commented 3 weeks ago

@nkiru-ede just to add, I think we should count as follows (but what do you think ?):

nkiru-ede commented 3 weeks ago

@jensdietrich Currently, if a GA has no release in a year, it is not counted. I don't know what the benefit of counting a GA for a particular year will be if it didn't release any GAV in that year. If we want to see usage, releases, maintenance, and popularity, won't counting a release for a GA when it didn't release anything negate that? Please tell me more about the significance.

Also, which of the charts do you think is high? Because the mean and median (for all GA and top 100 GA) are calculated from the below chart where the top 100 is first computed with the number of dependencies they have, and then number of releases for those GAs are plotted. I already sampled this and you confirmed it.

image

image

jensdietrich commented 3 weeks ago

@nkiru-ede once a GA is in the repo it is available and can be used. It is never removed from the repo (npm allowed this at some stage, and it had some unexpected side-effects; have a look at https://en.wikipedia.org/wiki/Npm_left-pad_incident). Zero releases in a year means available, but not maintained in this year, so this is really interesting for us IMO.

nkiru-ede commented 3 weeks ago

@jensdietrich I am now putting a count of 1 for GAs that existed in prior years without a release in the currently computed year, and I don't see much difference in either the general chart or the median/mean chart, just a slight decrease in the mean.

see2 meansee

jensdietrich commented 3 weeks ago

@nkiru-ede as just discussed in the meeting, here is a scenario to explain this:

year: 2010
GA1 -- 2 releases
GA2 -- 4 releases
GA3 -- 0 releases (but had one in 2009)
GA4 -- 0 releases (but first release will be in 2011)

The mean should be 6/3 = 2 (GA3 counts with 0 releases, GA4 is excluded because it does not exist yet in 2010), but the mean is currently 7/2.

TODO: turn this scenario into a regression test, fix script
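
One way the scenario could be turned into a regression test is sketched below. It is not the project's actual script: the data layout and the name `counts_for_year` are assumptions, as are the first-release years of GA1 and GA2 (taken to be 2010) and the size of GA4's 2011 release count.

```python
from statistics import mean


def counts_for_year(releases, first_release_year, year):
    """Release counts for every GA that already exists in the repo in `year`.

    releases:           dict mapping (ga, year) -> number of GAVs released
    first_release_year: dict mapping ga -> year of its first release
    A GA with no release in `year` counts as 0 if it was first released
    in or before `year`; GAs first released later are left out.
    """
    return {
        ga: releases.get((ga, year), 0)
        for ga, first in first_release_year.items()
        if first <= year
    }


# Scenario from the meeting, with assumed first-release years for GA1 and GA2.
releases = {("GA1", 2010): 2, ("GA2", 2010): 4, ("GA3", 2009): 1, ("GA4", 2011): 1}
first_release_year = {"GA1": 2010, "GA2": 2010, "GA3": 2009, "GA4": 2011}

counts = counts_for_year(releases, first_release_year, 2010)
assert counts == {"GA1": 2, "GA2": 4, "GA3": 0}  # GA4 is excluded
assert mean(list(counts.values())) == 2  # 6/3, not 7/2
```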

nkiru-ede commented 2 weeks ago

@jensdietrich this is updated and tested. Used a lineplot to show data instead

image

jensdietrich commented 2 weeks ago

@nkiru-ede -- did you write the test cases for this as discussed? How can I run them? Also, is this chart now for all or for the top 100 by popularity?

nkiru-ede commented 2 weeks ago

@jensdietrich please discard the previous chart. I was initially not excluding components whose first release is after the currently computed year (in the case of the test data, this will be 2011).

see below chart for both all and top 100 image

How to run it:

The test script is here: test script
The test data are here: test data (https://github.com/nkiru-ede/MavenNetworkStudy/blob/main/Project/data/GA_test.csv)
And the main script is here: main script

You can clone the latest MavenNetworkStudy, navigate to the test folder and run the test script - meanMedianGaTest.py

jensdietrich commented 2 weeks ago

thanks @nkiru-ede - let's run the tests in the next meeting. I am still sceptical about the low median of the top 100. That means that most of them (>50) did not have any release during those years. Can this be the case?

nkiru-ede commented 2 weeks ago

@jensdietrich Firstly, please do not be upset about these iterations. I made sure to test all stages while writing the script, but for some reason I did not catch this, and the test data was too small to catch it at unit testing.

What is happening here comes from the approach I initially took, which was to fill the top 100 dataset with counts of 0 where components existed in prior years but not in the present year, as I did in the all-GA data, instead of just filtering the all-GA data (which already captured this logic) to get the top 100 and stopping there. This increased the denominator for the mean computation of the top 100, resulting in low means, and affected the median computation as well. Here is the initially computed top 100 data: top_100_components(initial).zip

See below corrected:

top100_components(corrected).zip

I have also attached the all-GA data if you are interested in confirming this as well: all_components.zip

Attached also are the mean and median computed for the top 100 components for all years using the above data, and plotted: mean_counts_per_year.csv median_counts_per_year.csv

I have also updated the scripts on GitHub. Please let me know if you need a call for me to explain further, or we can discuss it at the meeting on Friday.

image