nkiru-ede / MavenNetworkStudy

Create GA by Year Stats #13

Open jensdietrich opened 1 month ago

jensdietrich commented 1 month ago

For GAs, create two timelines (GA counts by year), based on different ways to calculate the timestamp for a GA.

  1. a GA is counted for a given year if any GAV for this GA is released in this year
  2. a GA is counted for a given year only if the first GAV for this GA is released in this year

This gives us some interesting data to see maintenance activities.

Include unit tests for the calculation, and create the respective charts (two curves in one chart).
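The two counting rules above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the input layout (one `(ga, version, release_year)` tuple per GAV), the function name `ga_counts_by_year`, and the sample data are all assumptions.

```python
from collections import defaultdict

# Hypothetical input: one (ga, version, release_year) tuple per GAV.
gavs = [
    ("org.a:lib", "1.0", 2009),
    ("org.a:lib", "1.1", 2010),
    ("org.a:lib", "1.2", 2010),
    ("org.b:tool", "0.1", 2010),
]


def ga_counts_by_year(gavs):
    """Return (any_release, first_release) dicts mapping year -> GA count."""
    # Collect the distinct release years of every GA.
    years_by_ga = defaultdict(set)
    for ga, _version, year in gavs:
        years_by_ga[ga].add(year)

    any_release = defaultdict(int)    # rule 1: any GAV released in the year
    first_release = defaultdict(int)  # rule 2: first GAV released in the year
    for years in years_by_ga.values():
        for year in years:
            any_release[year] += 1
        first_release[min(years)] += 1
    return dict(any_release), dict(first_release)


any_release, first_release = ga_counts_by_year(gavs)
print(any_release)
print(first_release)
```

Under rule 1 a GA with several releases in the same year is still counted once for that year, which is why the years are collected in a set first.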

jensdietrich commented 1 month ago

@ulizue fyi

nkiru-ede commented 1 month ago

@jensdietrich done. Charts are included in the figures folder: GA_by_year_stats

jensdietrich commented 1 month ago

@nkiru-ede thanks -- did you write test cases? If not, please do, and add the link to the tests here as well.

nkiru-ede commented 1 month ago

@jensdietrich test/TestGA2Compute.py

jensdietrich commented 1 month ago

@nkiru-ede the structure of the tests still has issues. You basically embed a copy of the computation in the test script (compute_counts(df)), but this should live in the main script that computes the stats, and the tests should just reference it instead of cloning it. Cloning causes problems when you change the scripts but forget to update the copies: the tests then give you wrong results.

You could describe your code as WET as opposed to DRY (see https://en.wikipedia.org/wiki/Don%27t_repeat_yourself for what this means).
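One way the structure described above might look, shown as a single-file sketch so it runs on its own. In a repository layout the function would live in the stats script and the test module would import it rather than redefine it. The function name `count_gas_by_release_year`, the `(ga, release_year)` input layout, and the test data are hypothetical.

```python
import unittest


# In a real layout this function would be defined once, in the script that
# computes the stats; the test module would import it instead of copying it.
def count_gas_by_release_year(gavs):
    """Count, per year, the GAs with at least one GAV released in that year."""
    seen = {(ga, year) for ga, year in gavs}  # drop repeated (GA, year) pairs
    counts = {}
    for _ga, year in seen:
        counts[year] = counts.get(year, 0) + 1
    return counts


class TestGaCounts(unittest.TestCase):
    def test_ga_with_two_releases_in_a_year_is_counted_once(self):
        gavs = [("g:a", 2010), ("g:a", 2010), ("g:b", 2010), ("g:a", 2011)]
        self.assertEqual(count_gas_by_release_year(gavs), {2010: 2, 2011: 1})


# Run the suite programmatically so the example does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGaCounts)
result = unittest.TextTestRunner().run(suite)
```

Because the test calls the production function directly, a later change to the computation is exercised by the test automatically, with no second copy to keep in sync.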

nkiru-ede commented 1 month ago

@jensdietrich done

jensdietrich commented 1 month ago

thanks @nkiru-ede !

jensdietrich commented 1 month ago

@nkiru-ede as discussed today, please also add the GAs for which we observed the last release in the given year. We don't need to display data for the last two years, as this might be heavily biased. Re-opening the issue might be easier than opening a new one.
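A minimal sketch of the last-release count with the two most recent years dropped. The function name, the `drop_last_years` parameter, and the `(ga, release_year)` input layout are assumptions for illustration. One plausible reason for the bias (an interpretation, not stated in this thread) is that a GA that is still maintained necessarily has its latest release near the end of the dataset.

```python
from collections import Counter


def last_release_counts(gavs, drop_last_years=2):
    """Count GAs by the year of their last observed release.

    The most recent `drop_last_years` years in the data are omitted from
    the result.
    """
    if not gavs:
        return {}
    last_year = {}
    for ga, year in gavs:
        last_year[ga] = max(year, last_year.get(ga, year))
    counts = Counter(last_year.values())
    cutoff = max(year for _ga, year in gavs) - drop_last_years
    return {year: n for year, n in sorted(counts.items()) if year <= cutoff}


# Hypothetical sample data: (ga, release_year) per GAV.
gavs = [
    ("g:a", 2008), ("g:a", 2009),
    ("g:b", 2009),
    ("g:c", 2010), ("g:c", 2012),
    ("g:d", 2011),
]
last_counts = last_release_counts(gavs)
print(last_counts)
```

Here the data ends in 2012, so 2011 and 2012 are excluded and only the two GAs last released in 2009 remain.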

nkiru-ede commented 1 month ago

@jensdietrich this is done. I have updated the images in the Overleaf document. I also added charts that include the last two years, for discussion during the next meeting.

jensdietrich commented 1 month ago

Thanks @nkiru-ede (also @ulizue, please have a look at Overleaf). @nkiru-ede, did you test this? The numbers look very high; this would mean that new GAs are introduced at almost the same rate as they are abandoned. Or does the log scale misrepresent this a bit? Could you perhaps email us versions with a linear scale, just to see the difference? If we can confirm this, it is a really interesting insight IMO.

I suggest we also study the number of GAVs released per year and present this in boxcharts. We can do this for all GAs and for the top GAs (say the top 100, as discussed). This would give us some interesting data about the correlation between maintenance and popularity.
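A rough sketch of the per-year data such a boxchart would summarize: for each year, the list of GAV counts per GA. The function name, input layout, and sample data are assumptions. Selecting the top GAs is left out because the popularity measure is not defined in this thread.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical input: one (ga, version, release_year) tuple per GAV.
gavs = [
    ("g:a", "1.0", 2010), ("g:a", "1.1", 2010), ("g:a", "1.2", 2010),
    ("g:b", "1.0", 2010),
    ("g:c", "1.0", 2010), ("g:c", "2.0", 2010),
    ("g:a", "2.0", 2011),
]


def gav_counts_per_ga_by_year(gavs):
    """Map each year to the sorted list of GAV counts, one entry per GA."""
    per_ga_year = Counter((ga, year) for ga, _version, year in gavs)
    by_year = defaultdict(list)
    for (_ga, year), n in per_ga_year.items():
        by_year[year].append(n)
    return {year: sorted(counts) for year, counts in sorted(by_year.items())}


distributions = gav_counts_per_ga_by_year(gavs)
medians = {year: median(counts) for year, counts in distributions.items()}
print(distributions)
print(medians)
```

Note that a GA with no release in a given year does not appear in that year's list, so each distribution describes only the GAs that were active in that year.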

jensdietrich commented 1 month ago

For the boxcharts we now have a separate issue, #17.

The other data in this chart:

[image: chart]

@nkiru-ede could you please take a snapshot for one year (say 2010) with the three lists of GAs that produce the three datapoints for this year? Just a plain file listing one GA (not GAV) per line will do. This is the final QA step here.
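A small sketch of how the three lists for a single year could be derived for this kind of check. The function name `ga_lists_for_year`, the dictionary keys, the `(ga, release_year)` input layout, and the sample data are hypothetical.

```python
from collections import defaultdict


def ga_lists_for_year(gavs, year):
    """Return the three GA lists behind the three datapoints of one year."""
    years_by_ga = defaultdict(set)
    for ga, release_year in gavs:
        years_by_ga[ga].add(release_year)
    items = years_by_ga.items()
    return {
        # any GAV of the GA was released in this year
        "all": sorted(ga for ga, ys in items if year in ys),
        # the first GAV of the GA was released in this year
        "earliest": sorted(ga for ga, ys in items if min(ys) == year),
        # the last observed GAV of the GA was released in this year
        "latest": sorted(ga for ga, ys in items if max(ys) == year),
    }


# Hypothetical sample data: (ga, release_year) per GAV.
gavs = [
    ("g:a", 2009), ("g:a", 2010), ("g:a", 2011),
    ("g:b", 2010),
    ("g:c", 2010), ("g:c", 2012),
    ("g:d", 2008), ("g:d", 2010),
]
snapshot = ga_lists_for_year(gavs, 2010)
for name, gas in snapshot.items():
    print(f"{name}: {len(gas)} GAs")
    print("\n".join(gas))  # one GA per line, as in a plain list file
```

A quick consistency check on such a snapshot is that the "earliest" and "latest" lists are both subsets of the "all" list, and that a GA with a single release year appears in all three.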

nkiru-ede commented 1 month ago

@jensdietrich please review the attached.

AllGA_2010.txt earliest_GA_2010.txt Latest_GA_2010.txt