@jensdietrich
I looked at artifact retention in the top 100 of the current year compared with the top 100 of the previous year:
Artifact retention in top 100 of current year compared with top 200 of previous year:
Artifact retention in top 100 of current year compared with top 300 of previous year:
I ranked artifacts by the number of years they were present and used a heatmap to show the movement of the top 10 artifacts across the years (note: the heatmap doesn't show all artifacts, for readability). Heatmap of the top 100 (you may have to zoom in on the image):
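For reference, a minimal sketch of how such a retention comparison can be computed, assuming a per-year ranking table with columns year, ga and rank (the file name and column names are illustrative, not the actual pipeline):

```python
import pandas as pd

# assumed input: one row per (year, GA) with its rank in that year
ranks = pd.read_csv("ga_ranks_per_year.csv")  # columns: year, ga, rank

def retention(current_year: int, top_current: int = 100, top_previous: int = 100) -> float:
    """Fraction of the current year's top-N GAs that already appeared in the
    previous year's top-M."""
    cur = set(ranks[(ranks.year == current_year) & (ranks["rank"] <= top_current)].ga)
    prev = set(ranks[(ranks.year == current_year - 1) & (ranks["rank"] <= top_previous)].ga)
    return len(cur & prev) / len(cur) if cur else 0.0

# e.g. top 100 of 2020 vs top 100 / 200 / 300 of 2019
for m in (100, 200, 300):
    print(m, retention(2020, 100, m))
```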
I specifically looked at three artifacts - log4j (vs dom4j), JUnit and Hibernate. JUnit appears to be a top contributor from when it and other similar libraries entered the ecosystem until the end of the observed period.
The same can be seen for log4j:
Hibernate is different: the original artifact, org.hibernate:hibernate, seems to have fallen off. I am investigating this in the repository, as it seems some Hibernate artifacts have been moved to other GAs.
@nkiru-ede thanks for this - the charts look great, but they do not tell us too much IMO - I might be wrong though, to be discussed tomorrow.
I think to have a contribution we must go beyond observing these effects on the surface and explain what happens to those artifacts. This is where the patterns I suggested come in, i.e. categorise the movements according to those categories:
I don't think that this can be done without looking at those artifacts individually (doable for the top 100), perhaps with the help of some string matching and analysing the TC (transitive closure).
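For illustration, a sketch of the kind of purely syntactic matching mentioned here (string similarity over GA names); as noted further down in the thread, this approach produces false positives and false negatives. The GA names are just examples:

```python
from difflib import SequenceMatcher
from itertools import combinations

gas = ["junit:junit", "org.junit.jupiter:junit-jupiter", "jdom:jdom", "org.jdom:jdom2"]

candidates = []
for a, b in combinations(gas, 2):
    s = SequenceMatcher(None, a, b).ratio()
    if s > 0.6:  # arbitrary threshold
        candidates.append((a, b, s))

# jdom:jdom ~ org.jdom:jdom2 passes the threshold, while the JUnit 4 -> 5 pair
# scores below it, illustrating the false-negative problem of name matching.
for a, b, s in sorted(candidates, key=lambda t: -t[2]):
    print(f"{a}  ~  {b}  ({s:.2f})")
```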
@jensdietrich
As discussed in the meeting -- top 10 / top 100 are good; extend the bars to both sides with "being in top 200" (in a different, lighter colour) and "being in the repository" (even lighter).
@nkiru-ede commenting on your earlier response: potential_renames.csv does contain many FPs, examples:
There are many more, and also FNs (these are more difficult to spot; JUnit 4 > 5 is probably one), so this method does not work. By looking at the mapping info in the repo we get the actual semantic info; name matching works only on the syntactic level.
Please use the method described above; this should not be too difficult.
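A minimal sketch of this repo-based approach, assuming the rename information is read from the standard distributionManagement/relocation element of the old artifact's POM (the file name is illustrative):

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def relocation_target(pom_path: str):
    """Return (groupId, artifactId) the POM relocates to, or None if there is
    no relocation element. Elements omitted in <relocation> default to the
    old coordinates in Maven, so None here means 'unchanged'."""
    root = ET.parse(pom_path).getroot()
    reloc = root.find("m:distributionManagement/m:relocation", NS)
    if reloc is None:
        return None
    gid = reloc.findtext("m:groupId", default=None, namespaces=NS)
    aid = reloc.findtext("m:artifactId", default=None, namespaces=NS)
    return gid, aid

# e.g. the POM of an old Hibernate artifact would point to its new coordinates
print(relocation_target("hibernate-3.2.7.ga.pom"))
```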
@jensdietrich I plotted some artifacts; in some of the cases, the top artifacts entered the top 100 in their year of release.
What does the red cross vs the double cross mean? Can we combine them in one chart (using different visual codes for top 100 vs top 200)? Does this contain the aliasing info?
@jensdietrich I will try to combine them; no, this does not contain the aliasing info. I am still working on that. I am not sure I will be ready to show the results by Friday's meeting.
@jensdietrich, I encountered a few challenges trying to bypass mvnrepo's security. I have been able to resolve those and have extracted the new GAs and their first release for the top 100 artifacts across years: Top100_NewGA.csv. Below is a visualization of some of them. The chart shows, for each GA, the year it was released, the years it was in the top 100 and the year it was relocated.
I need more time for further analysis - such as comparing when the relocated GA joined the top 100, top 200, ... top 300, and whether that coincides with the year the original GA fell off - and perhaps to scrape more data and do more analysis.
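A sketch of that follow-up check, assuming a relocation table and a per-year top-100 table with the column names shown (all names are assumptions, not the actual files):

```python
import pandas as pd

reloc = pd.read_csv("Top100_NewGA.csv")       # assumed columns: GA_OLD, GA_NEW, relocation_year
top100 = pd.read_csv("top100_per_year.csv")   # assumed columns: year, ga

last_year_in_top100 = top100.groupby("ga")["year"].max()   # last year a GA was still in the top 100
first_year_in_top100 = top100.groupby("ga")["year"].min()  # first year a GA entered the top 100

reloc["old_last_year"] = reloc["GA_OLD"].map(last_year_in_top100)
reloc["new_first_year"] = reloc["GA_NEW"].map(first_year_in_top100)

# does the old GA drop out around the relocation, and the new GA take over afterwards?
reloc["old_drops_at_relocation"] = reloc["relocation_year"] >= reloc["old_last_year"]
reloc["new_enters_at_relocation"] = reloc["new_first_year"] >= reloc["relocation_year"]
print(reloc.head())
```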
@nkiru-ede I just had a look at the data file; a few questions and notes:
CC @ulizue
@jensdietrich
Process:
I have cleaned up the data further so it contains only the relevant columns - attached is the data for the top 500 in each year: top_500_per_year_GA_latest.csv
From the data above, it appears none of the GAs has been renamed twice.
Dependencies to all GAs are counted, resulting in the top 100, 200, 500, etc.
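A sketch of this counting step, assuming a flat dependency table with one row per (year, dependent, target GA); file and column names are illustrative:

```python
import pandas as pd

deps = pd.read_csv("dependencies.csv")  # assumed columns: year, source_gav, target_ga

# count incoming dependencies per GA per year, then rank within each year
counts = (deps.groupby(["year", "target_ga"])
              .size()
              .reset_index(name="dep_count"))
counts["rank"] = counts.groupby("year")["dep_count"].rank(method="first", ascending=False)

top100 = counts[counts["rank"] <= 100]
top500 = counts[counts["rank"] <= 500]
```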
@nkiru-ede results look good and the methodology is sound, more or less what we had discussed! I suggest tweaking the format a bit - some of the column names don't make much sense to me. I think having two columns GA_OLD and GA_NEW would be sufficient; we will get everything else we might need (first GAV of GA_NEW, release dates and years) from other tables, so we can just do joins during the analysis.
It would be interesting now to redraw the lifecycle charts with this data, also adding the rename events. Please make sure that we count dependencies to both the old and the new GAs, in particular for the first years after the release of GA_NEW.
CC @ulizue
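A sketch of the suggested merge, assuming the two-column rename table (GA_OLD, GA_NEW) and the per-year counts produced earlier; the file and column names are assumptions:

```python
import pandas as pd

renames = pd.read_csv("NEW_GA_OLD_GA.csv")      # assumed columns: GA_OLD, GA_NEW
counts = pd.read_csv("ga_counts_per_year.csv")  # assumed columns: year, ga, dep_count

# map every old GA onto its new GA; GAs that were never renamed stay unchanged
old_to_new = dict(zip(renames["GA_OLD"], renames["GA_NEW"]))
counts["canonical_ga"] = counts["ga"].map(old_to_new).fillna(counts["ga"])

# sum the counts of old and new GA per year, then re-rank
merged = counts.groupby(["year", "canonical_ga"], as_index=False)["dep_count"].sum()
merged["rank"] = merged.groupby("year")["dep_count"].rank(method="first", ascending=False)
```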
A quick reminder that the last discussion was actually for a different issue, #21 (now closed); we still have some work to do here (copying text from the original issue description):
@jensdietrich
On Renaming
csv of the old GA / new GA relationship: NEW_GA_OLD_GA.csv
csv showing the different dependency counts for old and new GAs (this was used to determine the top components): Old_New_GA_Counted.csv
csv with updated release dates: OldNEWGA_final.csv
Typical scenario where an old GA enters the top in the same year:
A unique situation where the old and new GA have the same name - this is the only such case I have seen in the data.
I plotted a few top components:
@jensdietrich regarding "Projects splitting into modules and also splitting popularity": I feel that, in a way, the relocation detail we already have speaks to that. My understanding of Maven is that when projects are split or renamed, the artifacts are often relocated.
@jensdietrich regarding "Make a direct dependency a deep dependency of a popular direct dependency" (we discussed an example last time, I think it was log4j; using the transitive closure graph will help here: in the GA-TC graph the respective GA would remain popular):
I used the transitive GA graph to check popular GAs, and the popular GAs remain popular:
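A sketch of such a GA-TC check using networkx; the edges are made up for illustration, not taken from the dataset:

```python
import networkx as nx

g = nx.DiGraph()
# edge (a, b) means "a depends on b"
g.add_edges_from([
    ("app:web", "org.example:framework-core"),
    ("org.example:framework-core", "log4j:log4j"),  # log4j as a deep dependency
    ("app:batch", "log4j:log4j"),                    # log4j as a direct dependency
])

def transitive_dependents(ga: str) -> int:
    """Number of GAs that reach `ga` directly or transitively, i.e. its
    dependent count in the transitive-closure (GA-TC) graph."""
    return len(nx.ancestors(g, ga))

# counts both direct and deep dependents, so a GA pushed 'one level down'
# by a popular wrapper keeps its popularity in the GA-TC graph
print(transitive_dependents("log4j:log4j"))
```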
@jensdietrich
I used boxes to show the different GA events (release date, when it was in the top 100 / top 500, new GA):
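A sketch of this kind of event chart using matplotlib's broken_barh; the sample GAs and years are made up:

```python
import matplotlib.pyplot as plt

# made-up sample data: release year, (first, last) year in the top 100 and top 500
gas = {
    "junit:junit": {"release": 2005, "top100": (2006, 2023), "top500": (2005, 2023)},
    "log4j:log4j": {"release": 2005, "top100": (2005, 2015), "top500": (2005, 2018)},
}

fig, ax = plt.subplots()
for i, (ga, ev) in enumerate(gas.items()):
    s5, e5 = ev["top500"]
    s1, e1 = ev["top100"]
    ax.broken_barh([(s5, e5 - s5 + 1)], (i - 0.3, 0.6), facecolors="lightgrey")  # years in top 500
    ax.broken_barh([(s1, e1 - s1 + 1)], (i - 0.2, 0.4), facecolors="steelblue")  # years in top 100
    ax.plot(ev["release"], i, "k^")                                              # release year marker
ax.set_yticks(range(len(gas)))
ax.set_yticklabels(list(gas))
ax.set_xlabel("year")
plt.show()
```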
I grouped the GAs into different categories to see how they lose/gain popularity (top 100):
Data Processing GAs:
Testing and QA:
Logging and monitoring:
Java specs:
Core utilities:
I also compared the GAs in the top 100 (according to when they were released):
@nkiru-ede some interesting results here -- in particular the decline of XML and the rise of JSON tell a good story. There are a few other popular libs in the meta category, like snakeyaml (for YAML) -- how do those fit in?
Re the testing category -- I assume the mocking framework is Mockito -- there are now two very popular artifacts, mockito-core and mockito-all. Have a look at this discussion: https://www.baeldung.com/mockito-core-vs-mockito-all. I.e. this is a different distribution of Mockito that shades/bundles some dependencies. This is another pattern we could add to https://github.com/nkiru-ede/MavenNetworkStudy/issues/20#issuecomment-2372579610, i.e. when we count dependencies, those should probably be handled as synonyms.
CC @ulizue
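A sketch of the "handle as synonyms" idea: normalise alternative distributions of the same project onto one canonical GA before counting (the alias table is illustrative, not a complete mapping):

```python
# illustrative alias table: alternative distribution -> canonical GA
ALIASES = {
    "org.mockito:mockito-all": "org.mockito:mockito-core",
}

def canonical(ga: str) -> str:
    """Map an alternative distribution onto its canonical GA; leave others unchanged."""
    return ALIASES.get(ga, ga)

# apply before aggregating dependency counts, e.g.
# deps["target_ga"] = deps["target_ga"].map(canonical)
```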
We discussed this briefly last week. When analysing the top 100 GAs, we need some charts that, for each year, show the number of GAs retained from the previous year and the number of new ones.
We also need some data on how fast the rises are – perhaps how many of the new ones in a given year were in the 100-200, 200-300, etc. bracket the year before.
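A sketch of both the retained-vs-new breakdown and the bracket-of-origin question, reusing the per-year rank table assumed in the earlier sketches (file and column names are assumptions):

```python
import pandas as pd

ranks = pd.read_csv("ga_ranks_per_year.csv")  # assumed columns: year, ga, rank

def top(year: int, n: int) -> set:
    """GAs ranked within the top n in the given year."""
    return set(ranks[(ranks.year == year) & (ranks["rank"] <= n)].ga)

def breakdown(year: int):
    cur, prev = top(year, 100), top(year - 1, 100)
    retained, new = cur & prev, cur - prev
    # which bracket of the previous year did the new entrants come from?
    brackets = {"100-200": len(new & (top(year - 1, 200) - prev)),
                "200-300": len(new & (top(year - 1, 300) - top(year - 1, 200)))}
    brackets["elsewhere/new"] = len(new) - sum(brackets.values())
    return len(retained), len(new), brackets

print(breakdown(2020))
```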
For the ones declining, we need to analyse them and look for some patterns (which might all represent false positives, i.e. artifacts that only appear to decline). I can think of the following:
Renaming (should be easy to detect as the new names are similar), examples: junit > junit.org, jdom > jdom2, etc.
Make a direct dependency a deep dependency of a popular direct dependency; we discussed an example last time, I think it was log4j. Using the transitive closure graph will help here: in the GA-TC graph the respective GA would remain popular.
Projects splitting into modules and also splitting popularity.
Others (please come up with some).