top 100 analysis - Githubissues

nkiru-ede commented 2 months ago

We discussed this briefly last week. When analysing the top 100 GAs, we need some charts that show, for each year, the number of GAs retained from the previous year and the number of new ones.

We also need some data on how fast the risers are: perhaps how many of the new ones in a given year were in the 100-200, 200-300 etc. bracket the year before.
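A minimal sketch of the retained/new and bracket counts described above, assuming each year's ranking is available as a list of GA names ordered from most to least popular. The function name `origin_brackets`, its parameters and the bracket labels are hypothetical and not taken from the project code.

```python
from collections import Counter

def origin_brackets(prev_ranking, curr_ranking, top=100, width=100):
    """For each GA in the current top `top`, report where it ranked the year before.

    Returns a Counter keyed by "retained" (already in the previous top),
    a bracket label such as "101-200", or "unranked" (absent the year before).
    """
    # 1-based rank of every GA in the previous year's ranking
    prev_rank = {ga: i + 1 for i, ga in enumerate(prev_ranking)}
    counts = Counter()
    for ga in curr_ranking[:top]:
        rank = prev_rank.get(ga)
        if rank is None:
            counts["unranked"] += 1
        elif rank <= top:
            counts["retained"] += 1
        else:
            # lower bound of the bracket that contains this rank
            lo = ((rank - 1) // width) * width + 1
            counts[f"{lo}-{lo + width - 1}"] += 1
    return counts
```

Calling this once per pair of consecutive years would give the numbers for a stacked bar chart (retained, one segment per bracket, unranked).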

For the ones declining, we need to analyse them and look for some patterns (which might all represent false positives). I can think about the following:

Renaming (should be easy to detect as new names are similar), example: junit > junit.org, jdom > jdom2 etc
Make a direct dependency a deep dependency of a popular direct dependency, we discussed an example last time, I think it was log4j. Using the transitive closure graph will help here: in the GA-TC graph the respective GA would remain popular
Projects splitting into modules and also splitting popularity
Others (please come up with some)

nkiru-ede commented 2 months ago

@jensdietrich

I looked at artifact retention in top 100 of current year compared with top 100 of previous year:
Artifact_retention
Artifact retention in top 100 of current year compared with top 200 of previous year:
Artifact_retention_top100vs200
Artifact retention in top 100 of current year compared with top 300 of previous year:
Artifact_retention_top100vs300

Ranked artifacts by years present:
rankedArtifacts_byYearsPresent
Used a heatmap to show the movement of the top 10 artifacts across the years (note: the heatmap doesn't show all artifacts, for readability):
HeatMap_Movement_of_elites
Heatmap of top 100 (you may have to zoom in on the image):
HeatMap_Movement_of_elites_top100_2

I specifically looked at these 3 artifacts: log4jvsdom4j, Junit and hibernate. Junit appears to be a top contributor from the time it and other similar libraries entered the ecosystem until the end of the observed period.
junitVSdom4j_top100

Same can be seen for log4j:

Heatmap_LogArtifacts
But it is different for hibernate, where the original artifact, org.hibernate:hibernate, seems to have fallen off. I am investigating this from the repo, as it seems some hibernate artifacts have been moved elsewhere.
hibernate_top100

jensdietrich commented 2 months ago

@nkiru-ede thanks for this - the charts look great, but they do not tell us too much IMO - I might be wrong though, to be discussed tomorrow.

I think to have a contribution we must go past observing those effects on the surface, and explain what happens to those artifacts. This is where the patterns I suggested come in, i.e. categorise movements according to those categories:

Renaming (should be easy to detect as new names are similar), example: junit > junit.org, jdom > jdom2 etc
Make a direct dependency a deep dependency of a popular direct dependency, we discussed an example last time, I think it was log4j. Using the transitive closure graph will help here: in the GA-TC graph the respective GA would remain popular
Projects splitting into modules and also splitting popularity
Others (please come up with some)

I don't think that this can be done without looking at those (doable for the top-100), perhaps with the help of some string matching and analysing the TC.

nkiru-ede commented 2 months ago

@jensdietrich

I did fuzzy string matching on the aggregated artifacts in the top 100 across the years to identify potential renamings based on similarity scores. I have attached the result below; 8.05% of the compared artifacts resulted in a match.

potential_renames.csv
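The thread does not say which library or similarity metric was used for the fuzzy matching. As an illustration only, the sketch below swaps in `difflib.SequenceMatcher` from the Python standard library; the function name `rename_candidates` and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def rename_candidates(gas, threshold=0.8):
    """Return (ga1, ga2, score) for every pair of GA names whose
    character-level similarity is at least `threshold`."""
    pairs = []
    # compare every unordered pair of distinct names once
    for a, b in combinations(sorted(set(gas)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs
```

Because the score is purely syntactic, two distinct artifacts that share a long prefix (for example two modules of the same project) pass the threshold just like a genuine rename would, so every candidate pair still needs a manual or metadata-based check.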

I used the transitive dependency graph to check the top 100 GAs and, as with the direct dependencies, the top GAs remained top.
I also did something similar to check potential splits: artifact_splits.csv

jensdietrich commented 2 months ago

As discussed in the meeting: top10 / top100 are good. Extend the bars to both sides with "being in top-200" (in a different, lighter colour) and "being in the repository" (even lighter).

jensdietrich commented 2 months ago

@nkiru-ede commenting on your earlier response: potential_renames.csv does contain many FPs , examples:

There are many more, and also FNs (they are more difficult to spot, probably junit 4>5 is one), so this method does not work. By looking at the mapping info in the repo, we get the actual semantic info; name matching is only on the syntax level.

Please use the method described above, this should not be too difficult.

nkiru-ede commented 2 months ago

@jensdietrich I plotted some artifacts. In some cases, the top artifacts entered the top 100 in their year of release.

jensdietrich commented 2 months ago

What does this red cross vs double cross mean? Can we combine them in one chart (using different visual codes for top100 vs top200)? Does this contain the aliasing info?

nkiru-ede commented 2 months ago

@jensdietrich I will try to combine them. No, this does not contain the aliasing info; I am still working on that. I am not sure I will be ready to show the results by Friday's meeting.

jensdietrich commented 2 months ago

@nkiru-ede thanks -- parsing mvnrep data should not be difficult. In Java I would use jsoup to parse responses, but I am sure there are similar libraries that make it equally easy for other languages.

nkiru-ede commented 2 months ago

@jensdietrich, I encountered a few challenges trying to bypass mvnrepo's security. I have been able to resolve those and extracted the new GAs and their first release for the top 100 artifacts across years: Top100_NewGA.csv
Below is the visualization of some of them. The chart shows, for each GA, the year it was released, the years it was in the top 100 and the year it was relocated.

I need more time to do other analysis, such as comparing when the relocated GA joined the top 100, top 200, top 300 etc., and whether that coincides with the year the original GA fell off. And perhaps scrape more data and do more analysis.

jensdietrich commented 2 months ago

@nkiru-ede I just had a look at the datafile, a few questions and notes:

please describe the process in your own words in a short paragraph (3 sentences max) to be re-used in paper - perhaps add as md into a notes/ folder in project
what does Merged_New_GA mean (column name in csv) ? I think that this is the new GA, but why "merged" ?
New_GA_release_year column has timestamps not years
perhaps clean this up -- with columns like GA1, GA2, plus respective first release years
dependency_count column is also a bit confusing, as we would need this per year , so for our analysis those values would not be used (please confirm this!)
are there cases where we experience two renamings of the same artifact ? I.e. have you checked this, and if so how
have you tested that after a renaming event, dependencies to both the old and the new GA are counted ? The old GA still attracts dependencies ! A good example is junit (junit4), there are still new projects today many years after the junit5 (jupiter) release using it ! We discussed this last week but I just want to make sure this is what you did.
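One possible way to check for repeated renamings of the same artifact is to follow the old-to-new links and report any chain with more than one step. This is a sketch under the assumption that the renamings are available as (old GA, new GA) pairs; the function name `rename_chains` is hypothetical.

```python
def rename_chains(renames):
    """Follow old -> new links and return chains with two or more renamings.

    `renames` is a list of (old_ga, new_ga) pairs. Pairs where both names
    are identical are ignored. A pure cycle with no starting GA is not reported.
    """
    next_ga = {old: new for old, new in renames if old != new}
    targets = set(next_ga.values())
    chains = []
    for start in next_ga:
        if start in targets:
            continue  # this GA is itself a rename target, so not the head of a chain
        chain, seen = [start], {start}
        # walk forward until there is no further rename (or a loop is hit)
        while chain[-1] in next_ga and next_ga[chain[-1]] not in seen:
            chain.append(next_ga[chain[-1]])
            seen.add(chain[-1])
        if len(chain) > 2:
            chains.append(chain)
    return chains
```

An empty result means no GA in the input was renamed more than once.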

CC @ulizue

nkiru-ede commented 2 months ago

@jensdietrich

Process:

Configured Selenium WebDriver to use Chrome
For each GA in the CSV, searched the Maven Repository (https://mvnrepository.com///)
Searched for relocation information within the page - "//div[b[contains(text(), 'This artifact was moved to:')]]//a"
Retrieved the new artifact name if it exists and wrote it to a new column in the CSV (New_GA); otherwise, filled the column with ‘No relocation’
Searched the Maven Central repository (https://repo1.maven.org/maven2//maven-metadata.xml) for the artifacts in the New_GA column, parsed the XML and retrieved the earliest version
Merged New_GA and version into a GAV
Matched the GAV with the release_all dataset to retrieve release dates
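The metadata step above can be sketched as follows. This is not the script used in the study: the helper names are hypothetical, the URL is built from the standard Maven repository layout (group ID dots replaced by slashes), and the sketch assumes that the first `<version>` entry in `maven-metadata.xml` is the earliest release, which should be verified against release dates.

```python
import xml.etree.ElementTree as ET

def metadata_url(ga, base="https://repo1.maven.org/maven2"):
    """Build the maven-metadata.xml URL for a 'groupId:artifactId' string."""
    group_id, artifact_id = ga.split(":")
    return f"{base}/{group_id.replace('.', '/')}/{artifact_id}/maven-metadata.xml"

def first_listed_version(metadata_xml):
    """Return the first <version> under <versioning><versions>, or None.

    Assumption: the file lists versions in release order.
    """
    root = ET.fromstring(metadata_xml)
    versions = [v.text for v in root.findall("./versioning/versions/version")]
    return versions[0] if versions else None

# Hand-written sample in the shape of a maven-metadata.xml file (not real data)
SAMPLE = """<metadata>
  <groupId>org.example</groupId>
  <artifactId>demo</artifactId>
  <versioning>
    <versions>
      <version>1.0</version>
      <version>1.1</version>
    </versions>
  </versioning>
</metadata>"""

ga = "org.example:demo"
gav = f"{ga}:{first_listed_version(SAMPLE)}"  # GA merged with version
```

In practice the XML text would be downloaded from `metadata_url(ga)` before parsing; the download is left out here so the sketch runs offline.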

I have cleaned up the data more to contain only the relevant columns - attached is data for top 500 in each year. top_500_per_year_GA_latest.csv

From the data above, it appears that none of the GAs has been renamed twice.

Dependencies to all GAs are counted, resulting in the top 100, 200, 500, etc.

jensdietrich commented 2 months ago

@nkiru-ede results look good and methodology is sound, more or less what we had discussed! I suggest tweaking the format a bit - some of the column names don't make much sense to me. I think having two columns GA_OLD and GA_NEW would be sufficient, we will get everything else we might need (first GAV of GA_NEW, release dates and years) from other tables, so we can just do joins during the analysis.

It would be interesting now to redraw the lifecycle charts with this data, also adding the rename events. Please make sure that we count dependencies to both the old and the new GAs, in particular for the first years after the release of GA_NEW.
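A minimal sketch of counting dependencies to both the old and the new GA, assuming per-year dependency counts keyed by GA and a GA_OLD to GA_NEW mapping. The data structures and the function names `merged_counts` and `top_n` are hypothetical; only single-step renames are handled.

```python
from collections import Counter

def merged_counts(deps_per_year, renames):
    """Sum the dependency counts of GA_OLD and GA_NEW under GA_NEW, per year.

    deps_per_year: {year: {ga: dependency_count}}
    renames:       {ga_old: ga_new}
    """
    merged = {}
    for year, counts in deps_per_year.items():
        total = Counter()
        for ga, n in counts.items():
            # dependencies on an old GA are credited to its new name
            total[renames.get(ga, ga)] += n
        merged[year] = total
    return merged

def top_n(counts, n):
    """Return the n GAs with the highest counts, most popular first."""
    return [ga for ga, _ in Counter(counts).most_common(n)]
```

With this merge, an artifact that keeps attracting dependencies under its old name after the rename does not appear to drop out of the ranking merely because its popularity is split across two names.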

CC @ulizue

jensdietrich commented 2 months ago

A quick reminder that the last discussion was actually for a different issue #21 (now closed), we still have some work to do here (copying text from original issue description):

[x] Renaming (should be easy to detect as new names are similar), example: junit > junit.org, jdom > jdom2 etc
[ ] Make a direct dependency a deep dependency of a popular direct dependency, we discussed an example last time, I think it was log4j. Using the transitive closure graph will help here: in the GA-TC graph the respective GA would remain popular
[ ] Projects splitting into modules and also splitting popularity
[ ] Others (please come up with some)

nkiru-ede commented 2 months ago

@jensdietrich

On Renaming

csv of old GA/new GA relationship: NEW_GA_OLD_GA.csv

csv showing the different dependency counts for old and new GA (this was used to determine top components): Old_New_GA_Counted.csv

csv with updated release dates: OldNEWGA_final.csv

Typical scenario where an old GA enters the top in the same year:

Unique situation where the old and new GA have the same name (this is the only such case I have seen in the data):

I plotted a few top components:

nkiru-ede commented 2 months ago

@jensdietrich for "Projects splitting into modules and also splitting popularity", I feel that, in a way, the relocation detail we already have speaks to that. My understanding of Maven is that when projects are split or renamed, they are often relocated.

nkiru-ede commented 2 months ago

@jensdietrich Make a direct dependency a deep dependency of a popular direct dependency, we discussed an example last time, I think it was log4j. Using the transitive closure graph will help here: in the GA-TC graph the respective GA would remain popular

I used the transitive GA graph to check popular GAs, and popular GAs remain popular:
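To illustrate why the transitive closure matters for the "direct dependency becomes a deep dependency" pattern, here is a small sketch that counts dependents over the closure of a GA graph. The graph representation and the function name `transitive_dependents` are assumptions, not the project's actual GA-TC implementation.

```python
from collections import Counter

def transitive_dependents(direct):
    """Count, for each GA, how many GAs depend on it directly or transitively.

    direct: {ga: set of GAs it depends on directly}
    """
    def closure(start):
        # depth-first walk over everything reachable from `start`
        seen, stack = set(), list(direct.get(start, ()))
        while stack:
            ga = stack.pop()
            if ga not in seen:
                seen.add(ga)
                stack.extend(direct.get(ga, ()))
        seen.discard(start)  # ignore self-dependency introduced by cycles
        return seen

    counts = Counter()
    for ga in direct:
        counts.update(closure(ga))
    return counts
```

If projects switch from depending on a logging library directly to depending on a facade that itself depends on that library, the library's direct count falls while its count in the closure stays the same.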

nkiru-ede commented 2 months ago

@jensdietrich

I used boxes to show the different GA events (release date, when in top 100, top 500, new GA)

I grouped the different categories to see how they lose/gain popularity (top 100):
Data Processing GAs: cat1
Testing and QA: cat2
Logging and monitoring: cat3
Java specs: cat4
Core utilities: cat5

I also compared the GAs in top 100(according to when they were released):

jensdietrich commented 2 months ago

@nkiru-ede some interesting results here -- in particular the decline of xml and rise of json tells a good story. There are a few other popular libs in the meta category , like snakeyaml (for yaml), how do those fit in ?

Re testing category -- I assume the mocking framework is mockito -- there are now two very popular artifacts, mockito-core and mockito-all. Have a look at this discussion: https://www.baeldung.com/mockito-core-vs-mockito-all . I.e. this is a different distribution of mockito that shades / bundles some dependencies. This is another pattern we could add to https://github.com/nkiru-ede/MavenNetworkStudy/issues/20#issuecomment-2372579610 , i.e. when we count dependencies, those should probably be handled as synonyms.

CC @ulizue
