Open jensdietrich opened 2 months ago
@jensdietrich I was able to extract a dependency relationship dataset in the format below: first_10_records.csv. The dataset is over 18GB. However, I struggled to find timestamps, so I tried to scrape them from Maven Central; that process took a very long time and failed to find a large number of the artifacts, so I stopped it.
I have just seen a part of the dump that contains timestamps and I am currently trying to convert the data. It is taking time, I guess because of the size.
I wanted to provide an update on how far I have gone with the data. Will provide another update by COB Tuesday.
thanks @Nkiru, noted. Is there perhaps a schema that describes the structure of the dataset?
@nkiru-ede also, you will need to break up relationshipDetails into columns. Is the target version always a concrete version, or can it be a version constraint (for instance, a range)?
@nkiru-ede answering my own question -- the zenodo dump has a metamodel (= schema); there is a PDF in https://zenodo.org/records/13734581, here: https://zenodo.org/records/13734581/files/metamodel.pdf . This shows timestamps; you just need to define Cypher queries to extract them. CC @ulizue
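For reference, a minimal sketch of how such a timestamp extraction could look with the Python neo4j driver; the `Release` label and the `id`/`timestamp` property names are my reading of the metamodel and should be checked against the PDF, and the connection details are placeholders for a local import of the dump.

```python
from neo4j import GraphDatabase

# Placeholder connection details for a locally restored copy of the Zenodo dump.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema: Release nodes carry an id (g:a:v) and a timestamp property.
# Adjust the label and property names to whatever the metamodel actually defines.
QUERY = """
MATCH (r:Release)
RETURN r.id AS gav, r.timestamp AS timestamp
"""

with driver.session() as session, open("release_all.csv", "w") as out:
    out.write("gav,timestamp\n")
    for record in session.run(QUERY):
        out.write(f"{record['gav']},{record['timestamp']}\n")

driver.close()
```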
@jensdietrich, thanks for sharing. Yes, I saw the timestamp info, which was why I pasted the below earlier. I have extracted the new data, which is about 50GB, and I am now cleaning the data.
thanks @nkiru-ede -- looking forward to seeing how this goes. I suggest doing as much of the analysis as possible with Cypher queries; this gives it a nice declarative touch. Note the meeting invite for tomorrow; I think you haven't accepted it yet.
@jensdietrich I have accepted the invite. The dataset has 119,660,406 records (edges over GAVs), and the release dates/timestamps belong to the dependencies. I have tried to match the dependencies to artifacts, and there are only 175,300 matches.
Will talk about these in the meeting today.
@jensdietrich
You can find the Cypher queries converting the data into links_all and release_all here: cypher queries
The dataset contains 119,660,406 edges/relationships: links_all, release_all
After merging links_all with release_all, 9 GAVs in the source column are without release dates, while 5,104,196 in the target column are without release dates: merged data
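For reference, a sketch of how such a merge could be done in pandas; the column names (`source`, `target`, `gav`, `timestamp`) are assumptions and need to be adjusted to the actual CSV headers.

```python
import pandas as pd

# Column names are assumptions; adjust to the actual headers of the two CSVs.
links = pd.read_csv("links_all.csv")        # one row per dependency edge
releases = pd.read_csv("release_all.csv")   # one row per GAV with its release date

# Attach release dates to both endpoints of every edge.
merged = (links
          .merge(releases.rename(columns={"gav": "source", "timestamp": "source_release"}),
                 on="source", how="left")
          .merge(releases.rename(columns={"gav": "target", "timestamp": "target_release"}),
                 on="target", how="left"))

# Count edges whose endpoints could not be resolved to a release date.
print("sources without release date:", merged["source_release"].isna().sum())
print("targets without release date:", merged["target_release"].isna().sum())
```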
There is no use of dependency ranges in any of the source artifacts; however, the target artifacts include 813,343 GAVs with dependency ranges. I have extracted these into a separate CSV - filtered_version_ranges.zip
How I intend to handle the missing/odd data:
What do you think about the approach?
@nkiru-ede "while 5,104,196 in the target column are without release dates." (I guess you mean that the join with the GAV info cannot be resolved here) vs " the target artifacts has 813,343 gavs with dependency ranges". Note the wording here: there is no such thing as "813,343 gavs with dependency ranges" -- this is basically a "set of GAVs". So if you want to "update the 813,343 with the versions of the source" you need to pick one (or all) from this set. This is easy for one particular pattern ("[
I think it would be ok to document that for <3% of edges the target cannot be resolved, quickly study the reasons, and then ignore them. There must be reasons other than dependency ranges (they explain only approx. 20%), so what are they? I have an idea, but it would be good if you do some data sampling and find out independently.
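A possible way to do that sampling, sketched with pandas; `merged.csv` and the column names are placeholders for the joined edges-with-release-dates table.

```python
import pandas as pd

# Placeholder file/column names for the merged edges-with-release-dates table.
merged = pd.read_csv("merged.csv")
unresolved = merged[merged["target_release"].isna()]

# A small random sample is usually enough to spot the dominant failure modes
# (e.g. version ranges, unresolved ${...} properties, targets absent from the dump).
print(unresolved.sample(n=50, random_state=0)[["source", "target"]].to_string())
```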
cc @ulizue
@jensdietrich
the 813,343 targets with dependency ranges are without release dates. I have looked at the other 4M+ to see if there is a pattern, e.g. whether target release dates are missing for specific source versions, or whether the target release is null where the source release is null, but I could not see any pattern.
I have removed those rows from the final dataset and we are left with 114,556,210 edges.
I have started replicating the experiments with the new dataset and updating the charts and tables in a folder 'newData' in the latex document.
@jensdietrich this is the regular expression pattern I used to extract dependency ranges -
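The pattern itself did not come through above, so purely for illustration (this is not necessarily the regex actually used): Maven range syntax opens with `[` or `(` and closes with `]` or `)`, which can be detected along these lines.

```python
import re

# Illustrative pattern only, not the one used in this issue: a Maven version
# range opens with '[' or '(' and closes with ']' or ')', e.g. "[1.0,2.0)",
# "(,1.0]", "[1.2]", or a union such as "(,1.0],[1.2,)".
RANGE_RE = re.compile(r"^[\[\(].*[\]\)]$")

def is_version_range(version: str) -> bool:
    """Return True if the target version string looks like a Maven range."""
    return bool(RANGE_RE.match(version.strip()))

assert is_version_range("[1.0,2.0)")
assert is_version_range("(,1.0]")
assert not is_version_range("1.2.3")
```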
@jensdietrich I have uploaded the latest plots on overleaf - will discuss this further in the meeting.
Some highlights: below is the innovation chart - note: out of the 14,459,139 GAVs, there are 206,485 with unusual version formats
The plots below show the ratio of first GA to last GA:
- In 2004, the ratio is 2.0, meaning there were twice as many new GAs introduced (6) as GAs that were last seen (3).
- In 2018, with a ratio of about 1.23, the counts of new (47,334) and retired (38,365) GAs are close, suggesting a balanced churn between new and outgoing GAs.
- In 2023, with a ratio of 0.88, more GAs were last seen (76,540) than first introduced (67,690), which could imply that more GAs reached end-of-life or became obsolete in that year.
- Higher ratios in the early years (e.g., 2002 to 2010) imply rapid growth in the addition of new GAs, with more entering than exiting.
- A decrease in the ratio can be seen between 2023 and 2024, but we already discussed why this is the case and hence why the last 3 years are removed from the last-GA plot.
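For reference, a sketch of how the per-year first/last GA counts and their ratio could be derived from the release data; the file name, column names, and timestamp format are assumptions.

```python
import pandas as pd

# Assumed input: one row per GAV with a release timestamp.
rel = pd.read_csv("release_all.csv", parse_dates=["timestamp"])
# If the timestamp is epoch milliseconds instead, use:
# rel["timestamp"] = pd.to_datetime(rel["timestamp"], unit="ms")

rel["ga"] = rel["gav"].str.rsplit(":", n=1).str[0]   # drop the version from g:a:v
rel["year"] = rel["timestamp"].dt.year

# First and last year in which each GA published a release.
first_ga = rel.groupby("ga")["year"].min().value_counts().sort_index()
last_ga = rel.groupby("ga")["year"].max().value_counts().sort_index()

# Ratio of newly introduced GAs to GAs last seen, per year.
print((first_ga / last_ga).dropna())
```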
Why does the Innovation Major Version Curve only start in 2005?
Discussed today (@nkiru-ede @ulizue ):
@jensdietrich Major release to last GA: there are 141,214 GAs with single-version releases. I plotted these events after removing these GAs, and the overall pattern seems consistent with before, except for the beginning of lastGA. ![Uploading innovationGA_minusSingleGAs.png…]() The effect can be seen in the innovation charts below:
Ratio of replacement: this is computed as len(new_GA) / len(current_top_N) * 100 for N in {10, 100, 500}, where new_GA is the set of GAs in the current top N that are not in the previous top N.
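A minimal sketch of that computation (the GA identifiers below are made up for the example):

```python
def replacement_ratio(previous_top_n: list[str], current_top_n: list[str]) -> float:
    """Percentage of the current top-N GAs that were not in the previous top-N."""
    new_ga = set(current_top_n) - set(previous_top_n)
    return len(new_ga) / len(current_top_n) * 100

# Toy example with N = 5: two of the current elites are newcomers -> 40.0
prev = ["g1:a1", "g2:a2", "g3:a3", "g4:a4", "g5:a5"]
curr = ["g1:a1", "g2:a2", "g3:a3", "g6:a6", "g7:a7"]
print(replacement_ratio(prev, curr))
```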
I also checked the ratio of renaming; there is no clear pattern amongst the elites:
For background, see: