Open jensdietrich opened 2 months ago
@jensdietrich I was able to extract a dependency relationship dataset in the format below: first_10_records.csv. The dataset is over 18GB. However, I struggled to find timestamps, so I tried to scrape them from Maven Central; that process took a very long time and failed to find a large number of the artifacts, so I stopped it.
I have just seen a part of the dump that contains timestamps and I am currently trying to convert the data. It is taking time, I guess because of the size.
I wanted to provide an update on how far I have gone with the data. Will provide another update by COB Tuesday.
thanks @Nkiru, noted. Is there perhaps a schema that describes the structure of the dataset?
@nkiru-ede also, you will need to break up relationshipDetails into columns. Is the target version always a concrete version, or can it be a version constraint (for instance, a range)?
@nkiru-ede answering my own question -- the zenodo dump has a metamodel (= schema); there is a PDF in https://zenodo.org/records/13734581, here: https://zenodo.org/records/13734581/files/metamodel.pdf . This shows timestamps; you just need to define Cypher queries to extract them. CC @ulizue
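For reference, a minimal sketch of how such a timestamp extraction could look with the Python neo4j driver; the `Release` label and the `id`/`timestamp` property names are my reading of the metamodel and should be checked against the PDF, and the connection details are placeholders for a local import of the dump.

```python
from neo4j import GraphDatabase

# Placeholder connection details for a locally restored copy of the Zenodo dump.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema: Release nodes carry an id (g:a:v) and a timestamp property.
# Adjust the label and property names to whatever the metamodel actually defines.
QUERY = """
MATCH (r:Release)
RETURN r.id AS gav, r.timestamp AS timestamp
"""

with driver.session() as session, open("release_all.csv", "w") as out:
    out.write("gav,timestamp\n")
    for record in session.run(QUERY):
        out.write(f"{record['gav']},{record['timestamp']}\n")

driver.close()
```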
@jensdietrich, thanks for sharing. Yes, I saw the timestamp info, which was why I pasted the below earlier. I have extracted the new data, which is about 50GB, and I am now cleaning the data.
thanks @nkiru-ede -- looking forward to seeing how this goes. I suggest doing as much of the analysis as possible with Cypher queries; this gives it a nice declarative touch. Note the meeting invite for tomorrow; I think you haven't accepted it yet.
@jensdietrich I have accepted the invite. The dataset has 119,660,406 records (edges over GAVs), and the release dates/timestamps belong to the dependencies. I have tried to match the dependencies to artifacts, and there are only 175,300 matches.
Will talk about these in the meeting today.
@jensdietrich
You can find the Cypher queries converting the data into links_all and release_all here: cypher queries
The dataset contains 119,660,406 edges/relationships: links_all, release_all
After merging links_all with release_all, 9 GAVs in the source column are without release dates, while 5,104,196 in the target column are without release dates: merged data
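For reference, a sketch of how such a merge could be done in pandas; the column names (`source`, `target`, `gav`, `timestamp`) are assumptions and need to be adjusted to the actual CSV headers.

```python
import pandas as pd

# Column names are assumptions; adjust to the actual headers of the two CSVs.
links = pd.read_csv("links_all.csv")        # one row per dependency edge
releases = pd.read_csv("release_all.csv")   # one row per GAV with its release date

# Attach release dates to both endpoints of every edge.
merged = (links
          .merge(releases.rename(columns={"gav": "source", "timestamp": "source_release"}),
                 on="source", how="left")
          .merge(releases.rename(columns={"gav": "target", "timestamp": "target_release"}),
                 on="target", how="left"))

# Count edges whose endpoints could not be resolved to a release date.
print("sources without release date:", merged["source_release"].isna().sum())
print("targets without release date:", merged["target_release"].isna().sum())
```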
There is no use of dependency ranges in any of the source artifacts; however, the target artifacts include 813,343 GAVs with dependency ranges. I have extracted these into a separate CSV - filtered_version_ranges.zip
How I intend to handle the missing/odd data:
What do you think about the approach?
@nkiru-ede "while 5,104,196 in the target column are without release dates." (I guess you mean that the join with the GAV info cannot be resolved here) vs " the target artifacts has 813,343 gavs with dependency ranges". Note the wording here: there is no such thing as "813,343 gavs with dependency ranges" -- this is basically a "set of GAVs". So if you want to "update the 813,343 with the versions of the source" you need to pick one (or all) from this set. This is easy for one particular pattern ("[
I think it would be ok to document that for <3% of edges the target cannot be resolved, quickly study the reasons, and then ignore them. There must be reasons other than dependency ranges (they explain only approx. 20%), so what are they? I have an idea, but it would be good if you do some data sampling and find out independently.
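A possible way to do that sampling, sketched with pandas; `merged.csv` and the column names are placeholders for the joined edges-with-release-dates table.

```python
import pandas as pd

# Placeholder file/column names for the merged edges-with-release-dates table.
merged = pd.read_csv("merged.csv")
unresolved = merged[merged["target_release"].isna()]

# A small random sample is usually enough to spot the dominant failure modes
# (e.g. version ranges, unresolved ${...} properties, targets absent from the dump).
print(unresolved.sample(n=50, random_state=0)[["source", "target"]].to_string())
```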
cc @ulizue
@jensdietrich
the 813,343 targets with dependency ranges are without release dates. I have looked at the other 4M+ to see if there is a pattern, e.g. whether target release dates are missing for specific source versions, or whether the target release is null where the source release is null, but I could not see any pattern.
I have removed those rows from the final dataset and we are left with 114,556,210 edges.
I have started replicating the experiments with the new dataset and updating the charts and tables in a folder 'newData' in the latex document.
@jensdietrich this is the regular expression pattern I used to extract dependency ranges -
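The pattern itself did not come through above, so purely for illustration (this is not necessarily the regex actually used): Maven range syntax opens with `[` or `(` and closes with `]` or `)`, which can be detected along these lines.

```python
import re

# Illustrative pattern only, not the one used in this issue: a Maven version
# range opens with '[' or '(' and closes with ']' or ')', e.g. "[1.0,2.0)",
# "(,1.0]", "[1.2]", or a union such as "(,1.0],[1.2,)".
RANGE_RE = re.compile(r"^[\[\(].*[\]\)]$")

def is_version_range(version: str) -> bool:
    """Return True if the target version string looks like a Maven range."""
    return bool(RANGE_RE.match(version.strip()))

assert is_version_range("[1.0,2.0)")
assert is_version_range("(,1.0]")
assert not is_version_range("1.2.3")
```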
@jensdietrich I have uploaded the latest plots on overleaf - will discuss this further in the meeting.
Some highlights: below is the innovation chart - note: out of the 14,459,139 GAVs, there are 206,485 with unusual version formats
The plots below show the ratio of first GA to last GA:
- In 2004, the ratio is 2.0, meaning there were twice as many new GAs introduced (6) as GAs that were last seen (3).
- In 2018, with a ratio of about 1.23, the counts of new (47,334) and retired (38,365) GAs are close, suggesting a balanced churn between new and outgoing GAs.
- In 2023, with a ratio of 0.88, more GAs were last seen (76,540) than first introduced (67,690), which could imply that more GAs reached end-of-life or became obsolete in that year.
- Higher ratios in the early years (e.g., 2002 to 2010) imply rapid growth in the addition of new GAs, with more entering than exiting.
- A decrease in the ratio can be seen between 2023 and 2024, but we already discussed why this is the case and hence why the last 3 years are removed from the last-GA plot.
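For reference, a sketch of how the per-year first/last GA counts and their ratio could be derived from the release data; the file name, column names, and timestamp format are assumptions.

```python
import pandas as pd

# Assumed input: one row per GAV with a release timestamp.
rel = pd.read_csv("release_all.csv", parse_dates=["timestamp"])
# If the timestamp is epoch milliseconds instead, use:
# rel["timestamp"] = pd.to_datetime(rel["timestamp"], unit="ms")

rel["ga"] = rel["gav"].str.rsplit(":", n=1).str[0]   # drop the version from g:a:v
rel["year"] = rel["timestamp"].dt.year

# First and last year in which each GA published a release.
first_ga = rel.groupby("ga")["year"].min().value_counts().sort_index()
last_ga = rel.groupby("ga")["year"].max().value_counts().sort_index()

# Ratio of newly introduced GAs to GAs last seen, per year.
print((first_ga / last_ga).dropna())
```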
Why does the Innovation Major Version Curve only start in 2005?
Discussed today (@nkiru-ede @ulizue ):
@jensdietrich Major release to last GA: there are 141,214 GAs with single-version releases. I plotted these events after removing these GAs, and the overall pattern seems consistent with before, except for the beginning of lastGA. ![Uploading innovationGA_minusSingleGAs.png…]() The effect can be seen in the innovation charts below:
Ratio of replacement: this is computed as len(new_GA) / len(current_top_N) * 100 for N in {10, 100, 500}, where new_GA is the set of GAs in the current top N that are not in the previous top N.
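A minimal sketch of that computation (the GA identifiers below are made up for the example):

```python
def replacement_ratio(previous_top_n: list[str], current_top_n: list[str]) -> float:
    """Percentage of the current top-N GAs that were not in the previous top-N."""
    new_ga = set(current_top_n) - set(previous_top_n)
    return len(new_ga) / len(current_top_n) * 100

# Toy example with N = 5: two of the current elites are newcomers -> 40.0
prev = ["g1:a1", "g2:a2", "g3:a3", "g4:a4", "g5:a5"]
curr = ["g1:a1", "g2:a2", "g3:a3", "g6:a6", "g7:a7"]
print(replacement_ratio(prev, curr))
```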
I also checked the ratio of renaming; there is no clear pattern amongst the elites:
For background, see: