nkiru-ede / MavenNetworkStudy

Indegree Distribution #16

Open jensdietrich opened 3 months ago

jensdietrich commented 3 months ago

The following is not intuitive -- we expected most GAVs to have a very low number of dependents (see also GINI courses). This needs additional QA!

@jensdietrich to reproduce data

image

jensdietrich commented 3 months ago

@nkiru-ede @ulizue here is my analysis (note that my graphs are a bit different as I exclude self-edges). Intervals are open on the left and closed on the right, i.e. 5..10 means >5 and <=10. This looks much more like what I had expected.

@nkiru-ede please sample your data, i.e. take some random GAVs and then manually compute their in-degrees (number of dependents).
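A minimal sketch of such a spot check. It assumes edge rows are comma-separated, quoted triples of source, target and scope (the shape of the links_all.tsv line quoted later in this thread), that an edge means the source depends on the target, and that self-edges are skipped. The function names and the tiny in-memory data are made up for illustration and are not taken from the study's code.

```python
import csv
import random
from collections import Counter


def sample_gavs(vertices, k, seed=0):
    """Pick k GAVs at random; sorting first makes the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(sorted(vertices), k)


def count_dependents(edge_rows, gavs, skip_self_edges=True):
    """Count, for each requested GAV, the edges that point at it."""
    wanted = set(gavs)
    counts = Counter({gav: 0 for gav in wanted})  # keep GAVs with 0 dependents
    for source, target, *_ in edge_rows:
        if skip_self_edges and source == target:
            continue
        if target in wanted:
            counts[target] += 1
    return dict(counts)


# Hypothetical data, parsed the same way a real edge file would be.
rows = list(csv.reader([
    '"a:a:1","b:b:1","Compile"',
    '"c:c:1","b:b:1","Test"',
    '"b:b:1","b:b:1","Compile"',  # self-edge, ignored by default
]))
vertices = {"a:a:1", "b:b:1", "c:c:1"}

picked = sample_gavs(vertices, 2)
print(count_dependents(rows, picked))
```

For a real file, `rows` would come from `csv.reader(open(path, newline=""))`, and the counts can then be compared by hand against the bucket each sampled GAV was assigned to.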

vertex count: 1,855,689
edge count: 8,751,900

in-degree analysis
0: 981506
0..2: 572268
2..3: 78467
3..4: 49605
4..5: 30742
5..10: 66630
10..50: 59718
50..100: 8051
100..500: 6800
500..1000: 997
1000..5000: 771
5000..10000: 87
10000..50000: 47
>50000: 0
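The interval convention used in this comment (open on the left, closed on the right) can be sketched as follows. The bounds are copied from the table; the helper names are hypothetical, and this is only one way to do the binning, not necessarily how the numbers were produced.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the buckets; "0" is its own bucket.
BOUNDS = [0, 2, 3, 4, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000]


def bucket_label(degree, bounds=BOUNDS):
    """Map an in-degree to its bucket: 'lo..hi' means lo < degree <= hi."""
    i = bisect_left(bounds, degree)  # first bound that is >= degree
    if i == 0:
        return "0"
    if i == len(bounds):
        return f">{bounds[-1]}"
    return f"{bounds[i - 1]}..{bounds[i]}"


def histogram(degrees):
    """Count how many in-degrees fall into each bucket."""
    return Counter(bucket_label(d) for d in degrees)


print(histogram([0, 0, 1, 2, 5, 6, 10, 50001]))
```

Note that 0..2 covers in-degrees 1 and 2, which is the same set as the half-open interval [1, 3).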

nkiru-ede commented 3 months ago

@jensdietrich I have reviewed this. However, I am getting different results from yours. Your table shows that there are 981,506 GAVs with 0 dependencies.

Below is what I have:

Dependency Distribution of GAV: count
[0, 1): 0
[1, 3): 694362
[3, 6): 203927
[6, 11): 85103
[11, 51): 70102
[51, 101): 8893
[101, 1000): 8587

image

jensdietrich commented 3 months ago

@nkiru-ede so according to your numbers there are no artifacts without dependents. How about this:

  1. Consider the GAV org.apache.directory.shared:shared-ldap-client-all:1.0.0-M8.
  2. This is a valid vertex (line 67 in release_all.tsv).
  3. This occurs in only one line in links_all.tsv - line 283: "org.apache.directory.shared:shared-ldap-client-all:1.0.0-M8","org.slf4j:slf4j-api:1.6.1","Compile"
  4. This line is a dependency on org.slf4j:slf4j-api:1.6.1, i.e. org.apache.directory.shared:shared-ldap-client-all:1.0.0-M8 has no dependents, and therefore the value for [0, 1) cannot be zero!

Those are very easy to find, you just need to sample the data.

cc @ulizue

nkiru-ede commented 3 months ago

@jensdietrich

Dependency Distribution of GAV:
[0, 1): 893457
[1, 3): 695087
[3, 6): 203617
[6, 11): 84859
[11, 51): 69969
[51, 101): 8878
[101, 1000): 8564
[1000, 5000): 806

Initially, I was checking the edges with the links_all dataset, which is the dataset that contains and maps out the relationship between source (dependant) and target (main*). If you are checking scenarios/dependencies of source, is this not a case of the data not being available in the dataset, or are you saying all artifacts in the dataset are supposed to have dependants?

jensdietrich commented 3 months ago

Sorry @nkiru-ede, I am not following. Can you please rephrase? What do you mean by main here? The latest numbers are reasonable, but I would still like to understand what you changed / did before, to ensure that they are not "accidentally about right". cc @ulizue

nkiru-ede commented 3 months ago

@jensdietrich I mean that initially I was just checking for dependents of the target artifacts, and this results in 0 artifacts with 0 dependencies according to the edge/dependencies dataset (links_all).

The new chart is from when I take into account the dependencies of all artifacts in the dataset (source and target).
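One plausible reading of this explanation, sketched under assumptions: if in-degrees are tallied only for GAVs that appear as a target in the edge list, every counted GAV has at least one dependent, so the zero bucket is empty by construction. Starting from the full vertex list restores it. The thread does not confirm that this is exactly what the earlier analysis did, and the function names and data below are invented for illustration.

```python
from collections import Counter


def indegrees_from_edges_only(edges):
    """Tally in-degrees only for GAVs that occur as an edge target."""
    return Counter(target for _source, target in edges)


def indegrees_with_all_vertices(vertices, edges):
    """Start every known vertex at 0, then add one per incoming edge."""
    degrees = {v: 0 for v in vertices}
    for _source, target in edges:
        degrees[target] = degrees.get(target, 0) + 1
    return degrees


# Hypothetical graph: app depends on lib and util, lib depends on util.
vertices = ["app:1", "lib:1", "util:1"]
edges = [("app:1", "lib:1"), ("app:1", "util:1"), ("lib:1", "util:1")]

edges_only = indegrees_from_edges_only(edges)
full = indegrees_with_all_vertices(vertices, edges)

zero_edges_only = sum(1 for d in edges_only.values() if d == 0)
zero_full = sum(1 for d in full.values() if d == 0)
print(zero_edges_only, zero_full)  # prints "0 1"
```

Here "app:1" never appears as a target, so it is missing from the edges-only tally but shows up with in-degree 0 once all vertices are included.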

jensdietrich commented 3 months ago

@nkiru-ede OK, let's discuss on Friday when we meet; I still don't get it. Is it that you first ignored vertices without edges?