nkiru-ede / MavenNetworkStudy


issues in dataset #10

Open jensdietrich opened 3 weeks ago

jensdietrich commented 3 weeks ago

I independently built a graph from the Zenodo dataset using JGraphT, with a directed acyclic graph as the target structure. There are three types of errors:

  1. adding a new edge would induce a cycle
  2. adding a new edge would induce a loop (not sure how this is different from 1)
  3. an edge references a GAV as source or target that has no timestamp

@nkiru-ede can you please investigate based on the log below, and come up with a short paragraph discussing a concrete example? I have some idea what happens for 3, but I am puzzled by 1 and 2.

error adding edge org.bytedeco.javacpp-presets:flandmark:1.07-1.2 -> org.bytedeco.javacpp-presets:flandmark:1.07-1.2 [Compile] , details: loops not allowed
error adding edge org.bytedeco.javacpp-presets:gsl:2.1-1.2 -> org.bytedeco.javacpp-presets:gsl:2.1-1.2 [Compile] , details: loops not allowed
error adding edge org.bytedeco:javacpp-presets:1.2 -> org.bytedeco:javacpp-presets:1.2 [Compile] , details: loops not allowed
error adding edge org.bytedeco.javacpp-presets:chilitags:master-1.2 -> org.bytedeco.javacpp-presets:chilitags:master-1.2 [Compile] , details: loops not allowed
error adding edge org.bytedeco.javacpp-presets:mxnet:master-1.2 -> org.bytedeco.javacpp-presets:mxnet:master-1.2 [Compile] , details: loops not allowed
no dependency type found (will use default other): )
no timestamp found for null in line ""org.javapos:javapos-controls:1.6.0","org.javapos:javapos-contracts:1.6.[0,)","Compile"" - will ignore dependencies
error adding edge org.bytedeco.javacpp-presets:openblas:0.2.19-1.2 -> org.bytedeco.javacpp-presets:openblas:0.2.19-1.2 [Compile] , details: loops not allowed
error adding edge de.sonia.portal.system:sonia-portal-system:1.0.1 -> de.sonia.portal:sonia-portal-parent-base:3 [Compile] , details: Edge would induce a cycle
error adding edge net.quasardb:qdb:0.0.2 -> net.quasardb:qdb:0.0.2 [Runtime] , details: loops not allowed
error adding edge net.quasardb:qdb:0.0.2 -> net.quasardb:qdb:0.0.2 [Compile] , details: loops not allowed
error adding edge net.quasardb:qdb:0.0.1 -> net.quasardb:qdb:0.0.1 [Runtime] , details: loops not allowed
error adding edge net.quasardb:qdb:0.0.1 -> net.quasardb:qdb:0.0.1 [Compile] , details: loops not allowed
error adding edge org.webjars.npm:global-modules:1.0.0 -> org.webjars.npm:resolve-dir:1.0.1 [Compile] , details: Edge would induce a cycle
errors adding edges: 872
vertex count: 1965365
edge count: 9699490
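
As a rough illustration of how error types 1 and 2 differ, here is a minimal sketch of a DAG that refuses edges on insertion. It is not JGraphT's implementation; `SimpleDag` and `classify` are hypothetical names, and the error messages only mirror the wording in the log above. A loop is an edge whose source equals its target, so it can be detected without looking at the rest of the graph, whereas a cycle-inducing edge is only detectable by checking whether the target already reaches the source.

```python
from collections import defaultdict


class SimpleDag:
    """Toy directed acyclic graph that rejects loops and cycle-inducing edges."""

    def __init__(self):
        self.successors = defaultdict(set)

    def _reaches(self, start, goal):
        # Iterative depth-first search: is `goal` reachable from `start`?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.successors.get(node, ()))
        return False

    def add_edge(self, source, target):
        if source == target:
            raise ValueError("loops not allowed")
        if self._reaches(target, source):
            raise ValueError("Edge would induce a cycle")
        self.successors[source].add(target)


def classify(edges):
    """Return (accepted, loops, cycles) counts for a list of (source, target) edges."""
    dag = SimpleDag()
    accepted = loops = cycles = 0
    for source, target in edges:
        try:
            dag.add_edge(source, target)
            accepted += 1
        except ValueError as error:
            if "loops" in str(error):
                loops += 1
            else:
                cycles += 1
    return accepted, loops, cycles


sample = [
    ("a:1", "b:1"),                                        # hypothetical GAVs
    ("b:1", "a:1"),                                        # closes a cycle a -> b -> a
    ("net.quasardb:qdb:0.0.2", "net.quasardb:qdb:0.0.2"),  # self-loop from the log
]
print(classify(sample))  # (1, 1, 1)
```

Note that which edge of a cycle gets rejected depends on insertion order: the first edges are accepted, and only the edge that closes the cycle is refused.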
nkiru-ede commented 3 weeks ago

@jensdietrich

  1. I did a cycle check using NetworkX. The cycle error count is 755.

This is because, according to the dataset, the target artifact also depends on the source artifact. This is true for 755 of the relationships.

error adding edge de.sonia.portal.system:sonia-portal-system:1.0.1 -> de.sonia.portal:sonia-portal-parent-base:3 [Compile] , details: Edge would induce a cycle

In your log above, de.sonia.portal:sonia-portal-parent-base:3 also depends on de.sonia.portal.system:sonia-portal-system:1.0.1.
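
The claim that the target also depends directly on the source can be checked by looking for reciprocal edges. The sketch below is one possible way to do this in plain Python; `reciprocal_pairs` is a hypothetical helper, and the sample edges are illustrative (only the sonia pair is taken from this thread). Longer cycles such as a -> b -> c -> a contain no reciprocal pair, so if the number of pairs found this way is smaller than the number of rejected edges, some cycles must be longer than two.

```python
def reciprocal_pairs(edges):
    """Return the set of unordered pairs {a, b} where both a -> b and b -> a exist."""
    edge_set = {(s, t) for s, t in edges if s != t}  # self-loops are ignored here
    return {frozenset((s, t)) for s, t in edge_set if (t, s) in edge_set}


system = "de.sonia.portal.system:sonia-portal-system:1.0.1"
parent = "de.sonia.portal:sonia-portal-parent-base:3"
sample_edges = [
    (system, parent),
    (parent, system),                      # reverse edge, as described above
    ("x", "y"), ("y", "z"), ("z", "x"),    # hypothetical 3-cycle: no reciprocal pair
    ("q", "q"),                            # hypothetical self-loop: ignored
]
pairs = reciprocal_pairs(sample_edges)
print(len(pairs))  # 1
```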

  2. Loop check: count = 117. These are situations where the source and target artifacts are the same.

e.g.: error adding edge org.bytedeco.javacpp-presets:openblas:0.2.19-1.2 -> org.bytedeco.javacpp-presets:openblas:0.2.19-1.2 [Compile] , details: loops not allowed

error adding edge org.bytedeco.javacpp-presets:flandmark:1.07-1.2 -> org.bytedeco.javacpp-presets:flandmark:1.07-1.2 [Compile] , details: loops not allowed

and all the other scenarios in your log

  3. Missing timestamps in the dataset: the count of missing timestamps for either source or target is 0. I could not find cases where the source or target artifact has no release date after merging the links_all and release_all datasets.
image
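
For comparison, a missing-timestamp check can also be done as a plain lookup instead of a merge. The sketch below assumes links are (source, target, scope) triples and release dates are a mapping from GAV to timestamp; `links_missing_timestamps` is a hypothetical helper and the timestamp value is a placeholder, not data from the dataset. The sample link is the one from the log earlier in this thread, where the target is a version range rather than a concrete version. If the merge was an inner join, such rows would be dropped before counting; whether that explains the count of 0 is an assumption that has not been verified here.

```python
def links_missing_timestamps(links, release_dates):
    """Return links whose source or target GAV has no entry in release_dates."""
    missing = []
    for source, target, scope in links:
        absent = [gav for gav in (source, target) if gav not in release_dates]
        if absent:
            missing.append((source, target, scope, absent))
    return missing


# Placeholder timestamp: the real value is not given in this thread.
release_dates = {"org.javapos:javapos-controls:1.6.0": "<placeholder timestamp>"}
links = [
    ("org.javapos:javapos-controls:1.6.0",
     "org.javapos:javapos-contracts:1.6.[0,)",   # version range, not a release
     "Compile"),
]
for source, target, scope, absent in links_missing_timestamps(links, release_dates):
    print(f"no timestamp for {absent} in link {source} -> {target} [{scope}]")
```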
jensdietrich commented 3 weeks ago

@nkiru-ede thanks for this -- how do other papers deal with cycles? We could just exclude them from the further analysis. The self-cycle seems to be wrong -- see https://mvnrepository.com/artifact/org.bytedeco.javacpp-presets/openblas/0.2.19-1.2 -- perhaps the dataset has errors?

nkiru-ede commented 3 weeks ago

Yes, this is according to the dataset -

image image

I scanned through the 49 papers on Google Scholar that used or in one way or another mentioned (cited) the dataset, and none seems to mention any irregularities with the dataset:
https://scholar.google.com/scholar?start=0&hl=en&as_sdt=2005&sciodt=0,5&cites=1640956409657446853&scipsc=

@jensdietrich

jensdietrich commented 2 weeks ago

To investigate the cycle issues further, we need to complete the following two steps:

  1. how many cycles / loops are there - compute the number of maximal SCCs (strongly connected components) and their sizes (histogram)
  2. find out why projects create circular dependencies (look at poms, project logs / commits, stackoverflow, tutorials etc) - for instance this one - here loops seem to be caused by a dependency on self with a special classifier. How does this work during a build? Look at build instructions or try it.
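
Step 1 could be sketched as follows. This is a minimal pure-Python version of Kosaraju's algorithm, written iteratively so that it does not hit the recursion limit on a graph with millions of vertices; the function names and the sample edges are hypothetical. The histogram maps SCC size to the number of SCCs of that size and skips size-1 components, so self-loops would have to be counted separately.

```python
from collections import Counter, defaultdict


def strongly_connected_components(edges):
    """Kosaraju's algorithm (iterative); returns a list of sets of nodes."""
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)
        nodes.update((source, target))

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    finished, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(forward.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(forward.get(nxt, ()))))
                    break
            else:
                # All successors explored: this node is finished.
                stack.pop()
                finished.append(node)

    # Pass 2: sweep the reversed graph in decreasing finish order.
    components, assigned = [], set()
    for root in reversed(finished):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for prev in backward.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def scc_size_histogram(edges):
    """Map SCC size -> number of SCCs of that size, skipping size-1 components."""
    sizes = (len(c) for c in strongly_connected_components(edges))
    return Counter(size for size in sizes if size > 1)


sample_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),   # SCC of size 3
    ("c", "d"),                           # bridge, not part of any cycle
    ("d", "e"), ("e", "d"),               # SCC of size 2
    ("f", "f"),                           # self-loop, size-1 component
]
print(dict(scc_size_histogram(sample_edges)))
```

NetworkX also provides `strongly_connected_components`, which is likely the more convenient option on the full dataset.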
nkiru-ede commented 2 weeks ago

There is one SCC with 887 nodes.

image

image

There are 755 cycles and 117 loops.

For the component highlighted above [net.quasardb:qdb:0.0.2], I was able to build the project successfully, and likewise for [net.quasardb:qdb:0.0.1].

image

The same holds for the other loops: they build successfully, so I would say that this doesn't affect the build process.

Artifacts with circular dependencies:

I created a Maven project and added two classes that depend on each other.

image

This compiled successfully too, without throwing any errors related to circular dependencies. So I would again note that this does not affect the build process, at least for small projects.

image

I also ran compile and dependency:analyze on this case where a circular dependency was detected - [de.sonia.portal.system:sonia-portal-system:1.0.1 -> de.sonia.portal:sonia-portal-parent-base:3] - and these were fine too, even though some articles have pointed to this being problematic for larger projects.

As for why people create circular dependencies: the reasons I see appear to be poor project management and code quality.

As you mentioned, I also saw that developers sometimes use JNI classifiers to declare resources needed by various platforms/environments, for network-based operations, or to wrap existing libraries:

https://stackoverflow.com/questions/50422690/what-is-the-real-world-application-of-java-native-interface

https://softwareengineering.stackexchange.com/questions/210278/is-rewriting-some-java-code-to-c-using-jni-to-improve-performance-a-good-idea

https://www.researchgate.net/publication/324143862_Evaluating_the_Java_Native_Interface_JNI_Leveraging_Existing_Native_Code_Libraries_and_Threads_to_a_Running_Java_Virtual_Machine

@jensdietrich

jensdietrich commented 6 days ago

@nkiru-ede thanks for the analysis -- re: "There are 755 cycles and 117 loops" -- what does this mean? We basically have SCCs and should count the number of maximal SCCs, then report their sizes (perhaps this is what the histogram is). How does 755/117 relate to the histogram? The histogram shows 1 (?) very large SCC of >800 artifacts. How does this come about?
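
One observation that may help here: 755 + 117 = 872, which matches the "errors adding edges: 872" line in the log at the top of this issue. This suggests (an assumption, not yet confirmed) that 755 and 117 count rejected edges rather than cycles or SCCs. The sketch below illustrates why the two measures differ; `count_rejected` is a hypothetical helper and the graphs are toy examples. A single SCC can account for one rejected edge or for many, depending on how densely its members depend on each other.

```python
from itertools import permutations


def count_rejected(edges):
    """Insert edges one by one into a DAG; return how many are refused."""
    successors, rejected = {}, 0

    def reaches(start, goal):
        # Iterative depth-first search over the edges accepted so far.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return False

    for source, target in edges:
        # Also true for self-loops, because start == goal there.
        if reaches(target, source):
            rejected += 1
        else:
            successors.setdefault(source, set()).add(target)
    return rejected


ring = [("a", "b"), ("b", "c"), ("c", "a")]   # one SCC of size 3
clique = list(permutations("abcd", 2))        # one SCC of size 4, 12 edges

print(count_rejected(ring))    # 1
print(count_rejected(clique))  # 6
```

Both toy graphs consist of exactly one SCC, yet they produce 1 and 6 rejected edges respectively, so the rejected-edge count alone says little about how many SCCs exist or how large they are.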

TBC ..