Closed by carlosparadis 1 year ago
The performance bottleneck lies here: https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L84
cell_indices is a numeric vector that the code iterates over as a loop index:
https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L41
The code is performing a row-by-row look-up:
https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L43-L61
And then once more when edges need to be duplicated:
https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L63-L81
This is fine for the ~100k-edge edgelists that SDSMs usually have, but it will not scale to the roughly 7 million edges an HDSM normally has.
This also raises another question: when merged, should the analysis of HDSMs include files not present in the SDSM (since they may have been renamed or deleted)?
Commit c2a1a82 vectorizes most of the steps, but the code still runs too slowly. The next step is to try performing the projection directly in graph_to_dsmj.
In the end, the projection function was left unmodified. Here's what was done:
This removed getCell and the equivalent reverse-edge re-run, consolidating them into a single pass with some vectorization (getCell iterated via lapply over an index, hence was not vectorized). src_index and dest_index were also moved outside the iterator.
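The reverse-edge consolidation can be sketched in base R. This is an illustrative toy, not Kaiaulu's actual code: the column names src, dest, and weight are assumptions. Instead of re-running a per-cell lookup for the reverse direction, all reverse edges are built in one vectorized step by swapping the endpoint columns and binding once.

```r
# Toy edge list standing in for the projected co-change graph
# (column names are illustrative, not Kaiaulu's actual schema).
edges <- data.frame(src    = c("a.c", "a.c"),
                    dest   = c("b.c", "c.c"),
                    weight = c(2L, 1L))

# Build every reverse edge at once by swapping the endpoint columns,
# then bind the two directions in a single rbind call.
reverse    <- data.frame(src    = edges$dest,
                         dest   = edges$src,
                         weight = edges$weight)
undirected <- rbind(edges, reverse)
```

The same idea applies with data.table (which Kaiaulu uses): one column swap and one bind replace a second full iteration over the edge list.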
This also moves the local dcast calls to a single global dcast, cutting execution to one third of the time above.
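The move from per-iteration casts to one global cast can be sketched in base R. Kaiaulu uses data.table's dcast; xtabs here is only an illustrative stand-in, and the edge list is a made-up toy:

```r
# Toy weighted edge list (illustrative column names).
edges <- data.frame(src    = c("a.c", "a.c", "b.c"),
                    dest   = c("b.c", "c.c", "c.c"),
                    weight = c(2, 1, 3))

# Cast the whole edge list into a dense matrix once, up front,
# rather than reshaping a subset inside every loop iteration.
dsm <- xtabs(weight ~ src + dest, data = edges)
dsm["a.c", "b.c"]  # 2
```

After the single cast, each cell lookup inside the loop becomes a plain matrix index, which is far cheaper than repeating the reshape.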
Even with all the optimizations, the function still ran indefinitely, including overnight. The actual fix was adding the function filter_by_commit_size(), although the vectorization still helps. The problem was edge explosion during the projection of the git log.
Migration commits like this one in Geronimo: https://github.com/apache/geronimo/commit/b60bf0a205e0257cb3279b08fb6c8d48bc7ce67a create a large number of outlier edges when projected. In the commit above, 1522 files were changed in a single commit; when calculating co-change, this leads to 1522 choose 2 = 1,157,481 edges. The filter function removes these commits soon after project_git is parsed to prevent that.
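The quadratic blow-up is easy to verify: a commit touching n files contributes choose(n, 2) co-change pairs.

```r
# Edge count for a single commit touching n files grows as n choose 2.
n_files <- 1522
n_edges <- choose(n_files, 2)  # 1522 * 1521 / 2
n_edges  # 1157481
```

So a single 1522-file migration commit alone produces more edges than an entire typical SDSM.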
In Geronimo's case, the pipeline did not finish because each of 216 commits changed at least 30 files, producing a grand total of over 7 million edges. Removing those commits reduced the edge count to 70k and the execution time from never finishing to under 30 minutes. The cut-off seems to lie somewhere around 500k to 1 million edges on my local machine. The function could surely be parallelized, but DV8 seems to take the same approach, and theoretically it does not make sense to count migration commits as co-change.
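The filtering idea can be sketched as follows. This is a hypothetical illustration, not Kaiaulu's actual filter_by_commit_size() implementation: the function name here, the threshold default, and the commit_hash column name are all assumptions.

```r
# Hypothetical sketch: drop every commit whose file count exceeds a
# threshold before projecting co-change. The commit_hash column name
# is an assumption for illustration.
filter_large_commits <- function(project_git, threshold = 30) {
  files_per_commit <- table(project_git$commit_hash)
  keep <- names(files_per_commit[files_per_commit <= threshold])
  project_git[project_git$commit_hash %in% keep, , drop = FALSE]
}

# Example: commit "m1" touches 3 files and is dropped with threshold = 2,
# so only the row for commit "c2" survives.
git_log  <- data.frame(commit_hash = c("m1", "m1", "m1", "c2"),
                       file        = c("a.c", "b.c", "c.c", "a.c"))
filtered <- filter_large_commits(git_log, threshold = 2)
```

Filtering on the parsed git log table keeps the quadratic projection from ever seeing the outlier commits, which is why it helps where vectorization alone did not.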
The new filter function was added to two config files, but it should be added to the rest, and to the other notebooks, before merging this branch to master.
It should be very easy to manually check the effect of the filter function, or more generally to inspect project_git to see which commits are large, and whether they are migrations or something else, on a per-project basis.
The issue likely lies in the internal projection of graph_to_dsmj. In the meantime, an alternative solution is: implement a dsmj_to_graph (parse_dependencies is basically that), use DV8's git log parser, load the result into R, apply the filter, then use graph_to_dsmj.