sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

graph_to_dsmj runs too slow #209

Closed carlosparadis closed 1 year ago

carlosparadis commented 1 year ago

The issue likely lies in the internal projection performed by the graph_to_dsmj function. In the meantime, an alternative solution is:

Implement a dsmj_to_graph function (parse_dependencies is basically that), use DV8's gitlog parser, load the result into R, apply the filter, then use graph_to_dsmj.

carlosparadis commented 1 year ago

The performance bottleneck lies here: https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L84

cell_indices is a numeric vector that drives the loop:

https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L41

The code is performing a row-by-row look-up:

https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L43-L61

And then once more when the edges need to be duplicated:

https://github.com/sailuh/kaiaulu/blob/b91fce027a09c641389265427697b977f132d46e/R/graph.R#L63-L81
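To illustrate the pattern (an illustrative sketch, not the actual graph.R code; the table and column names below are made up), a per-edge lookup inside lapply repeats an expensive subset for every row, while a vectorized match() resolves all rows at once:

```r
library(data.table)

# Illustrative edge list and node table; names are placeholders, not graph.R's.
edgelist <- data.table(src = c("a.c", "b.c", "c.c"),
                       dest = c("b.c", "c.c", "a.c"),
                       weight = c(3L, 1L, 2L))
nodes <- data.table(name = c("a.c", "b.c", "c.c"), id = 0:2)

# Row-by-row pattern: one table lookup per edge (slow on millions of edges).
cells_slow <- lapply(seq_len(nrow(edgelist)), function(i) {
  list(src_id  = nodes[name == edgelist$src[i],  id],
       dest_id = nodes[name == edgelist$dest[i], id],
       weight  = edgelist$weight[i])
})

# Vectorized pattern: resolve all ids with two match() calls.
edgelist[, src_id  := nodes$id[match(src,  nodes$name)]]
edgelist[, dest_id := nodes$id[match(dest, nodes$name)]]
```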

This is fine for the ~100k-edge edge lists that SDSMs usually have, but it will not scale to the 7 million edges that an HDSM normally has.

This also raises another question: whether the analysis of HDSMs should include files not present in the SDSM (since they may have been renamed or deleted) when the two are merged.

carlosparadis commented 1 year ago

Commit c2a1a82 vectorizes most of the steps, but the code still runs too slow. The next step is to try performing the projection directly in graph_to_dsmj.

carlosparadis commented 1 year ago

In the end, the projection function was left unmodified. Here's what was done:

First Refactoring Profiling

This removed getCell and the equivalent reverse-edge re-run, consolidating both into a single pass with some vectorization (getCell iterated via lapply over an index, hence it was not vectorized). src_index and dest_index were also moved outside the iterator.

[Screenshot: profiling of the first refactoring, 2023-05-27 5:44 PM]
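A minimal sketch of the shape of that change (illustrative only; the real code in graph.R differs): compute the index mappings once outside the iterator, and build the reverse edges by swapping columns instead of re-running the lookup.

```r
library(data.table)

# Toy edge list; column names are assumptions for illustration.
edgelist <- data.table(src = c("a.c", "b.c"), dest = c("b.c", "c.c"), weight = c(3L, 1L))
node_names <- sort(unique(c(edgelist$src, edgelist$dest)))

# Indices computed once, outside any lapply/getCell-style iterator.
src_index  <- match(edgelist$src,  node_names) - 1L
dest_index <- match(edgelist$dest, node_names) - 1L

cells <- data.table(src = src_index, dest = dest_index, weight = edgelist$weight)
# Reverse edges obtained by swapping the index columns, avoiding a second lookup pass.
reverse_cells <- data.table(src = dest_index, dest = src_index, weight = edgelist$weight)
all_cells <- rbind(cells, reverse_cells)
```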

Second Refactoring Profiling

This also moves the local dcast calls out of the loop into global dcasts, cutting execution to 1/3 of the time above.

[Screenshot: profiling of the second refactoring, 2023-05-27 7:18 PM]
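Roughly, the idea (a sketch with made-up data, not the actual graph.R code) is to call data.table::dcast once over the whole edge list instead of reshaping repeatedly inside the loop:

```r
library(data.table)

edgelist <- data.table(src = c("a.c", "a.c", "b.c"),
                       dest = c("b.c", "c.c", "c.c"),
                       weight = c(3L, 1L, 2L))

# One global cast into a src x dest weight table; per-cell work then becomes indexing.
weight_table <- dcast(edgelist, src ~ dest, value.var = "weight", fill = 0)
weight_table
```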

Actual fix

Even with all the optimizations, the function still ran indefinitely, including overnight. The actual fix was adding the function filter_by_commit_size(), although the vectorization still helps. The problem was edge explosion during the projection of the git log.

Migration commits like this one in Geronimo: https://github.com/apache/geronimo/commit/b60bf0a205e0257cb3279b08fb6c8d48bc7ce67a create a large number of outlier edges when projected. E.g. in the commit above, 1522 files were changed in a single commit. When calculating co-change, this leads to 1522 choose 2 = 1,157,481 edges. The filter function removes these commits soon after project_git is parsed to prevent that.
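The arithmetic can be checked directly in R:

```r
# Number of co-change pairs produced by a single commit touching 1522 files.
choose(1522, 2)
#> [1] 1157481
```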

In Geronimo's case, the pipeline did not finish because 216 commits each changed at least 30 files, resulting in a grand total of over 7 million edges. Removing those commits reduced the number of edges to 70k, which cut the execution time from never finishing to less than 30 minutes. The cut-off seems to lie somewhere around 500k - 1 million edges on my local machine. I'm sure the function could be parallelized, but DV8 seems to take the same approach, and theoretically it does not make sense to consider migration commits as co-change.
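A usage sketch, assuming project_git is the parsed git log table and that filter_by_commit_size() takes a threshold on the number of files changed per commit (the argument name and value below are assumptions; check the function documentation):

```r
# Drop large (e.g. migration) commits before projecting the git log into a
# co-change graph. Threshold and argument name are assumptions for illustration.
project_git <- filter_by_commit_size(project_git, commit_size = 30)
```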

To-dos

The new filter function was added to two config files, but it should be added to the remaining config files and the other notebooks before merging this branch to master.

Threat to Validity

It should be very easy to manually check the effect of the filter function, or more generally to inspect project_git and see which commits are large, and then decide on a per-project basis whether they are migrations or something else.
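For example, a quick check along those lines (the column name is an assumption about the parsed git log schema; adjust it to the actual project_git columns):

```r
library(data.table)

# Count files changed per commit and list the largest commits first.
files_per_commit <- project_git[, .(n_files = .N), by = "commit_hash"]
files_per_commit[order(-n_files)][1:10]
```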