Note: I wouldn't merge this yet, but I'd encourage the group to review it.
Most important changes that I still intend to implement:
refactor the frequency plotting for agg
refactor all aggregation functions to accept a line collection numpy array (currently only used by merge)
account for different programming languages (will probably end up adding a top-level dimension); this will need to change in the analyzer as well
Outline of the scalek merge operation:
scalek receives a list of objects, either file_objs or method_objs; the only requirement is that each element contains a list of line_objs
based on the length of its line_obj list, each element is assigned to one of k clusters (methods/files of similar length are grouped together)
if an element's line list is longer than the median length of its cluster, we divide the list of lines into as many chunks as that median length
each of these chunks is reduced to a single element by our downscaling method (default: we keep the mode)
for upscaling, we consider the additional lines to be blank (pretty much the nearest neighbor method)
at this point, all lists within a cluster should contain the same number of lines (the cluster's median length)
we then merge each cluster, taking the mean across all the line metrics, which leaves us with one representative sample for each of our k distinct groups.
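For reviewers, here is a rough sketch of the steps above. It is not the code in this PR. It assumes each file/method is already an (n_lines, n_metrics) numpy array, that a blank line is all zeros, and it stands in for the clustering step by sorting on length and splitting into k contiguous groups. The names column_mode, rescale and scalek_merge are made up for illustration.

```python
import numpy as np
from statistics import median_low


def column_mode(chunk):
    """Most frequent value in each metric column (ties go to the smallest value)."""
    modes = []
    for col in chunk.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)


def rescale(lines, target, blank=0.0):
    """Resize an (n_lines, n_metrics) array to exactly `target` rows."""
    n = lines.shape[0]
    if n > target:
        # Downscale: split into `target` chunks and keep the mode of each chunk.
        chunks = np.array_split(lines, target)
        return np.stack([column_mode(c) for c in chunks])
    if n < target:
        # Upscale: treat the missing lines as blank (assumed here to be all zeros).
        pad = np.full((target - n, lines.shape[1]), blank)
        return np.vstack([lines, pad])
    return lines


def scalek_merge(objs, k):
    """Return one representative (median_len, n_metrics) array per cluster."""
    # Stand-in for the real clustering: sort by length, split into k groups.
    order = np.argsort([o.shape[0] for o in objs], kind="stable")
    merged = []
    for idx in np.array_split(order, k):
        if len(idx) == 0:
            continue
        group = [objs[i] for i in idx]
        # median_low always returns one of the actual lengths, so it is an int.
        target = max(1, median_low([g.shape[0] for g in group]))
        resized = np.stack([rescale(g, target) for g in group])
        # Merge the cluster: mean of every line metric across its members.
        merged.append(resized.mean(axis=0))
    return merged
```

Calling scalek_merge(objs, k) on a list of such arrays gives back at most k arrays, one per non-empty group, each with as many rows as the median length of its group.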