jtlandis / Comp683-Proj

Course Project For Comp683
0 stars 0 forks source link

`SLICER::assign_branches()` fails... #1

Open jtlandis opened 4 months ago

jtlandis commented 4 months ago

I have downloaded some data to use for the project. So far SLICER::assign_branches() seems to fail with an error. Specifically in a recursive call

  if (mean_dist < min_dist) {
          min_dist = mean_dist
          min_i = i
          min_j = j
   }

where the variable assigned to mean_dist results in a NaN type due to an upstream mean calculation.

Updating it to the following:

  if (!is.nan(mean_dist) && mean_dist < min_dist) {
          min_dist = mean_dist
          min_i = i
          min_j = j
   }

bypasses that error, only to result in a newish one.

Overall, it is looking like I need to refactor this function as this larger dataset does NOT work.

jtlandis commented 4 months ago

I have spent some time refactoring the code to be a bit more R-like. Admittedly, I may be making assumptions about the code whithout having read through their methods/references to completion.

here is something I have noticed in this block:

  rec_num_branches = max(recurse_br)
  if (rec_num_branches > 1) { 
    add_this = max(branch_assignments)
    for (j in 2:rec_num_branches) {
      recurse_br[recurse_br == j] = add_this + j - 1
    }
    recurse_br[recurse_br == 1] = max(branch_assignments)
    branch_assignments[branch_i] = recurse_br
  }

The above block is within a loop that iterates over chunks of a variable called branch_assignment, starting at it's second index, and up to the maximum branches detected.

Lets assume branch_assignment and recurse_br had the following counts, where recurse_br is defined by the a sub-cluster-assignment on branch_assignment==2

branch_assignments
  1   2   3 
  8 100  55 

 recurse_br
 1  3  4  5  6  7 
 9 25 19 18 15 14 

At the end of the block, all values of brach_assignment==2 are reassigned, but not to what you may be expecting. This is the following results:

branch_assignments
 1  3  8  9 
 8 64 34 57 

within the for loop indexed by j, we are updating the values of recurse_br inplace. This may not usually be an issue, expect that add_this == 3, which is within the range of recurse_br! This results in the merging of values within recurse_br like so:

# j goes form 2:7
j == 2: (no updates)
j == 3:  3 --> 5 (43)
j == 4:  4 --> 6 (34)
j == 5:  5 --> 7 (57)
j == 6:  6 --> 8 (34)
j == 7:  7 --> 9  (57)

Again, I have no idea as to if this is intentional by the author. I imagine the reason this was to shift the new subcluster labels so that they do not overlap with the original labels, but that is unclear since the source code does not contain developer comments

This is is not the end of my confusion either, for some reason, the package intentionally merges the first branch of recurse_br with the maximum assignment in branch_assignments.

to go to our previous example:

# recurse_br   ---->   branch_assignments
#        1     ---->             3
#       #9     ---->        9 + 55 = 64

Trying to rationalize what the algorithm is doing in this specific context.

  1. We identified 3 main branches (1: 8, 2: 100, 3: 55)
  2. We subdivided the branch 2, for some reason, merging groups (3, 5, 7) and (4, 6) together.
  3. group 1, which is generally the starting point for any branch assignment, gets merged with the original split point, supposedly because it may be the closest place to merge at.

I am thinking about changing the for loop to look something like:

not_first <- recurse_br > 1
recurse_br[not_first]  <- recurse_br[not_first] + add_this - 1

The above should prevent merging of sub groups... Alternatively the merging may be okay for some reason not yet known to me.