Open jtlandis opened 4 months ago
I have spent some time refactoring the code to be a bit more R-like. Admittedly, I may be making assumptions about the code whithout having read through their methods/references to completion.
here is something I have noticed in this block:
rec_num_branches = max(recurse_br)
if (rec_num_branches > 1) {
add_this = max(branch_assignments)
for (j in 2:rec_num_branches) {
recurse_br[recurse_br == j] = add_this + j - 1
}
recurse_br[recurse_br == 1] = max(branch_assignments)
branch_assignments[branch_i] = recurse_br
}
The above block is within a loop that iterates over chunks of a variable called branch_assignment
, starting at it's second index, and up to the maximum branches detected.
Lets assume branch_assignment
and recurse_br
had the following counts, where recurse_br
is defined by the a sub-cluster-assignment on branch_assignment==2
branch_assignments
1 2 3
8 100 55
recurse_br
1 3 4 5 6 7
9 25 19 18 15 14
At the end of the block, all values of brach_assignment==2
are reassigned, but not to what you may be expecting. This is the following results:
branch_assignments
1 3 8 9
8 64 34 57
within the for
loop indexed by j
, we are updating the values of recurse_br
inplace. This may not usually be an issue, expect that add_this == 3
, which is within the range of recurse_br
! This results in the merging of values within recurse_br
like so:
# j goes form 2:7
j == 2: (no updates)
j == 3: 3 --> 5 (43)
j == 4: 4 --> 6 (34)
j == 5: 5 --> 7 (57)
j == 6: 6 --> 8 (34)
j == 7: 7 --> 9 (57)
Again, I have no idea as to if this is intentional by the author. I imagine the reason this was to shift the new subcluster labels so that they do not overlap with the original labels, but that is unclear since the source code does not contain developer comments
This is is not the end of my confusion either, for some reason, the package intentionally merges the first branch of recurse_br
with the maximum assignment in branch_assignments
.
to go to our previous example:
# recurse_br ----> branch_assignments
# 1 ----> 3
# #9 ----> 9 + 55 = 64
Trying to rationalize what the algorithm is doing in this specific context.
(1: 8, 2: 100, 3: 55)
(3, 5, 7)
and (4, 6)
together. I am thinking about changing the for loop to look something like:
not_first <- recurse_br > 1
recurse_br[not_first] <- recurse_br[not_first] + add_this - 1
The above should prevent merging of sub groups... Alternatively the merging may be okay for some reason not yet known to me.
I have downloaded some data to use for the project. So far
SLICER::assign_branches()
seems to fail with an error. Specifically in a recursive callwhere the variable assigned to
mean_dist
results in aNaN
type due to an upstream mean calculation.Updating it to the following:
bypasses that error, only to result in a newish one.
Overall, it is looking like I need to refactor this function as this larger dataset does NOT work.