I also stumbled upon the exact same problem. The only (hacky) solution I found was to modify the code in _break_col_permutation_symmetry to sample a limited number of possible permutations instead of looping over all of them. But that's obviously a non-optimal solution.
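For anyone curious, here's a rough illustration of that idea. This is not Snorkel's actual internals; best_permutation and score_fn are hypothetical names standing in for the scoring logic inside _break_col_permutation_symmetry.

```python
# Illustrative sketch of the "sample a limited number of permutations" hack.
# Not Snorkel's real code; best_permutation and score_fn are made-up names.
from itertools import permutations
from math import factorial
import numpy as np

def best_permutation(score_fn, k, max_perms=10_000, seed=0):
    """Return the highest-scoring column permutation under a fixed sampling budget."""
    rng = np.random.RandomState(seed)
    if factorial(k) <= max_perms:
        # Small k: still exhaustive, same behavior as before.
        candidates = (np.array(p) for p in permutations(range(k)))
    else:
        # Large k: draw a fixed number of random permutations instead.
        candidates = (rng.permutation(k) for _ in range(max_perms))
    return max(candidates, key=score_fn)
```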
Hi @dataframing, thanks for digging into this! Agreed, 247 days is not an ideal training time. As a quick fix, your random-search suggestion seems like a good approach! If you want to submit a PR, I'm happy to help with the process!
@henryre Sounds good! Wrapping up a few tests now. I'll open up a PR shortly. It'd be great to have your + @plison's eyes on it, as well!
Does the devices flag have any effect on this? Namely, using CUDA/GPU?
@eggie5 I don't think so, since the core loop is multiplying matrices with NumPy and iterating over a permutation space of size k!, which is expensive no matter how you cut it for large enough k.
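For intuition, the size of that permutation space blows up very quickly; a quick check with the standard library:

```python
# Number of column permutations the symmetry-breaking step would have to consider.
from math import factorial

for k in (4, 8, 12, 17):
    print(f"{k}! = {factorial(k):,}")
# 4! = 24, 8! = 40,320, 12! = 479,001,600, 17! = 355,687,428,096,000
```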
so no workaround at this point?
From the discussion on #1488, it seems like the team is working on a better formalization of the label model that won't have this issue anymore.
Somewhat informally, I found that even when randomly sampling a decent number of permutations (say, 50MM), the label model failed to find a permutation (and thus update the mu instance attribute) that was much better than random. I'm guessing that valid mu parameters are quite sparse within the space of all k! permutations.
(Edit: I should note that the suggested LabelModel in that PR updates the self.mu instance attribute regardless of whether it should or not. I haven't updated the code in that PR, but see @plison's comment on when we should update mu and it should be pretty clear.)
In my case, I resorted to using the majority-vote label model. I also focused my time on building slightly fewer, slightly lower-coverage, but very high-precision rules. This gave me decent performance, and I kept iterating on these rules (introducing new rules, updating existing rules to be more precise, etc.) until the majority-vote classifier was doing a pretty good job.
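For anyone landing here later, a minimal sketch of that fallback (the import path follows the v0.9.x tutorials and may differ in other versions; L_train and k stand in for your own label matrix and cardinality):

```python
# Minimal sketch of the majority-vote fallback; L_train and k are placeholders.
from snorkel.labeling import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=k)
preds = majority_model.predict(L=L_train)        # hard labels (ties abstain under the default policy)
probs = majority_model.predict_proba(L=L_train)  # per-class vote fractions
```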
Hope this holds you over! Like you, I have my fingers crossed that the team can pull off an update to the LabelModel relatively soon. Best of luck!
Hi all, sorry for the delay here. You can check out @brahmaneya's solution (referenced above), which is just pending final edits and will then get merged in, after which we'd love your feedback! Thanks!
Nice, I see it's in master. Please let us know when it's released 👍
Hi all, #1502 is now in v0.9.3. I reran @dataframing's example above and all models trained in under 0.1s on a MacBook. Going to close this out but feel free to reopen if you're seeing any issues.
First things first: thank you all for all the hard work going into this library! It's fantastic and incredibly accessible, especially with the numerous tutorials. Onto the issue:
Issue description
When attempting to use snorkel.labeling.LabelModel for more than ~8 classes, the call to _break_col_permutation_symmetry (more specifically, permutations(range(k)), where k is the number of classes) causes runtime to grow factorially with k.

Code example/repro steps
I've isolated a minimal-ish example below (forgive the improper use of globals):
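(The original script isn't reproduced here; the snippet below is a rough stand-in under the same setup: a random label matrix with one labeling function per class, fit with snorkel.labeling.LabelModel as in the issue. N_DOCS and the loop bounds are arbitrary, and the import path may differ across Snorkel versions.)

```python
# Rough stand-in for the original repro: fit a k-class LabelModel on a random
# label matrix and watch the fit time blow up as k grows.
import time
import numpy as np
from snorkel.labeling import LabelModel

N_DOCS = 1000  # number of data points (arbitrary)

def time_fit(k: int, seed: int = 123) -> float:
    """Fit a k-class LabelModel on a random label matrix and return wall-clock seconds."""
    rng = np.random.RandomState(seed)
    # One labeling function per class; values in {-1, 0, ..., k-1}, where -1 = abstain.
    L_train = rng.randint(-1, k, size=(N_DOCS, k))
    model = LabelModel(cardinality=k, verbose=False)
    start = time.time()
    model.fit(L_train, n_epochs=100, seed=seed)
    return time.time() - start

for k in range(3, 11):
    print(f"k={k}: fit took {time_fit(k):.2f}s")
```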
This gives print logs like the following (excluding benign PyTorch UserWarnings re: torch.bool):

Expected behavior
I suppose I expected the core model training loop (forward, loss, backprop, etc.) to be the significant consumer of runtime, especially with 10+ classes. I didn't expect the model to (attempt to) undergo a computational loop for NPermutations(k) iterations, which can be (very) costly.

Screenshots
N/A
System info
snorkel==0.9.1, installed via pip (via pipenv)
torch==1.1.0.post2, numpy==1.17.2
Additional context
I'm trying to perform a single-label, multi-class classification problem with number of classes k ≈ 17. I wrote k such labeling functions (with great ease! Thank you for the incredible interface), applied them to a validation set and to the training set, then attempted to fit a LabelModel. I noticed the call wasn't returning, and thought it was the training loop. After debugging, I noticed that the call to _break_col_permutation_symmetry was the culprit, and upon further inspection I realized that 17! is 355,687,428,096,000 (355 trillion) iterations, which, even if each loop took 1 nanosecond (1e-9 seconds), would take ~247 days.

I noticed that the tutorials seem to limit the number of non-abstain classes to ~3 or 4, which is probably reasonable. I wonder if there's any consideration for how to have Snorkel handle larger numbers of classes, especially cases with really large numbers of classes (e.g., ImageNet's 1000 classes). I had the following ideas, but I'm not sure if they're sensible given Snorkel's need to view all labeling functions and adjust the mu parameter accordingly:

1. Train k one-versus-rest LabelModels and aggregate their predictions accordingly. This seems more appropriate if we were doing a multi-label, binary classification approach. But I can imagine a case where we treat the problem as such, then squash and apply some softmax to the predictions. A bit hacky, but it would work? (A rough sketch of this follows at the end of this issue.)
2. Fall back to the MajorityLabelVoter.

I'd appreciate any advice here! I completely believe I could be missing something very obvious, so any feedback is welcome. Thank you again for your time on this project!
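To make idea 1 above a bit more concrete, here's a rough, untested sketch of the one-versus-rest aggregation. It is not an official Snorkel recipe; one_vs_rest_probs is a made-up helper, and the import path again follows the v0.9.x tutorials.

```python
# Hypothetical one-versus-rest workaround: for each class c, recode the label
# matrix to a binary problem (c vs. rest), fit a 2-class LabelModel (no k!
# blow-up), then squash the per-class scores with a softmax.
import numpy as np
from snorkel.labeling import LabelModel

def one_vs_rest_probs(L: np.ndarray, k: int, n_epochs: int = 100) -> np.ndarray:
    """Return an (n, k) matrix of softmax-normalized one-vs-rest scores."""
    scores = np.zeros((L.shape[0], k))
    for c in range(k):
        # -1 stays abstain; class c -> 1; every other class -> 0.
        L_c = np.where(L == -1, -1, (L == c).astype(int))
        model = LabelModel(cardinality=2, verbose=False)
        model.fit(L_c, n_epochs=n_epochs)
        scores[:, c] = model.predict_proba(L_c)[:, 1]  # estimated P(label == c)
    # Squash the k independent scores into a distribution over classes.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```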