snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Linear increase in labels leads to exponential increase in LabelModel fit time. #1486

Closed: dataframing closed this issue 4 years ago

dataframing commented 5 years ago

First things first: thank you all for all the hard work going into this library! It's fantastic and incredibly accessible, especially with the numerous tutorials. Onto the issue:


Issue description

When attempting to use snorkel.labeling.LabelModel with more than ~8 classes, the call to _break_col_permutation_symmetry (more specifically, permutations(range(k)), where k is the number of classes) makes runtime blow up factorially (k! iterations) with the number of classes.
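
To make the scaling concrete, here is just the raw count of permutations that loop has to walk for each cardinality (a quick sketch of the combinatorics, not Snorkel's actual code):

import math

for k in range(3, 12):
    # permutations(range(k)) yields k! tuples, so the symmetry-breaking loop grows factorially.
    print(f"k={k:2d}: {math.factorial(k):>12,} permutations")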

Code example/repro steps

I've isolated a minimal-ish example below (forgive the improper use of globals):

from typing import List, Optional, Tuple
import time

from sklearn.datasets import make_classification
from snorkel.labeling import LabelingFunction, LabelModel, PandasLFApplier
import numpy as np
import pandas as pd

# Constants.
N_CLASSES: Optional[int] = None
ABSTAIN: int = -1

def generate_data() -> Tuple[pd.DataFrame, np.ndarray]:

    # Generate dummy dataset.
    X, y = make_classification(
        n_samples=50_000, 
        n_features=N_CLASSES * 2,
        n_classes=N_CLASSES,
        n_informative=N_CLASSES,
        n_clusters_per_class=1,
        random_state=2019,
    )

    # Assert we properly created our dataset.
    assert X.shape[0] == y.shape[0]
    assert len(set(y)) == N_CLASSES

    # Cast our array to DataFrame.
    X_df: pd.DataFrame = pd.DataFrame(X)
    return X_df, y

def greater_than_lf(x: pd.Series, cutoff: float, label: int) -> int:
    return label if x[label] > cutoff else ABSTAIN

def make_ge_lf(cutoff: float, label: int) -> LabelingFunction:
    return LabelingFunction(
        name=f"ge-{cutoff:g}_label-{label}",
        f=greater_than_lf,
        resources=dict(cutoff=cutoff, label=label)
    )

def run():

    # Generate data.
    X_df, y = generate_data()

    # Generate very bad, not good labeling functions.
    lfs: List[LabelingFunction] = [
        make_ge_lf(cutoff=0.5, label=label) for label in range(N_CLASSES)
    ]

    # Initialize and apply our applier onto our training data.
    applier = PandasLFApplier(lfs=lfs)
    L_train: np.ndarray = applier.apply(df=X_df)

    # Initialize and fit our label model.
    label_model = LabelModel(
        cardinality=N_CLASSES,
        verbose=True,
        device="cpu"
    )

    tic: float = time.time()
    label_model.fit(
        L_train=L_train,
        n_epochs=10,
        lr=1e-3,
        log_freq=1,
        seed=2019
    )
    toc: float = time.time()
    print(f"Fitting label model with {N_CLASSES} classes took {toc - tic:.3f} seconds.")

# Rerun the benchmark for an increasing number of classes (rebinding the module-level global).
for i in range(3, 12):
    N_CLASSES = i
    run()

This gives logs like the following (excluding benign PyTorch UserWarnings re: torch.bool):

Fitting label model with 3 classes took 0.051 seconds.
Fitting label model with 4 classes took 0.037 seconds.
Fitting label model with 5 classes took 0.070 seconds.
Fitting label model with 6 classes took 0.257 seconds.
Fitting label model with 7 classes took 1.543 seconds.
Fitting label model with 8 classes took 13.348 seconds.
Fitting label model with 9 classes took 130.37 seconds.
Fitting label model with 10 classes took 1403.96 seconds.

Expected behavior

I suppose I expected the core model training loop (forward pass, loss, backprop, etc.) to be the dominant consumer of runtime, especially with 10+ classes. I didn't expect the model to (attempt to) loop over all k! column permutations, which gets very costly very quickly.

Screenshots

N/A

System info

Additional context

I'm trying to solve a single-label, multi-class classification problem with roughly k ≈ 17 classes. I wrote k such labeling functions (with great ease! Thank you for the incredible interface), applied them to a validation set and to the training set, then attempted to fit a LabelModel. I noticed the call wasn't returning and initially assumed the training loop was the bottleneck. After debugging, I found that the call to _break_col_permutation_symmetry was the culprit, and upon further inspection I realized that iterating over 17! permutations means 355,687,428,096,000 (roughly 356 trillion) iterations, which even at ~60 nanoseconds per iteration works out to about 247 days.
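
A quick back-of-the-envelope check of that estimate (the ~60 ns figure is just an assumed per-permutation cost, not a measurement):

import math

n_perms = math.factorial(17)        # 355,687,428,096,000 candidate column permutations
seconds = n_perms * 60e-9           # assume roughly 60 ns of work per permutation
print(f"{n_perms:,} permutations -> ~{seconds / 86400:.0f} days")  # ~247 days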

I noticed that the tutorials seem to limit the number of non-abstain classes to ~3 or 4, which is probably reasonable. I wonder whether there's been any thought on how Snorkel might handle larger numbers of classes, especially cases with really large label spaces (e.g., ImageNet's 1,000 classes). I had the following ideas, but I'm not sure they're sensible given Snorkel's need to view all labeling functions and adjust the mu parameter accordingly:

I'd appreciate any advice here! I completely believe I could be missing something very obvious, so any feedback is welcome. Thank you again for your time on this project!

plison commented 5 years ago

I also stumbled upon the exact same problem. The only (hacky) solution I found was to modify the code in _break_col_permutation_symmetry to sample a limited number of permutations instead of looping over all possible ones. But that's obviously a suboptimal solution.
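
Roughly the kind of change I mean (a sketch under my own naming, not the actual Snorkel internals): cap the number of permutations considered and fall back to random sampling once k! exceeds that budget.

import math
from itertools import permutations

import numpy as np

def candidate_permutations(k: int, budget: int = 10_000, seed: int = 0):
    """Yield all k! permutations if feasible, otherwise a random sample of `budget` of them."""
    if math.factorial(k) <= budget:
        yield from permutations(range(k))
    else:
        rng = np.random.default_rng(seed)
        for _ in range(budget):
            # Sampling with replacement, so duplicates are possible; good enough for a search heuristic.
            yield tuple(rng.permutation(k))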

henryre commented 5 years ago

Hi @dataframing, thanks for digging on this! Agreed, 247 days is not an ideal training time. As a quick fix, your random search suggestion seems like a good approach! If you want to submit a PR, I'm happy to help with the process!

dataframing commented 5 years ago

@henryre Sounds good! Wrapping up a few tests now. I'll open up a PR shortly. It'd be great to have your + @plison's eyes on it, as well!

eggie5 commented 5 years ago

Does the device flag have any effect on this? Namely, using CUDA/GPU?

dataframing commented 5 years ago

@eggie5 I don't think so, since the core loop is multiplying matrices with NumPy and iterating over a permutation space of size k!, which is expensive no matter how you cut it for large enough k.

eggie5 commented 5 years ago

so no workaround at this point?

dataframing commented 5 years ago

From the discussion on #1488, it seems like the team is working on a better formalization of the label model that won't have this issue anymore.

Somewhat informally, I found that even when randomly sampling a sizable number of permutations (say, 50 million), the label model failed to find a permutation (and thus update the mu instance attribute) that was much better than chance. I'm guessing valid mu parameters are quite sparse within the space of all k! permutations.

(Edit: I should note that the suggested LabelModel in that PR updates the self.mu instance attribute regardless of whether it should. I haven't updated the code in that PR, but see @plison's comment on when mu should be updated; it makes the condition pretty clear.)

In my case, I resorted to using the majority-vote label model. I also focused my time on writing rules that were slightly fewer in number and slightly lower in coverage, but very high in precision. This gave me decent performance, and I kept iterating on these rules (introducing new ones, making existing ones more precise, etc.) until the majority-vote classifier was doing a pretty good job.
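
For reference, the majority-vote fallback looks roughly like this (the import path may differ slightly across snorkel 0.9.x releases; lfs, X_df, and N_CLASSES are the names from the repro above):

from snorkel.labeling import MajorityLabelVoter, PandasLFApplier

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=X_df)

# No parameter estimation here, so cardinality never triggers the k! symmetry-breaking loop.
majority_model = MajorityLabelVoter(cardinality=N_CLASSES)
preds = majority_model.predict(L=L_train)  # ties abstain (-1) under the default tie-break policy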

Hope this holds you over! Like you, I have my fingers crossed that the team can pull off an update to the LabelModel relatively soon. Best of luck!

ajratner commented 5 years ago

Hi all, sorry for the delay here. You can check out @brahmaneya's solution (referenced above), which is just pending final edits before it gets merged in; after that, we'd love your feedback! Thanks!

eggie5 commented 5 years ago

nice, I see it's in master, please let us know when it's released 👍

henryre commented 4 years ago

Hi all, #1502 is now in v0.9.3. I reran @dataframing's example above and all models trained in under 0.1s on a MacBook. Going to close this out but feel free to reopen if you're seeing any issues.