snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Why is LabelModel.fit taking so long? #1537

Closed regstrtn closed 4 years ago

regstrtn commented 4 years ago

I have a dataset of 300K search queries, each with a category label (fashion, electronics, etc.). There are 1461 such categories in the data. I have written 15 labeling functions (for the top 15 categories), with very little overlap or conflict among them. With this setup, training only on a CPU, my LabelModel takes several (4-6) hours to train. Could anyone help me figure out what I might be doing wrong?

label_model = LabelModel(cardinality=1461, verbose=True)
label_model.fit(L_train=L_train, lr=0.01, log_freq=1, seed=1, n_epochs=50)
j   Polarity  Coverage  Overlaps  Conflicts  Correct  Incorrect  Emp. Acc.
0   [364]     0.006     0.000     0.000      1762     32         0.982
1   [665]     0.004     0.000     0.000      264      1008       0.208
2   [668]     0.004     0.000     0.000      844      429        0.663
3   [658]     0.002     0.000     0.000      140      519        0.212
4   [664]     0.002     0.000     0.000      90       414        0.179
5   [684]     0.001     0.000     0.000      445      5          0.989
6   [674]     0.004     0.000     0.000      1179     135        0.897
7   [706]     0.006     0.000     0.000      1546     362        0.810
8   [685]     0.001     0.000     0.000      158      210        0.429
9   [266]     0.001     0.000     0.000      220      113        0.661
10  [262]     0.007     0.000     0.000      1798     499        0.783
11  [233]     0.000     0.000     0.000      95       38         0.714
12  [1433]    0.001     0.000     0.000      186      74         0.715
13  [1009]    0.000     0.000     0.000      59       0          1.000
14  [268]     0.001     0.000     0.000      269      0          1.000
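
For readers unfamiliar with the columns in the summary above, here is a minimal pure-Python sketch of how such per-LF statistics can be derived from a label matrix. It is not Snorkel's own code: the function name `lf_summary`, the toy matrix `L_toy`, and the labels `Y_toy` are hypothetical, and it assumes the usual convention that -1 means the labeling function abstained.

```python
ABSTAIN = -1  # assumed convention: -1 means "this LF did not vote"


def lf_summary(L, Y):
    """Per-LF statistics for a label matrix L (rows = data points, columns = LFs)."""
    n, m = len(L), len(L[0])
    rows = []
    for j in range(m):
        voted = [i for i in range(n) if L[i][j] != ABSTAIN]
        # A data point overlaps if at least one other LF also voted on it.
        overlaps = sum(
            1 for i in voted
            if any(L[i][k] != ABSTAIN for k in range(m) if k != j)
        )
        # A data point conflicts if another LF voted for a different class.
        conflicts = sum(
            1 for i in voted
            if any(L[i][k] not in (ABSTAIN, L[i][j]) for k in range(m) if k != j)
        )
        correct = sum(1 for i in voted if L[i][j] == Y[i])
        rows.append({
            "j": j,
            "polarity": sorted({L[i][j] for i in voted}),
            "coverage": len(voted) / n,
            "overlaps": overlaps / n,
            "conflicts": conflicts / n,
            "correct": correct,
            "incorrect": len(voted) - correct,
            "emp_acc": correct / len(voted) if voted else 0.0,
        })
    return rows


# Toy example: 4 data points, 2 labeling functions.
L_toy = [[0, -1], [0, 0], [-1, 1], [1, 1]]
Y_toy = [0, 1, 1, 1]
summary = lf_summary(L_toy, Y_toy)
```

Read this way, the empirical accuracy of LF 0 in the table is 1762 / (1762 + 32), which rounds to 0.982, and the all-zero Overlaps and Conflicts columns mean the 15 labeling functions almost never vote on the same query.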

System info

snorkel version == 0.9.3

ajratner commented 4 years ago

Hi @regstrtn, apologies for the delayed response and the difficulties here. To start with, a few high-level points:

Anyway, there could be a specific bug here, which we will look into, but it is probably just related to the above.
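
As a rough way to see how the problem size scales with cardinality, the back-of-the-envelope sketch below may help. It rests on an assumption that is not stated in this thread: that the label model expands the label matrix into one indicator column per (labeling function, class) pair and works with the square matrix of pairwise statistics over those columns. The helper `indicator_matrix_stats` is hypothetical, and the cardinality of 16 (the 15 targeted categories plus one catch-all) is only an illustrative comparison point, not a recommendation from the maintainers.

```python
def indicator_matrix_stats(n_points, n_lfs, cardinality, bytes_per_value=8):
    """Estimate sizes under the assumed one-column-per-(LF, class) expansion."""
    cols = n_lfs * cardinality
    return {
        "indicator_columns": cols,
        # Memory for a dense n_points x cols matrix of 8-byte floats, in GiB.
        "dense_indicator_gib": n_points * cols * bytes_per_value / 2**30,
        # Entries in the cols x cols matrix of pairwise statistics.
        "pairwise_entries": cols * cols,
    }


# Numbers from the question: 300K queries, 15 LFs, 1461 categories.
full = indicator_matrix_stats(300_000, 15, 1461)

# Illustrative comparison: same data, labels remapped to 16 classes.
small = indicator_matrix_stats(300_000, 15, 16)

for name, stats in (("cardinality=1461", full), ("cardinality=16", small)):
    print(name, stats)
```

Under that assumption, cardinality 1461 yields 21,915 indicator columns, while 16 classes yields 240, so the per-epoch work would differ by several orders of magnitude even though the labeling functions themselves are unchanged.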

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.