snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

Need partial_fit method in LabelModel #1634

Closed · svjack closed this issue 3 years ago

svjack commented 3 years ago

Because you reset the random seed in the fit function, it seems this framework does not support partial_fit. If I have a large label training array, support for a sparse data format and for partial_fit (as sklearn provides) is needed.
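For reference, the incremental API being requested looks like this in scikit-learn (a minimal sketch on synthetic data; LabelModel itself does not currently expose anything comparable):

```python
# Sketch of scikit-learn's partial_fit pattern: stream the data in chunks
# instead of fitting on the whole array at once.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] > 0).astype(int)  # linearly separable toy labels

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])     # must be declared on the first call
for start in range(0, len(X), 200):
    chunk = slice(start, start + 200)
    clf.partial_fit(X[chunk], y[chunk], classes=classes)

print(clf.score(X, y))
```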

bhancock8 commented 3 years ago

Hi @svjack, thanks for this suggestion. I've added the "help wanted" tag to this feature request, and we're happy to give feedback and reviews on any PR in this direction submitted by yourself or others!

svjack commented 3 years ago

I also want to make a suggestion about filtering labeling functions out of a large collection. I think Snorkel should provide an interface for filtering labeling functions by metrics from other frameworks such as sklearn. For example, I have 2,880 labeling functions, generated by permuting the order of some rules. If I use all of them to construct L_train, the fit process of LabelModel does not seem to work (when Y_dev and the class-balance dictionary are provided): it yields NaN values and advises decreasing the learning rate. These labeling functions label an imbalanced classification problem. When I used sklearn's balanced_accuracy_score to keep only the labeling functions with high scores, LabelModel converged on that subset. This gave me the idea of using Snorkel's LabelModel as a labeling-function filtering toolkit. So for problems with many labeling functions, I think Snorkel should provide an interface to pre-filter the good ones (or advise users on choosing labeling functions) so that LabelModel can fit them.
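The pre-filtering described above can be sketched without any Snorkel-specific API: score each labeling function on a dev set with balanced_accuracy_score, ignoring abstains (-1), and keep only the strong ones. The label matrix below is synthetic and the 0.6 threshold is an arbitrary illustration:

```python
# Pre-filter labeling functions by per-LF balanced accuracy on a dev set,
# before handing the surviving columns to LabelModel.fit.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n, m = 500, 20
Y_dev = rng.integers(0, 2, size=n)

# Toy label matrix: -1 = abstain; the first 5 LFs mostly agree with Y_dev,
# the rest vote at random.
L_dev = np.full((n, m), -1)
for j in range(m):
    mask = rng.random(n) < 0.4                      # each LF votes on ~40% of points
    if j < 5:
        votes = np.where(rng.random(n) < 0.9, Y_dev, 1 - Y_dev)
    else:
        votes = rng.integers(0, 2, size=n)
    L_dev[mask, j] = votes[mask]

def lf_balanced_accuracy(col, y):
    voted = col != -1                               # score only non-abstaining rows
    if voted.sum() == 0:
        return 0.0
    return balanced_accuracy_score(y[voted], col[voted])

scores = np.array([lf_balanced_accuracy(L_dev[:, j], Y_dev) for j in range(m)])
keep = np.where(scores > 0.6)[0]                    # columns to pass to LabelModel
print(keep)
```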

talolard commented 3 years ago

Hi @svjack , I was hitting the same problems and have a PR in the works for sparse input.

Came here to say that (I think) partial_fit doesn't make sense here, because the actual training is independent of the number of documents. Instead, you're optimizing a matrix of size (num_funcs * num_classes).

Generally the slow part of Snorkel is transforming the label matrix (rows are examples, columns are functions, values are classes) into the optimization objective. Unfortunately, partial_fit won't help you there, because that transformation needs the entire dataset.
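To illustrate why the dataset size only matters for this one pass: the statistics the label model fits against (an overlap matrix over the one-hot-augmented label matrix, per the Snorkel papers) are only (num_funcs * num_classes) squared in size, and the full pass can be accumulated chunk by chunk with bounded memory. This is a simplified sketch; the helper names are mine, not Snorkel's:

```python
# Accumulate the empirical overlap matrix O = L_aug^T L_aug / n over chunks.
# O is small (m*k x m*k) even when the label matrix has millions of rows.
import numpy as np

def one_hot_augment(L, k):
    """Expand an (n, m) label matrix with values in {-1, 0..k-1} (-1 = abstain)
    into an (n, m*k) indicator matrix."""
    n, m = L.shape
    out = np.zeros((n, m * k))
    for j in range(m):
        for c in range(k):
            out[:, j * k + c] = (L[:, j] == c)
    return out

rng = np.random.default_rng(0)
k = 3
L = rng.integers(-1, k, size=(10_000, 4))   # n=10k docs, m=4 LFs, k=3 classes

# Full-pass computation.
A_full = one_hot_augment(L, k)
O_full = A_full.T @ A_full / len(L)

# Chunked accumulation: same result, one chunk in memory at a time.
O_chunked = np.zeros((4 * k, 4 * k))
for start in range(0, len(L), 1000):
    A = one_hot_augment(L[start:start + 1000], k)
    O_chunked += A.T @ A
O_chunked /= len(L)

print(np.allclose(O_full, O_chunked))  # → True: the two passes agree
```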

Only god knows when I'll finish my PR, but it's mostly blocked on code cleanliness. I'd suggest you clone it and compute your LFs in a sparse fashion. You can take a look at the tests for a sense of how to structure the data in a sparse way and use the sparse label models.

svjack commented 3 years ago

Thanks for pointing that out. I ended up using PandasParallelLFApplier (with the help of Dask) to handle the transformation of the big label matrix, setting the parallelism for a multi-core machine. Because LFs are applied column-wise, I sliced the original big matrix into small pieces and used the scipy stack functions to concatenate them; see the reconstruct_data function here.
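The slice-apply-concatenate pattern described above can be sketched with scipy.sparse alone (apply_lfs is a hypothetical stand-in for PandasParallelLFApplier; abstains are shifted so they land on the sparse zero):

```python
# Apply LFs to row slices of a large dataset and stack the sparse pieces
# back into one label matrix.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, m = 5000, 8

def apply_lfs(rows):
    """Stand-in for an LF applier: returns a sparse (len(rows), m) slice
    of the label matrix (random votes here, for illustration)."""
    dense = rng.integers(-1, 2, size=(len(rows), m))
    return sp.csr_matrix(dense + 1)  # shift so abstain (-1) becomes the sparse 0

pieces = [apply_lfs(range(s, min(s + 1000, n))) for s in range(0, n, 1000)]
L_train = sp.vstack(pieces).tocsr()  # reconstruct the full matrix
print(L_train.shape)
```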

What I want to know is: do you have a realistic dataset example using your sparse_label_model? I think its performance needs a real test, as validation that Snorkel's LabelModel also performs well on very large data. I think this is not just a "too sparse" engineering problem, but a question of Snorkel's convergence behavior on big data. I will review your sparse implementation, and I'd like a more detailed guide to it, such as the EventCooccurence you use in this project. Can you explain your design ideas to me?

talolard commented 3 years ago

I actually wrote a blog post about it here

svjack commented 3 years ago

Where is the fit_from_objective definition used in fit_from_sparse_example_event_list? I reviewed the blog: you use SQL statements to compute the labeling functions as columns in a SQL table. I think this could be improved with user-defined functions in Python registered in the database (as UDFs do in Spark, or as sqlite-transform does), so that arbitrary Python modules can be used inside the SQL statements, and then, as your blog does, the objective matrix built in the SQL engine can be read back as an ndarray. I hope you can provide an interface that first dumps data into a DB table, performs LF labeling and objective-matrix construction with SQL statements, and finally reads the matrix back into Python for Snorkel. I hope you release the full code.
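For the record, registering a Python labeling function as a SQL UDF is straightforward in sqlite3 (the table, column, and LF names below are illustrative; Spark and Hive have analogous UDF registration):

```python
# Register a Python labeling function as a SQL UDF and run it column-wise
# inside the database.
import sqlite3

def lf_contains_refund(text):
    """Toy LF: 1 if the text mentions a refund, -1 (abstain) otherwise."""
    return 1 if "refund" in text.lower() else -1

conn = sqlite3.connect(":memory:")
conn.create_function("lf_contains_refund", 1, lf_contains_refund)
conn.execute("CREATE TABLE docs (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(1, "Please refund my order"), (2, "Great product")])

labels = conn.execute(
    "SELECT id, lf_contains_refund(body) FROM docs ORDER BY id"
).fetchall()
print(labels)  # → [(1, 1), (2, -1)]
conn.close()
```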

svjack commented 3 years ago

I also think you could replace the database in your blog with Spark or Dask to make it more compatible with Snorkel: with them, labeling functions can be defined and used more conveniently, and they also manage resources between memory and the data warehouse (and even support virtual SQL statements and virtual dataframes).
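The partition-parallel pattern that Spark and Dask provide can be sketched with plain pandas and a thread pool (the DataFrame contents and the LF are illustrative; a Dask/Spark engine would shard and schedule the partitions for you):

```python
# Apply a labeling function partition by partition and stack the results
# into a label matrix, mimicking what Dask/Spark do across workers.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

df = pd.DataFrame({"body": [f"doc {i} refund" if i % 2 == 0 else f"doc {i}"
                            for i in range(1000)]})

def lf_refund(row):
    return 1 if "refund" in row.body else -1  # -1 = abstain

def apply_to_partition(part):
    return np.array([[lf_refund(r)] for r in part.itertuples()])

# Split into 8 partitions by row range; Dask/Spark would do this for you.
partitions = [df.iloc[s:s + 125] for s in range(0, len(df), 125)]
with ThreadPoolExecutor() as pool:
    blocks = list(pool.map(apply_to_partition, partitions))
L_train = np.vstack(blocks)
print(L_train.shape, (L_train == 1).sum())
```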

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.