snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Support Training With Sparse Matrices #1629

Closed talolard closed 3 years ago

talolard commented 3 years ago

Disclaimer

This PR isn't done. It does what it's supposed to do and has tests, but style and code cleanliness might not be there.

I'm not totally confident this implementation is a fit. I'd appreciate it if someone could take a look and let me know whether I'm on track before I polish this. Maybe @bhancock8, who replied to the original issue?

Description of proposed changes

Adds support for training and inference with sparse matrices.

This PR adds a few convenience functions to help the user work with sparse matrix representations of L_ind or of the objective matrix (does either have a formal name?).

I presume most users will call 'train_model_from_sparse_event_cooccurence', which takes a list of tuples representing the indices and values of L_ind (the value is always 1), populates a sparse matrix, and runs training.

train_model_from_sparse_event_cooccurence calls 'train_model_from_known_objective', which takes a dense numpy representation of O and trains on it. When I use Snorkel, I call this function directly and calculate O elsewhere, which is faster.
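
To make the data flow concrete, here is a minimal, dependency-free sketch of how a dense O could be derived from the sparse tuples. It is not the code in this PR: the function name `objective_from_event_tuples` is hypothetical, and it assumes O is the co-occurrence matrix L_ind^T L_ind divided by the number of examples, with no higher-order cliques. A real implementation would more likely build a scipy sparse matrix and multiply it by its transpose.

```python
from collections import Counter
from itertools import product


def objective_from_event_tuples(tuples, n_examples, n_events):
    """Build a dense O from sparse (row, column, 1) entries of L_ind.

    Hypothetical helper: assumes O = (L_ind^T @ L_ind) / n_examples.
    """
    # Group the active indicator columns by example row.
    events_by_example = {}
    for row, col, value in tuples:
        if value != 1:
            raise ValueError("L_ind entries are indicators, so the value must be 1")
        events_by_example.setdefault(row, set()).add(col)

    # Count how often each ordered pair of events fires on the same example.
    counts = Counter()
    for cols in events_by_example.values():
        for i, j in product(cols, repeat=2):
            counts[(i, j)] += 1

    # Materialise the dense matrix; pairs that never co-occur stay at 0.
    return [
        [counts[(i, j)] / n_examples for j in range(n_events)]
        for i in range(n_events)
    ]


# Four examples, four indicator columns; example 3 has no entries (all abstain).
tuples = [(0, 0, 1), (0, 3, 1), (1, 0, 1), (2, 1, 1), (2, 3, 1)]
O = objective_from_event_tuples(tuples, n_examples=4, n_events=4)
```

Only examples with at least one entry are visited, so the cost depends on the number of non-zero entries rather than on the full size of L_ind.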

Internally, LabelModel is refactored slightly to support train_model_from_known_objective: constants are set differently, and the tree and clique data calculations are moved a little.

Related issue(s)

Fixes #1625

Test plan

I wrote tests in test_sparse_data_helpers. Basically the tests create an L matrix in standard format, and then compare the output of normal Snorkel to Sparse Snorkel.
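
Comparing the two code paths requires the same data in both formats. The sketch below shows one way to convert a standard label matrix into the tuple format and back. The helper names are hypothetical and are not taken from test_sparse_data_helpers, and the column layout (one indicator column per labeling function and class, at index lf_index * cardinality + label) is an assumption about how L_ind is arranged.

```python
ABSTAIN = -1  # Snorkel's conventional abstain value in a label matrix


def label_matrix_to_event_tuples(L, cardinality):
    """Convert a standard label matrix into sparse (row, column, 1) entries."""
    tuples = []
    for row, votes in enumerate(L):
        for lf_index, label in enumerate(votes):
            if label == ABSTAIN:
                continue  # abstains produce no entry at all
            # Assumed layout: one column per (labeling function, class) pair.
            tuples.append((row, lf_index * cardinality + label, 1))
    return tuples


def event_tuples_to_dense(tuples, n_rows, n_cols):
    """Expand the sparse entries back into a dense 0/1 indicator matrix."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in tuples:
        dense[row][col] = value
    return dense


# Three examples, two labeling functions, two classes.
L = [[0, 1], [ABSTAIN, 1], [1, ABSTAIN]]
tuples = label_matrix_to_event_tuples(L, cardinality=2)
L_ind = event_tuples_to_dense(tuples, n_rows=3, n_cols=4)
```

A test along these lines could feed `L` to the standard model and `tuples` to the sparse one, then check that the two produce matching predictions.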

codecov[bot] commented 3 years ago

Codecov Report

Merging #1629 (d579832) into master (ed77718) will decrease coverage by 0.94%. The diff coverage is 82.14%.

@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
- Coverage   97.21%   96.26%   -0.95%     
==========================================
  Files          68       72       +4     
  Lines        2151     2276     +125     
  Branches      345      358      +13     
==========================================
+ Hits         2091     2191     +100     
- Misses         31       52      +21     
- Partials       29       33       +4     
Impacted Files                                         Coverage Δ
...abel_model/sparse_example_eventlist_label_model.py  47.82% <47.82%> (ø)
...parse_label_model/sparse_event_pair_label_model.py  61.53% <61.53%> (ø)
snorkel/labeling/model/label_model.py                  94.58% <89.18%> (-0.97%)
...odel/sparse_label_model/base_sparse_label_model.py  91.30% <91.30%> (ø)
...l/sparse_label_model/sparse_label_model_helpers.py  100.00% <100.00%> (ø)

talolard commented 3 years ago

I think the coverage tool isn't picking up on some of the tests. Static methods in sparse_example_eventlist_label_model.py and sparse_event_pair_label_model.py get tested explicitly in the two tests I marked with @pytest.mark.complex.

github-actions[bot] commented 3 years ago

This pull request is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.