snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org

Feature/memoize key #1572

Closed · Wirg closed this pull request 4 years ago

Wirg commented 4 years ago

Description of proposed changes

Feature

Currently, there is no way to choose the key that gets memoized when using preprocessor(memoize=True). This leads to two issues:

Result

@preprocessor(memoize=True, memoize_key=lambda p: p.base_website_url)
def add_website_reliability(paragraph):
    paragraph.website_reliability = evaluate_reliability(paragraph.base_website_url)
    return paragraph
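
To illustrate the intent, here is a hedged usage sketch (not code from this PR): SimpleNamespace stands in for a real data point type, and evaluate_reliability is the undefined placeholder helper from the snippet above. With the proposed memoize_key, paragraphs sharing a base_website_url share one cache entry.

from types import SimpleNamespace

# Hypothetical data points; only base_website_url feeds the memoization key.
p1 = SimpleNamespace(base_website_url="https://example.com", text="first paragraph")
p2 = SimpleNamespace(base_website_url="https://example.com", text="second paragraph")

add_website_reliability(p1)
add_website_reliability(p2)  # same base_website_url: cache hit, evaluate_reliability runs only once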

Implementation

Add a memoize_key: Optional[HashingFunction] argument to BaseMapper; if provided and not None, it is used instead of get_hashable to compute the hash of the input.

memoize_key has been made accessible in the different functions that provide the memoize API.
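
To make the wiring concrete, here is a minimal sketch of how a memoize_key hook could slot into BaseMapper-style memoization. This is an illustration of the idea, not the actual snorkel/map/core.py diff; the HashingFunction alias and the get_hashable stand-in are simplified.

from typing import Any, Callable, Dict, Optional

HashingFunction = Callable[[Any], Any]  # simplified alias for illustration

def get_hashable(x: Any) -> Any:
    # Stand-in for the default hashing of a data point.
    return x

class MemoizedMapper:
    """Simplified mapper showing the proposed memoize_key hook."""

    def __init__(self, memoize: bool = False, memoize_key: Optional[HashingFunction] = None) -> None:
        self.memoize = memoize
        # A custom key function, when provided, replaces the default hashing.
        self._memoize_key = memoize_key if memoize_key is not None else get_hashable
        self._cache: Dict[Any, Any] = {}

    def run(self, x: Any) -> Any:
        raise NotImplementedError  # subclasses implement the actual mapping

    def __call__(self, x: Any) -> Any:
        if not self.memoize:
            return self.run(x)
        key = self._memoize_key(x)  # custom key if given, default hashing otherwise
        if key not in self._cache:
            self._cache[key] = self.run(x)
        return self._cache[key]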

Related issue(s)

https://github.com/snorkel-team/snorkel/issues/1561

Test plan

Checklist

Need help on these? Just ask!

@henryre, I will be waiting for your review. I ran tox -e doc, but it did not produce any changes, and I got a bunch of WARNING: toctree contains reference to nonexisting document messages. Is that normal? By the way, how would you like to discuss my use case further?

codecov[bot] commented 4 years ago

Codecov Report

Merging #1572 into master will increase coverage by 0.06%. The diff coverage is 92.30%.

@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   97.13%   97.19%   +0.06%     
==========================================
  Files          56       68      +12     
  Lines        2091     2137      +46     
  Branches      342      343       +1     
==========================================
+ Hits         2031     2077      +46     
  Misses         31       31              
  Partials       29       29              
Impacted Files Coverage Δ
...el/classification/training/loggers/checkpointer.py 96.90% <ø> (ø)
snorkel/labeling/lf/core.py 100.00% <ø> (ø)
snorkel/preprocess/core.py 100.00% <ø> (ø)
snorkel/slicing/sf/nlp.py 100.00% <ø> (ø)
snorkel/classification/training/trainer.py 89.83% <50.00%> (ø)
snorkel/labeling/model/label_model.py 95.54% <50.00%> (ø)
snorkel/labeling/lf/nlp.py 100.00% <100.00%> (ø)
snorkel/map/core.py 100.00% <100.00%> (ø)
snorkel/preprocess/nlp.py 86.66% <100.00%> (ø)
snorkel/slicing/utils.py 94.80% <100.00%> (+0.13%) :arrow_up:
... and 14 more
Wirg commented 4 years ago

@henryre I just noticed this won't work the way I expected.

The Snorkel pattern is to return x_mapped, so on a cache hit the cached mapped data point is returned in place of the incoming one.

    def test_decorator_mapper_memoized_use_memoize_key(self) -> None:
        square_hit_tracker = SquareHitTracker()

        @lambda_mapper(memoize=True, memoize_key=lambda x: x.num)
        def square(x: DataPoint) -> DataPoint:
            x.num_squared = square_hit_tracker(x.num)
            return x

        x8 = self._get_x()
        x8_mapped = square(x8)
        assert x8_mapped is not None
        self.assertEqual(x8_mapped.num_squared, 64)
        self.assertEqual(square_hit_tracker.n_hits, 1)
        x8_with_another_text = self._get_x(text="Henry is still having fun")
        x8_with_another_text_mapped = square(x8_with_another_text)
        assert x8_with_another_text_mapped is not None
        self.assertEqual(x8_with_another_text_mapped.num_squared, 64)
        self.assertEqual(square_hit_tracker.n_hits, 1)
        # This should fail :/
        self.assertEqual(x8_with_another_text_mapped, x8_mapped)
henryre commented 4 years ago

Hi @Wirg, thanks for putting this up! Based on the example you put up, the expected behavior would be square(x8_with_another_text) == x8_mapped since the hashing function was (intentionally) "poorly chosen" in the test. Are you saying self.assertEqual(x8_with_another_text_mapped, x8_mapped) will trigger an AssertionError in the current implementation?
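
For concreteness, here is a tiny self-contained illustration of that behavior (a generic memoization helper and SimpleNamespace instead of lambda_mapper and DataPoint; not snorkel code). Because both data points produce the same key, the second call returns the object cached from the first call, text included.

from types import SimpleNamespace
from typing import Any, Callable, Dict

def memoized(fn: Callable[[Any], Any], key: Callable[[Any], Any]) -> Callable[[Any], Any]:
    # Generic memoization keyed by `key`, mirroring memoize_key in spirit.
    cache: Dict[Any, Any] = {}
    def wrapper(x: Any) -> Any:
        k = key(x)
        if k not in cache:
            cache[k] = fn(x)
        return cache[k]
    return wrapper

def square_fn(x: Any) -> Any:
    x.num_squared = x.num ** 2
    return x  # the mapped data point is the (mutated) input itself

square = memoized(square_fn, key=lambda x: x.num)

a = SimpleNamespace(num=8, text="some text")
b = SimpleNamespace(num=8, text="Henry is still having fun")

assert square(a) is a                 # first call maps and caches a under key 8
assert square(b) is a                 # cache hit: b is never mapped, a comes back
assert square(b).text == "some text"  # the cached point's text, not b's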

Wirg commented 4 years ago

Hi @henryre ,

I hope you're doing well.

Small bump on this PR.

Wirg commented 4 years ago

Hi @henryre ,

Another bump for this PR.

What would you like me to do?

Do you have changes to request? Or do you want to give up on this feature?

henryre commented 4 years ago

Hi @Wirg, sorry for the delay here and thanks for the reminder! Taking a look today!

Wirg commented 4 years ago

@henryre

I finally changed the test, but I am not fully satisfied: if tomorrow someone changes certain implementation details, the test won't fail.

henryre commented 4 years ago

@Wirg good thinking! You could add an additional field called not_used to the test and give the two data points different values for it.
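
A hedged sketch of what that could look like, reusing names from the test earlier in this thread (SquareHitTracker, _get_x, lambda_mapper, DataPoint); the not_used keyword on _get_x is assumed here, not existing API.

    def test_memoize_key_ignores_unrelated_fields(self) -> None:
        square_hit_tracker = SquareHitTracker()

        @lambda_mapper(memoize=True, memoize_key=lambda x: x.num)
        def square(x: DataPoint) -> DataPoint:
            x.num_squared = square_hit_tracker(x.num)
            return x

        # Same num but different not_used values: if the cache were keyed on the
        # whole data point instead of memoize_key, the second call would recompute
        # and n_hits would be 2 instead of 1.
        x8 = self._get_x(not_used=1)
        x8_variant = self._get_x(not_used=2)
        square(x8)
        square(x8_variant)
        self.assertEqual(square_hit_tracker.n_hits, 1)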

Wirg commented 4 years ago

@henryre so I added a not_used int field.

EDIT: never mind the failures described below; I rebased and they were fixed. Waiting for your review.

I am still encountering new test failures. I fixed F541 (f-string without any placeholders), but I am facing a typing failure caused by the torch.nn.Linear usage in snorkel/slicing/utils. I am not sure what my next steps should be here. Is this already fixed, and should I rebase?

Wirg commented 4 years ago

@henryre small bump

I don't know what I should be doing regarding codecov. Am I expected to add more tests? If so, where?

henryre commented 4 years ago

@Wirg looks like codecov was being a bit temperamental, will go ahead and merge in. Thanks for your hard work here!