googleinterns / amaranth

Apache License 2.0
2 stars 0 forks source link

Split ML model helper functions into separate file #7

Closed tommylau-exe closed 4 years ago

tommylau-exe commented 4 years ago

Fixes #3

Creates new file: ml/amaranth_lib.py. Also includes tests for these helper functions in ml/test_amaranth_lib.py.

tommylau-exe commented 4 years ago

https://github.com/googleinterns/amaranth/blob/07bdbb6a36fc6345963c8aec31f1c2526ee60118/ml/amaranth_lib.py#L13-L22

Just realized this function isn't really testable. Takes a file path and returns a DataFrame? Will be re-thinking this one

tommylau-exe commented 4 years ago

https://github.com/googleinterns/amaranth/blob/07bdbb6a36fc6345963c8aec31f1c2526ee60118/ml/amaranth_lib.py#L13-L22

Just realized this function isn't really testable. Takes a file path and returns a DataFrame? Will be re-thinking this one

Fixed by splitting into two different functions:

https://github.com/googleinterns/amaranth/blob/afaff36dd6a7193f95192dc16db91d50178d961a/ml/amaranth_lib.py#L13

https://github.com/googleinterns/amaranth/blob/afaff36dd6a7193f95192dc16db91d50178d961a/ml/amaranth_lib.py#L37

This generalizes the implicit functionality present in the original function, and allows for more reusability: e.g. use with DataFrame's read from different file types (not just CSV). As a side-effect, actually reading the data is now the responsibility of the caller, but these functions can help format that data, which was the true intent all along

tommylau-exe commented 4 years ago

https://github.com/googleinterns/amaranth/blob/94b901a10bc16b5cda43c80853ee776611664834/ml/amaranth_lib.py#L148-L170

This function feels similar to read_calorie_data() from above, not straightforward to test. It does a lot of stuff, mostly with the help of keras. Will try to write some test for it though

tommylau-exe commented 4 years ago

https://github.com/googleinterns/amaranth/blob/94b901a10bc16b5cda43c80853ee776611664834/ml/amaranth_lib.py#L148-L170

This function feels similar to read_calorie_data() from above, not straightforward to test. It does a lot of stuff, mostly with the help of keras. Will try to write some test for it though

Realized that this line is particularly problematic:

https://github.com/googleinterns/amaranth/blob/94b901a10bc16b5cda43c80853ee776611664834/ml/amaranth_lib.py#L173

The keras.preprocessing.text.one_hot() function doesn't guarantee "unicity" aka uniqueness of it's mappings since it relies on hashing. This means it's result can contain collisions, particularly on small vocabularies, as you might find in a unit test. I can't say I'll completely stop using this function (it's quite useful and encodes in O(1) time), but I'll have to refactor it out of the helper functions library to be able to properly test.

tommylau-exe commented 4 years ago

https://github.com/googleinterns/amaranth/blob/94b901a10bc16b5cda43c80853ee776611664834/ml/amaranth_lib.py#L148-L170

This function feels similar to read_calorie_data() from above, not straightforward to test. It does a lot of stuff, mostly with the help of keras. Will try to write some test for it though

Realized that this line is particularly problematic:

https://github.com/googleinterns/amaranth/blob/94b901a10bc16b5cda43c80853ee776611664834/ml/amaranth_lib.py#L173

The keras.preprocessing.text.one_hot() function doesn't guarantee "unicity" aka uniqueness of it's mappings since it relies on hashing. This means it's result can contain collisions, particularly on small vocabularies, as you might find in a unit test. I can't say I'll completely stop using this function (it's quite useful and encodes in O(1) time), but I'll have to refactor it out of the helper functions library to be able to properly test.

Fixed by replacing add_input_labels() with this function instead:

https://github.com/googleinterns/amaranth/blob/737adebe010615df97234b03d89625f856d959c8/ml/amaranth_lib.py#L147

Padding a list was an important part of the original function that's easily testable. Encoding strings to integers is now also the responsibility of the caller, that way we can maintain speed and testability