understanding-search / maze-dataset

maze datasets for investigating OOD behavior of ML systems

Tokenizer overhaul (complete) #38

Closed: mivanit closed this 2 months ago

mivanit commented 5 months ago

TODOS:

(roughly in order of importance)

review-notebook-app[bot] commented 5 months ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



aaron-sandoval commented 3 months ago

> see if any more utils should be moved to `muutils`

Except for `all_instances`, I don't think any more of the utilities in `utils.py` that were newly added in this PR should be moved, since they are more narrowly applicable to mazes. There are a few older ones that could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it, here are my recommendations. I only checked for uses in the two libraries (maze-dataset and maze-transformer).

| Function | Where Used | Recommendation |
|---|---|---|
| `all_instances` | maze-dataset | Move |
| `bool_array_from_string` | maze-dataset, maze-transformer | Move |
| `adj_list_to_nested_set` | maze-dataset, maze-transformer | Maybe keep in maze-dataset since it's at least lattice graph-specific |
| `apply_mapping` | Nowhere | Move |
| `apply_mapping_chain` | Nowhere | Move |
aaron-sandoval commented 3 months ago

> unclear if `save_hashes` is being properly parallelized :/

I don't know of a proper way to verify this. As a quick check, I just ran a few trials on my new desktop with different numbers of processes in the `Pool`. Here are the wall-clock times reported by the spinner on `list(pool.map(hash, all_tokenizers))`.

| # Processes | Spinner Runtime [sec] |
|---|---|
| No `Pool`; `list(map(hash, all_tokenizers))` | 59 |
| 1 | 75 |
| 2 | 39 |
| 4 | 21 |
| 6 | 19 |
| 12 | 19 |
| 24 | 21 |

The multiprocessing is clearly helping compared to a single core, but I don't know why it plateaus so early. I tried using `Pool.apply_async`, and that was much slower. IMO the runtime doesn't seem long enough to warrant sinking time into this now, since neither of us seems to have the relevant knowledge at the moment. This is the kind of backend optimization that can be done after the 1.0.0 release.
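For reference, here is a minimal standalone sketch of this timing experiment. The workload below is a stand-in so the snippet runs without maze-dataset installed; the real benchmark hashes the tokenizers returned by `get_all_tokenizers()`, and the numbers above will not reproduce exactly on other machines:

```python
import time
from multiprocessing import Pool

# Stand-in workload: the real benchmark hashes the list returned by
# get_all_tokenizers(), which is much more expensive per item.
ALL_TOKENIZERS = [tuple(range(i, i + 64)) for i in range(100_000)]


def time_hashing(n_processes: int | None) -> float:
    """Wall-clock seconds to hash everything, with or without a Pool."""
    start = time.perf_counter()
    if n_processes is None:
        list(map(hash, ALL_TOKENIZERS))  # baseline: no Pool
    else:
        with Pool(processes=n_processes) as pool:
            list(pool.map(hash, ALL_TOKENIZERS))
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (None, 1, 2, 4, 6, 12, 24):
        print(f"processes={n}: {time_hashing(n):.2f} s")
```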

aaron-sandoval commented 2 months ago

@mivanit I know we're wrapping this up, but I was thinking about a small change to the `mark_as_unsupported` decorator. This shouldn't change any behavior in any tests.

Right now we use that decorator, as well as single-valued `Literal` types, to artificially restrict the space returned by `all_instances`. You had mentioned before that ideally this artificial restriction would hide those subspaces from `all_instances` but still allow users to access them by directly constructing tokenizers with the desired parameters. The `Literal` types allow that direct construction, but the current implementation of `mark_as_unsupported` doesn't, since the dummy abstract method precludes any construction of that class.

What if instead `mark_as_unsupported` just overwrote that class's `is_valid` method to always return `False`? That would keep those classes constructable. Calls to `get_all_tokenizers` would still ignore them just the same, and direct use of `all_instances` would ignore them as long as users pass `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` as the documentation asks. We could maybe include `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` by default in calls to `all_tokenizers` so that users have to consciously opt out of those validation funcs.

I was thinking about this when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Let me know what you think, and if you're good with it I can make a super quick commit to implement it.

mivanit commented 2 months ago

> @mivanit I know we're wrapping this up, but I was thinking about a small change to the `mark_as_unsupported` decorator. This shouldn't change any behavior in any tests.
>
> Right now we use that decorator, as well as single-valued `Literal` types, to artificially restrict the space returned by `all_instances`. You had mentioned before that ideally this artificial restriction would hide those subspaces from `all_instances` but still allow users to access them by directly constructing tokenizers with the desired parameters. The `Literal` types allow that direct construction, but the current implementation of `mark_as_unsupported` doesn't, since the dummy abstract method precludes any construction of that class.
>
> What if instead `mark_as_unsupported` just overwrote that class's `is_valid` method to always return `False`? That would keep those classes constructable. Calls to `get_all_tokenizers` would still ignore them just the same, and direct use of `all_instances` would ignore them as long as users pass `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` as the documentation asks. We could maybe include `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` by default in calls to `all_tokenizers` so that users have to consciously opt out of those validation funcs.
>
> I was thinking about this when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Let me know what you think, and if you're good with it I can make a super quick commit to implement it.

You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing" filter, but if you think this is a quick edit, then I think it's fine to make as long as we add a warning when initializing an unsupported tokenizer.
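For concreteness, here is a rough sketch of what that change could look like. The names `mark_as_unsupported`, `is_valid`, and `get_all_tokenizers` come from the discussion above, but the exact signatures, the warning wording, and the patching approach are assumptions rather than the actual maze-dataset implementation:

```python
import warnings


def mark_as_unsupported(cls: type) -> type:
    """Sketch: mark a tokenizer element class as unsupported without making it
    abstract, so it stays directly constructable (with a warning)."""
    original_init = cls.__init__

    def patched_init(self, *args, **kwargs):
        # warn on construction so users know they are opting into an
        # experimental, untested subspace of tokenizers
        warnings.warn(
            f"{cls.__name__} is unsupported/experimental and is excluded "
            "from get_all_tokenizers()."
        )
        original_init(self, *args, **kwargs)

    def is_valid(self) -> bool:
        # always report invalid, so validation-aware calls to all_instances
        # (e.g. with MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS) skip it
        return False

    cls.__init__ = patched_init
    cls.is_valid = is_valid
    return cls
```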

mivanit commented 2 months ago

Making a note for profiling and figuring out how many tokenizers we can test:

| Time (MM:SS) | Time (s) | n tokenizers tested | Link |
|---|---|---|---|
| 9:54 | 594 | 100 | action |
| 10:05 | 605 | 300 | action |
| 11:15 | 675 | 1000 | action |
| 13:33 | 813 | 3000 | action |
| 16:41 | 1001 | 5000 | action |
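As a rough back-of-the-envelope check, a linear fit to the timings in the table (an estimate, not a CI measurement) gives roughly 580 s of fixed overhead plus about 0.08 s per tokenizer, which can be used to guess how many tokenizers fit in a given time budget:

```python
import numpy as np

# timings from the table above: (n tokenizers tested, total runtime in seconds)
n_tested = np.array([100, 300, 1000, 3000, 5000])
runtime_s = np.array([594, 605, 675, 813, 1001])

slope, intercept = np.polyfit(n_tested, runtime_s, 1)
print(f"fixed overhead ~{intercept:.0f} s, marginal cost ~{slope:.3f} s/tokenizer")

# e.g. number of tokenizers testable within a 30-minute job under this model
budget_s = 30 * 60
print(f"~{(budget_s - intercept) / slope:.0f} tokenizers fit in 30 minutes")
```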
