understanding-search / maze-dataset

maze datasets for investigating OOD behavior of ML systems

Tokenizer overhaul (complete) #38

Closed: mivanit closed this 2 months ago

mivanit commented 5 months ago

TODOS:

(roughly in order of importance)

review-notebook-app[bot] commented 5 months ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



aaron-sandoval commented 3 months ago

> see if any more utils should be moved to `muutils`

Except for `all_instances`, I don't think any more of the utilities in `utils.py` that were newly added in this PR should be moved, since they are more narrowly applicable to mazes. There are a few older ones that could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it, here are my recommendations. I only checked for uses in the two libraries (maze-dataset and maze-transformer).

| Function | Where Used | Recommendation |
|---|---|---|
| `all_instances` | maze-dataset | Move |
| `bool_array_from_string` | maze-dataset, maze-transformer | Move |
| `adj_list_to_nested_set` | maze-dataset, maze-transformer | Maybe keep in maze-dataset since it's at least lattice graph-specific |
| `apply_mapping` | Nowhere | Move |
| `apply_mapping_chain` | Nowhere | Move |
aaron-sandoval commented 3 months ago

> unclear if `save_hashes` is being properly parallelized :/

I don't know of a proper way to verify this. As a quick check, I just ran a few trials on my new desktop with different numbers of processes in the `Pool`. Here are the wall-clock times reported by the spinner on `list(pool.map(hash, all_tokenizers))`.

| # Processes | Spinner Runtime [sec] |
|---|---|
| No `Pool`; `list(map(hash, all_tokenizers))` | 59 |
| 1 | 75 |
| 2 | 39 |
| 4 | 21 |
| 6 | 19 |
| 12 | 19 |
| 24 | 21 |

The multiprocessing is clearly helping compared to a single core, but I don't know why it plateaus so early. I tried using `Pool.apply_async`, and that was much slower. IMO the runtime doesn't seem long enough to warrant sinking time into this now, since neither of us seems to have the relevant knowledge at the moment. This is the kind of backend optimization that can be done after the 1.0.0 release.
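For reference, here is a minimal standalone sketch of this timing experiment. The workload below is a stand-in so the snippet runs without maze-dataset installed; the real benchmark hashes the tokenizers returned by `get_all_tokenizers()`, and the numbers above will not reproduce exactly on other machines:

```python
import time
from multiprocessing import Pool

# Stand-in workload: the real benchmark hashes the list returned by
# get_all_tokenizers(), which is much more expensive per item.
ALL_TOKENIZERS = [tuple(range(i, i + 64)) for i in range(100_000)]


def time_hashing(n_processes: int | None) -> float:
    """Wall-clock seconds to hash everything, with or without a Pool."""
    start = time.perf_counter()
    if n_processes is None:
        list(map(hash, ALL_TOKENIZERS))  # baseline: no Pool
    else:
        with Pool(processes=n_processes) as pool:
            list(pool.map(hash, ALL_TOKENIZERS))
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (None, 1, 2, 4, 6, 12, 24):
        print(f"processes={n}: {time_hashing(n):.2f} s")
```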

aaron-sandoval commented 2 months ago

@mivanit I know we're wrapping this up, but I was thinking about a small change to the `mark_as_unsupported` decorator. This shouldn't change any behavior in any tests.

Right now we use that decorator, as well as single-valued `Literal` types, to artificially restrict the space returned by `all_instances`. You had mentioned before that ideally this artificial restriction would hide those subspaces from `all_instances` but still allow users to access them by directly constructing tokenizers with the desired parameters. The `Literal` types allow that direct construction, but the current implementation of `mark_as_unsupported` doesn't, since the dummy abstract method precludes any construction of that class.

What if instead `mark_as_unsupported` just overwrote that class's `is_valid` method to always return `False`? That would keep those classes constructable. Calls to `get_all_tokenizers` would still ignore them just the same, and direct use of `all_instances` would ignore them as long as users pass `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` as the documentation asks. We could maybe include `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` by default in calls to `all_tokenizers` so that users have to consciously opt out of those validation funcs.

I was thinking about this when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Let me know what you think, and if you're good with it I can make a super quick commit to implement it.

mivanit commented 2 months ago

> @mivanit I know we're wrapping this up, but I was thinking about a small change to the `mark_as_unsupported` decorator. This shouldn't change any behavior in any tests.
>
> Right now we use that decorator, as well as single-valued `Literal` types, to artificially restrict the space returned by `all_instances`. You had mentioned before that ideally this artificial restriction would hide those subspaces from `all_instances` but still allow users to access them by directly constructing tokenizers with the desired parameters. The `Literal` types allow that direct construction, but the current implementation of `mark_as_unsupported` doesn't, since the dummy abstract method precludes any construction of that class.
>
> What if instead `mark_as_unsupported` just overwrote that class's `is_valid` method to always return `False`? That would keep those classes constructable. Calls to `get_all_tokenizers` would still ignore them just the same, and direct use of `all_instances` would ignore them as long as users pass `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` as the documentation asks. We could maybe include `MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS` by default in calls to `all_tokenizers` so that users have to consciously opt out of those validation funcs.
>
> I was thinking about this when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Let me know what you think, and if you're good with it I can make a super quick commit to implement it.

You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing" filter, but if you think this is a quick edit, then I think it's fine to make as long as we add a warning when initializing an unsupported tokenizer.
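For concreteness, here is a rough sketch of what that change could look like. The names `mark_as_unsupported`, `is_valid`, and `get_all_tokenizers` come from the discussion above, but the exact signatures, the warning wording, and the patching approach are assumptions rather than the actual maze-dataset implementation:

```python
import warnings


def mark_as_unsupported(cls: type) -> type:
    """Sketch: mark a tokenizer element class as unsupported without making it
    abstract, so it stays directly constructable (with a warning)."""
    original_init = cls.__init__

    def patched_init(self, *args, **kwargs):
        # warn on construction so users know they are opting into an
        # experimental, untested subspace of tokenizers
        warnings.warn(
            f"{cls.__name__} is unsupported/experimental and is excluded "
            "from get_all_tokenizers()."
        )
        original_init(self, *args, **kwargs)

    def is_valid(self) -> bool:
        # always report invalid, so validation-aware calls to all_instances
        # (e.g. with MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS) skip it
        return False

    cls.__init__ = patched_init
    cls.is_valid = is_valid
    return cls
```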

mivanit commented 2 months ago

Making a note for profiling and figuring out how many tokenizers we can test:

| Time (MM:SS) | Time (s) | n tokenizers tested | Link |
|---|---|---|---|
| 9:54 | 594 | 100 | action |
| 10:05 | 605 | 300 | action |
| 11:15 | 675 | 1000 | action |
| 13:33 | 813 | 3000 | action |
| 16:41 | 1001 | 5000 | action |
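As a rough back-of-the-envelope check, a linear fit to the timings in the table (an estimate, not a CI measurement) gives roughly 580 s of fixed overhead plus about 0.08 s per tokenizer, which can be used to guess how many tokenizers fit in a given time budget:

```python
import numpy as np

# timings from the table above: (n tokenizers tested, total runtime in seconds)
n_tested = np.array([100, 300, 1000, 3000, 5000])
runtime_s = np.array([594, 605, 675, 813, 1001])

slope, intercept = np.polyfit(n_tested, runtime_s, 1)
print(f"fixed overhead ~{intercept:.0f} s, marginal cost ~{slope:.3f} s/tokenizer")

# e.g. number of tokenizers testable within a 30-minute job under this model
budget_s = 30 * 60
print(f"~{(budget_s - intercept) / slope:.0f} tokenizers fit in 30 minutes")
```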
