Closed mivanit closed 2 months ago
Check out this pull request on
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
see if any more utils should be moved to muutils
Except for all_instances
, I don't think any more of the utilities in utils.py which were newly added in this PR should be moved due to their more narrow applicability to mazes. There are a few older ones which could be moved. I didn't think you wanted to move elements of the old API if it could be helped, but if you're open to it here are my recommendations. I only checked for uses in the 2 libraries.
Function | Where Used | Recommendation |
---|---|---|
all_instances |
maze-dataset | Move |
bool_array_from_string |
maze-dataset, maze-transformer | Move |
adj_list_to_nested_set |
maze-dataset, maze-transformer | Maybe keep in m-d since it's at least lattice graph-specific |
apply_mapping |
Nowhere | Move |
apply_mapping_chain |
Nowhere | Move |
unclear if save_hashes is being properly parallelized :/
Idk a proper way to verify this. As something quick, I just ran a few trials on my new desktop with different numbers of processes
in the Pool
. Here are the wall clock times reported by the spinner
on list(pool.map(hash, all_tokenizers))
.
# Processes | Spinner Runtime [sec] |
---|---|
No Pool ; list(map(hash, all_tokenizers)) |
59 |
1 | 75 |
2 | 39 |
4 | 21 |
6 | 19 |
12 | 19 |
24 | 21 |
The multiprocessing is clearly helping compared to a single core, but idk why it plateaus so early. I tried using Pool.apply_async
, and that was way slower. IMO, runtime doesn't seem long enough to warrant sinking time into this now since neither of us seem to have the knowledge atm. This is the kind of backend optimization that can be done after the 1.0.0 release.
@mivanit I know we're wrapping this up, but I was thinking about a small change to the mark_as_unsupported
decorator. This shouldn't change any behavior in any tests. Right now we use that decorator as well as single-valued Literal
types to artificially restrict the space returned by all_instances
. You had mentioned before that ideally this artificial restriction would hide those subspaces from all_instances
but still allow users to access them by directly construct tokenizers with the desired parameters. The Literal
types allow that direct construction, but the current implementation of mark_as_unsupported
doesn't since the dummy abstract method precludes any construction of that class.
What if instead mark_as_unsupported
just overwrote that class's is_valid
method to just return False
all the time? That would make those classes still constructable. Calls to get_all_tokenizers
would still ignore them just the same, and direct use of all_instances
would ignore them if users use MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS
like the documentation asks them. We could maybe make MAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS
included by default in calls to all_tokenizers
so that users have to consciously opt out of those validation funcs.
I was thinking about it when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Lmk what you think, and if you're good with it I can make a super quick commit to implement it.
@mivanit I know we're wrapping this up, but I was thinking about a small change to the
mark_as_unsupported
decorator. This shouldn't change any behavior in any tests. Right now we use that decorator as well as single-valuedLiteral
types to artificially restrict the space returned byall_instances
. You had mentioned before that ideally this artificial restriction would hide those subspaces fromall_instances
but still allow users to access them by directly construct tokenizers with the desired parameters. TheLiteral
types allow that direct construction, but the current implementation ofmark_as_unsupported
doesn't since the dummy abstract method precludes any construction of that class. What if insteadmark_as_unsupported
just overwrote that class'sis_valid
method to just returnFalse
all the time? That would make those classes still constructable. Calls toget_all_tokenizers
would still ignore them just the same, and direct use ofall_instances
would ignore them if users useMAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS
like the documentation asks them. We could maybe makeMAZE_TOKENIZER_MODULAR_DEFAULT_VALIDATION_FUNCS
included by default in calls toall_tokenizers
so that users have to consciously opt out of those validation funcs. I was thinking about it when considering if/how to talk about these restricted subspaces in the paper. Right now I'd probably just ignore them completely, but if we made this change I might mention them as experimental, untested features. Lmk what you think, and if you're good with it I can make a super quick commit to implement it.
You're right that the current implementation would require people to manually edit their locally installed version of the package to get the unsupported tokenizers to work -- I actually think this behavior is fine and serves as a "make sure you know what you're doing filer", but if you think this is a quick edit then I think it's fine to make as long as we add a warning when initializing an unsupported tokenizer.
TODOS:
(roughly in order of importance)
checks.yml
:all-tok-checks.yml
:[x] depends on #37 being merged into
tokenizer-overhaul-integration
first[x] depends on tests passing in
maze-transformer
#214[x]
make sure that tokenizer hash numpy file is shipped withmaze-dataset
packagemaybe make it optional, only iftokenization
extra is requested?[x] go over existing
# TODO
comments, see if anything is criticaltests/unit/maze_dataset/generation/test_maze_dataset.py
[x] clean up various superfluous warnings
[x] consider optimizations to
all_instances
andsave_hashes
save_hashes
is being properly parallelized :/[x] see if any more utils should be moved to muutils