google / orbax

Orbax provides common checkpointing and persistence utilities for JAX users
https://orbax.readthedocs.io/
Apache License 2.0
296 stars 33 forks source link

Code duplication around tensorstore_spec logic in orbax-checkpoint #1241

Open minotru opened 3 weeks ago

minotru commented 3 weeks ago

Hi Orbax team,

I was looking at Orbax code at the latest version==0.7.0 and found that pieces of code with quite heavy logic around tensorstore_spec creation seem to contain duplicates.

I'd like to know if this code duplication intended by design or I am welcome to submit a PR.

Here get_tensorstore_spec is a part of public API, and I can't find any usage of get_tensorstore_spec by orbax-checkpoint itself https://github.com/google/orbax/blob/8b4e90d573082a5c7caa5f99c51db376f62a6995/checkpoint/orbax/checkpoint/serialization.py#L97C5-L124

And here is a very similar piece of code in build_kvstore_tspec in _internal package, and build_kvstore_tspec is used heavily by type_handlers.py https://github.com/google/orbax/blob/8b4e90d573082a5c7caa5f99c51db376f62a6995/checkpoint/orbax/checkpoint/_src/serialization/tensorstore_utils.py#L62-L135

Would you consider get_tensorstore_spec to reuse build_kvstore_tspec under the hood?


Also, there seems to be a bit of obscurity with default ts_context value.

So, TS_CONTEXT from serialization.py seems to be never used by common checkpoint IO code.

Should we somehow leave only 1 source of truth for default ts_context values?

cpgaffney1 commented 3 weeks ago

This code is currently in the process of a rework by @dicentra13. get_tensorstore_spec in serialization.py should be either removed or should reuse build_kvstore_tspec. I think it is only used in a couple places in internal code - just due slower progress in refactoring rather than any inherent need.

Again, async_serialize is used in one place internally that I'm working on eliminating, although probably will leave that function in place as a wrapper.

Agreed that ts_context should use one source of truth, that could use fixing.

minotru commented 3 weeks ago

Got it @cpgaffney1 , thank for the reply! Waiting for the refactoring by @dicentra13 :)

Let's close this issue once refactoring is completed and merged