Closed riedgar-ms closed 3 weeks ago
Attention: Patch coverage is 59.78261%
with 37 lines
in your changes missing coverage. Please review.
Project coverage is 60.35%. Comparing base (
becc12d
) to head (9e9b6df
).
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Not quite sure what I did to get this fixed, but my local runs of the Azure OpenAI tests now pass. Will probably merge tomorrow, absent objections
I really like moving the joint_tokenize into the Tokenizer class and the unification around the byte string contract! Encode/decode/recode is also a very nice paradigm. Love it
I'm admittedly more skeptical about the rest of the renaming/typing/re-formatting pieces here. Mypy feels like it might be causing more headaches than the documentation benefits it brings, and black's odd multi-lining is still a bit off. My preference is to err on the side of minimalism as much as we can. Perhaps we can open up a separate discussion on this? A bit of an aside, but I've really enjoyed Raymond Hettinger's approach/philosophy to these things (e.g. https://www.youtube.com/watch?v=wf-BqAjZb8M).
That said, back to the PR -- I'll reiterate that the core changes are really nice and definitely a step towards improving the Tokenizer/Model/Engine relationships here. Thanks for taking a stab at understanding and unifying this @riedgar-ms!
@Harsha-Nori are there particular bits of reformatting you'd like me to undo?
@riedgar-ms Big +1 to the endcode/decode/recode approach here!
@Harsha-Nori re: formatting, type hints, etc... I agree that black sometimes makes things ugly. PEP 8 isn't a god we should all bow to. On that note, neither is mypy... +1 for Raymond Hettinger. He'll keep me company this morning while I drink my coffee and catch up on some things I missed this weekend.
On the mypy stuff -- I'm actually a big fan of type hinting. Python isn't and almost certainly never will be statically typed, but a bit of static type checking sprinkled on top can save so many headaches...
Slight philosophical digression, but you can solve a lot of physics problems with dimensional analysis alone. Back when I used to write a lot of torch and tensorflow, my first step when doing any kind of debugging was to "look at the shapes" (I have a *B x N x T tensor... I'm contracting it with a T x D tensor...). Reasoning about types similarly gives a nice skeleton for starting to reason about what some code is doing before you have to start thinking about the actual logic.
A lot of the engine code is pretty opaque at the moment, and I think that not having a "data model" to fall back on for reasoning about this stuff is making the problem worse than it needs to be. I think that most of the hints that @riedgar-ms added here are the helpful kind (encoding a byte string gives a sequence of ints, decoding a sequence of ints gives a byte string, recoding a sequence of ints gives a sequence of ints, ...).
That being said, sometimes I do find myself fighting against mypy. Annotations like tuple(int, tuple(int, Sequence[int], float), bool)
don't provide any insight (but a lightweight NamedTuple
data model can help in these situations...).
I'm rambling now. But TLDR: I think type hints are usually helpful, and I don't see anything egregious in this diff. @Harsha-Nori any particular use of typing that rubbed you the wrong way here?
That being said, sometimes I do find myself fighting against mypy. Annotations like
tuple(int, tuple(int, Sequence[int], float), bool)
don't provide any insight (but a lightweightNamedTuple
data model can help in these situations...).
I would agree that annotations like that point to a missing type :-)
Ah sorry, I was rambling myself earlier! It's less about any specific hints here -- these are all fine :) -- and more about anticipated fights with the build system and mypy hiccups if there's some detection or misspecification problem. I've heard from several people now that getting mypy to behave was a significant pain. Might just be a short term pain thing though!
So.... is that a signoff?
Looks good to me -- just made a few little comments. Thanks for putting in work do disentangle these classes.
The next stage of refactoring
Tokenizer
; start having the product code conform to the tweakedTokenizer
class. Included in this:_joint_tokenize()
onto theTokenizer
(asrecode()
), since it really belongs thereTokenizer
s on the 'grammarless' side of things conform to the same contract (i.e. deal with bytes not strings)