[Refactor] Tokenizer refactor continued

riedgar-ms commented 3 weeks ago

The next stage of refactoring Tokenizer; start having the product code conform to the tweaked Tokenizer class. Included in this:

Moving _joint_tokenize() onto the Tokenizer (as recode()), since it really belongs there
Making the Tokenizers on the 'grammarless' side of things conform to the same contract (i.e. deal with bytes not strings)
Added some MyPy annotations, and fixed a few issues

codecov-commenter commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 59.78261% with 37 lines in your changes missing coverage. Please review.

Project coverage is 60.35%. Comparing base (becc12d) to head (9e9b6df).

Files	Patch %	Lines
guidance/models/_grammarless.py	24.00%	19 Missing :warning:
guidance/models/transformers/_transformers.py	71.87%	9 Missing :warning:
guidance/models/_openai.py	33.33%	6 Missing :warning:
guidance/models/_azureai_studio.py	25.00%	3 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #894 +/- ## ========================================== + Coverage 56.99% 60.35% +3.35% ========================================== Files 64 64 Lines 4651 4674 +23 ========================================== + Hits 2651 2821 +170 + Misses 2000 1853 -147 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

riedgar-ms commented 3 weeks ago

Not quite sure what I did to get this fixed, but my local runs of the Azure OpenAI tests now pass. Will probably merge tomorrow, absent objections

Harsha-Nori commented 3 weeks ago

I really like moving the joint_tokenize into the Tokenizer class and the unification around the byte string contract! Encode/decode/recode is also a very nice paradigm. Love it

I'm admittedly more skeptical about the rest of the renaming/typing/re-formatting pieces here. Mypy feels like it might be causing more headaches than the documentation benefits it brings, and black's odd multi-lining is still a bit off. My preference is to err on the side of minimalism as much as we can. Perhaps we can open up a separate discussion on this? A bit of an aside, but I've really enjoyed Raymond Hettinger's approach/philosophy to these things (e.g. https://www.youtube.com/watch?v=wf-BqAjZb8M).

That said, back to the PR -- I'll reiterate that the core changes are really nice and definitely a step towards improving the Tokenizer/Model/Engine relationships here. Thanks for taking a stab at understanding and unifying this @riedgar-ms!

riedgar-ms commented 3 weeks ago

@Harsha-Nori are there particular bits of reformatting you'd like me to undo?

hudson-ai commented 3 weeks ago

@riedgar-ms Big +1 to the endcode/decode/recode approach here!

@Harsha-Nori re: formatting, type hints, etc... I agree that black sometimes makes things ugly. PEP 8 isn't a god we should all bow to. On that note, neither is mypy... +1 for Raymond Hettinger. He'll keep me company this morning while I drink my coffee and catch up on some things I missed this weekend.

On the mypy stuff -- I'm actually a big fan of type hinting. Python isn't and almost certainly never will be statically typed, but a bit of static type checking sprinkled on top can save so many headaches...

Slight philosophical digression, but you can solve a lot of physics problems with dimensional analysis alone. Back when I used to write a lot of torch and tensorflow, my first step when doing any kind of debugging was to "look at the shapes" (I have a *B x N x T tensor... I'm contracting it with a T x D tensor...). Reasoning about types similarly gives a nice skeleton for starting to reason about what some code is doing before you have to start thinking about the actual logic.

A lot of the engine code is pretty opaque at the moment, and I think that not having a "data model" to fall back on for reasoning about this stuff is making the problem worse than it needs to be. I think that most of the hints that @riedgar-ms added here are the helpful kind (encoding a byte string gives a sequence of ints, decoding a sequence of ints gives a byte string, recoding a sequence of ints gives a sequence of ints, ...).

That being said, sometimes I do find myself fighting against mypy. Annotations like tuple(int, tuple(int, Sequence[int], float), bool) don't provide any insight (but a lightweight NamedTuple data model can help in these situations...).

I'm rambling now. But TLDR: I think type hints are usually helpful, and I don't see anything egregious in this diff. @Harsha-Nori any particular use of typing that rubbed you the wrong way here?

riedgar-ms commented 3 weeks ago

That being said, sometimes I do find myself fighting against mypy. Annotations like tuple(int, tuple(int, Sequence[int], float), bool) don't provide any insight (but a lightweight NamedTuple data model can help in these situations...).

I would agree that annotations like that point to a missing type :-)

Harsha-Nori commented 3 weeks ago

Ah sorry, I was rambling myself earlier! It's less about any specific hints here -- these are all fine :) -- and more about anticipated fights with the build system and mypy hiccups if there's some detection or misspecification problem. I've heard from several people now that getting mypy to behave was a significant pain. Might just be a short term pain thing though!

riedgar-ms commented 3 weeks ago

So.... is that a signoff?

hudson-ai commented 3 weeks ago

Looks good to me -- just made a few little comments. Thanks for putting in work do disentangle these classes.

guidance-ai / guidance

[Refactor] Tokenizer refactor continued #894

Codecov Report