snakemake / snakefmt

The uncompromising Snakemake code formatter
MIT License

fix: support python3.12 f-string tokenizing #213

Closed: mbhall88 closed this pull request 9 months ago

mbhall88 commented 10 months ago

Python 3.12 formalises f-strings by parsing them as their own entities rather than as plain strings. This creates some problems for us: where a whole f-string previously produced a single tokenize.STRING token, Python 3.12 now breaks f-strings up into pieces, introducing the new tokens FSTRING_START, FSTRING_MIDDLE and FSTRING_END.
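For illustration, the version split can be observed directly with the standard library tokenizer. This snippet is not part of the PR; it just prints the token types produced for a small f-string on whichever Python version runs it:

```python
import io
import sys
import tokenize

# Tokenize a small f-string and list the token types produced.
# On Python <= 3.11 the whole f-string arrives as one STRING token;
# on Python >= 3.12 (PEP 701) it is split into FSTRING_START,
# FSTRING_MIDDLE and FSTRING_END, with the replacement field
# tokenized as ordinary OP/NAME tokens.
src = 'x = f"hello {name}!"\n'
tokens = tokenize.generate_tokens(io.StringIO(src).readline)
names = [tokenize.tok_name[tok.type] for tok in tokens]
print(names)

if sys.version_info >= (3, 12):
    assert "FSTRING_START" in names and "FSTRING_END" in names
else:
    assert "STRING" in names
```

On 3.12 the `{name}` replacement field shows up as its own OP/NAME tokens between the FSTRING pieces, which is exactly what forces a formatter to reassemble the string itself.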

The fixes in this PR relate to issues people have already come across; I suspect there will be more. My solution is somewhat hacky in places, but it was the best I could do given the tokens the tokenize library gives us. #207 is particularly problematic, as it is the tokenizer itself that removes the extra braces, which is why I manually add them back. Fingers crossed this doesn't raise further issues.

I have also added Python 3.12 to the CI so we are testing across both <3.12 and >=3.12.

Closes #207 and closes #210

codecov[bot] commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (a80fa26) 97.73% compared to head (62bca59) 97.76%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #213      +/-   ##
==========================================
+ Coverage   97.73%   97.76%   +0.02%
==========================================
  Files          12       12
  Lines        1017     1028      +11
  Branches      223      228       +5
==========================================
+ Hits          994     1005      +11
  Misses         12       12
  Partials       11       11
```

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `97.66% <90.90%> (-0.08%)` | :arrow_down: |

| Files | Coverage Δ | |
|---|---|---|
| `snakefmt/parser/syntax.py` | `98.91% <100.00%> (+0.04%)` | :arrow_up: |
bricoletc commented 10 months ago

Second comment: echoing my review of #211, do you want to have a chat about a formal lexer/parser for snakefmt?

bricoletc commented 10 months ago

Thanks for fixing these very important bugs @mbhall88! A bit unfortunate for us that Python 3.12 changes f-string tokenisation.

I'm just going to try to summarise in words what I understand from this PR, what I still don't understand from snakefmt (:cry:), and what I think we might need to do specifically here:

* You're adding 'space addition' through the function `add_token_space` by registering `FSTRING_START` in the `spacing_triggers` dict; why can't we run f-strings through `black` again?
* If we **can** run f-strings through `black`, can't we try to run `process_token` from encountering `FSTRING_START` to encountering `FSTRING_END`, as per the new f-string [token sequence](https://peps.python.org/pep-0701/#new-tokens)? I'm thinking of something like lines 394 and 396, where if we're `in_brackets` or `in_lambda` we keep parsing instead of triggering snakefmt parameter resolution. Here we'd similarly ask whether we're `in_fstring`. BUT it seems that you're having to parse `FSTRING_MIDDLE` tokens individually (lines 335-341) and add braces back because they get lost?

mbhall88 commented 10 months ago

> why can't we run f-strings through `black` again?

We can run them through black, but sadly they now get mangled by snakefmt's parser before they reach black.

> Here we'd similarly ask whether we're `in_fstring`.

That is an interesting idea. I'm not sure I completely understand how we would do this, though, as I think this section of the code was your creation and I'm not as familiar with it. Feel free to have a go at adding an `in_fstring` variant.

> BUT it seems that you're having to parse `FSTRING_MIDDLE` tokens individually and add braces back because they get lost?

Yes, I had to re-add them because tokenize automatically removes them, so we only get a single brace where doubles were used. They are also returned inside their surrounding strings rather than as separate tokens, which made things slightly trickier.

bricoletc commented 10 months ago

So in fact, IIUC, the tokenizer gets rid of the doubled '{' and '}' inside f-strings, so we have to re-add them manually? And thus we can't just try to 'parse through' an f-string from `FSTRING_START` to `FSTRING_END`, because that way we lose the '{{' and '}}' in f-strings?

mbhall88 commented 10 months ago

Essentially yes, this is the problem I ran into. Another option I had thought of was: once we hit `FSTRING_START`, we greedily take tokens until `FSTRING_END` and use the positions of those tokens in the original string (I assume we get those?) to pull out the whole f-string and make a single `STRING` token out of it. However, f-strings gained functionality in Python 3.12 that is not compatible with 3.11, though I'm not sure whether this would be a problem when passing the string to black or not...

mbhall88 commented 10 months ago

> Second comment: echoing my review of #211, do you want to have a chat about a formal lexer/parser for snakefmt?

See my message on Slack.

bricoletc commented 9 months ago

Replying to your reply https://github.com/snakemake/snakefmt/pull/213#issuecomment-1882079723: I think the solution you've implemented is currently the best, and it's actually very succinct. In general I'm not a fan of extracting chunks of the original text using pointers stored in the tokens; we've avoided doing that so far, IIRC. And my idea of trying to parse through the whole f-string and passing it to black will not work because of the loss of the doubled curly brackets. Incidentally, why do they now do that?? It's a pain for us.
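For concreteness, the rejected "pointers stored in the tokens" approach would look roughly like the sketch below (`extract_fstring` is a hypothetical name, not snakefmt code): greedily span from FSTRING_START to FSTRING_END and slice the raw source by the tokens' (row, col) positions. It also shows why it was avoided: it reaches back into the original text instead of working from tokens alone.

```python
import io
import tokenize
from typing import Optional


def extract_fstring(source: str) -> Optional[str]:
    """Slice the first f-string out of `source` using token positions.

    Only meaningful on Python >= 3.12, where f-strings produce
    FSTRING_START/FSTRING_END tokens; returns None when no such
    tokens appear (e.g. on Python <= 3.11).
    """
    lines = source.splitlines(keepends=True)
    start = end = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        name = tokenize.tok_name[tok.type]
        if name == "FSTRING_START" and start is None:
            start = tok.start
        elif name == "FSTRING_END":
            end = tok.end
            break
    if start is None or end is None:
        return None
    (srow, scol), (erow, ecol) = start, end
    if srow == erow:
        return lines[srow - 1][scol:ecol]
    # Multi-line f-string: first line tail, middle lines, last line head.
    return (lines[srow - 1][scol:]
            + "".join(lines[srow:erow - 1])
            + lines[erow - 1][:ecol])
```

Because this slices the original source rather than concatenating token strings, it sidesteps the collapsed-brace problem, but at the cost of carrying raw-text positions through the formatter, which the discussion above deliberately avoids.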

So this is good to merge, thanks @mbhall88 !