pyOpenSci / pyosMeta

A package that updates pyOpenSci contributor and package metadata on our website
BSD 3-Clause "New" or "Revised" License

Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

Closed: sneakers-the-rat closed this 1 month ago

sneakers-the-rat commented 1 month ago

Fix: https://github.com/pyOpenSci/pyosMeta/issues/173

I've wanted to do this for a while: I refactored the issue parser so it should be a bit easier to make modifications to it going forward.

Previously the field-parsing logic was split across a bunch of different places, but this brings it into a linear set of lifecycle methods for parsing a review issue:

Currently: (diagram of the existing, scattered parsing flow)

This PR: (diagram of the new linear lifecycle flow)
This gives clear places to add parsing logic, in a preprocessing, field-processing, or postprocessing step. It also uses pydantic models instead of anonymous lists of lists, dicts of dicts, etc., so we know what to expect at each stage. A few of the methods were awkward because they expected to operate over the whole list-of-lists produced by splitting the issue body on '\n' and ': ', so this separates things out so they should be easier to maintain and test.
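As a rough illustration of the lifecycle shape (model and function names here are hypothetical, not pyosMeta's actual API):

```python
# Illustrative sketch only: typed lifecycle stages for parsing a review
# issue. Each stage takes and returns a typed value, so a test can target
# any one step in isolation.
from pydantic import BaseModel


class Review(BaseModel):
    """Stand-in for the real ReviewModel: typed fields, not dicts of dicts."""
    package_name: str = ""
    submitting_author: str = ""
    editor: str = ""


def preprocess(body: str) -> dict[str, str]:
    """Split the raw issue body into key/value pairs on newlines and ': '."""
    fields: dict[str, str] = {}
    for line in body.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


def process_fields(fields: dict[str, str]) -> dict[str, str]:
    """Per-field cleanup, e.g. stripping the '@' from GitHub handles."""
    if "editor" in fields:
        fields["editor"] = fields["editor"].lstrip("@")
    return fields


def parse_issue(body: str) -> Review:
    """Linear lifecycle: preprocess -> process fields -> validate."""
    fields = process_fields(preprocess(body))
    return Review(**{k: v for k, v in fields.items() if k in Review.model_fields})
```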


sneakers-the-rat commented 1 month ago

I guess the thing that would be nice here, re: https://github.com/pyOpenSci/pyopensci.github.io/pull/434, is single-sourcing the template: ideally we would give the human-readable names as aliases or titles, and add a classmethod to ReviewModel that generates the review template. Changing the review template would then be a change to ReviewModel, with no gap between them where it's possible to forget one side. Then we could keep example reviews from each time the template changes, to make sure we still support the current and all prior review checklist formats.
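A rough sketch of the single-sourcing idea (the field names and to_template method are hypothetical, not pyosMeta's actual API):

```python
# Hypothetical sketch: each field's title doubles as its human-readable
# label in the review checklist, and a classmethod renders the template
# from the model itself, so the two can never drift apart.
from pydantic import BaseModel, Field


class ReviewModel(BaseModel):
    package_name: str = Field(default="", title="Package Name")
    submitting_author: str = Field(default="", title="Submitting Author")
    editor: str = Field(default="", title="Editor")

    @classmethod
    def to_template(cls) -> str:
        """Generate the review-issue template from the fields' titles."""
        return "\n".join(
            f"{info.title or name}: " for name, info in cls.model_fields.items()
        )


print(ReviewModel.to_template())
# Package Name:
# Submitting Author:
# Editor:
```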

sneakers-the-rat commented 1 month ago

It looks like a number of fields were marked as Optional in ReviewModel, presumably to be able to test with a reduced input set, but that causes us to improperly validate review issues that are missing information.
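For example (illustrative models, not the actual ReviewModel fields):

```python
# An Optional field with a None default validates even when the review
# issue never provided it, silently masking missing data; a required
# field fails fast instead.
from typing import Optional

from pydantic import BaseModel, ValidationError


class LooseReview(BaseModel):
    editor: Optional[str] = None


class StrictReview(BaseModel):
    editor: str


LooseReview()  # validates fine; editor is silently None

try:
    StrictReview()
except ValidationError as err:
    print(err)  # "editor: Field required" -- the missing data surfaces here
```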

sneakers-the-rat commented 1 month ago

It would be nice to have something like VCR to record all the website updates and put that in pytest for local testing, because right now I'm just using the CI as my pytest, since I don't have the GITHUB_TOKEN.
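A minimal sketch of what that could look like with vcrpy (the cassette path, URL, and test body here are illustrative):

```python
# vcrpy records HTTP traffic to a "cassette" on the first run and replays
# it afterwards, so this test passes offline and without a GITHUB_TOKEN.
import requests
import vcr


@vcr.use_cassette(
    "tests/cassettes/github_user.yaml",
    filter_headers=["authorization"],  # keep the token out of the cassette
)
def test_fetch_github_user():
    resp = requests.get("https://api.github.com/users/sneakers-the-rat")
    assert resp.status_code == 200
    assert "name" in resp.json()
```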

sneakers-the-rat commented 1 month ago

OK, there ya have it. Since the model has a ton of optional params, I can't be sure that things are actually still working, but that's enough for one day. Hopefully this saves us some headaches in the future, makes it a little easier to tweak the review template, and speeds adoption of the pyos bot by making it easier to handle parsing multiple forms of the review template.

lwasser commented 1 month ago

It would be nice to have something like VCR to record all the website updates and put that in pytest for local testing, because right now I'm just using the CI as my pytest, since I don't have the GITHUB_TOKEN.

@sneakers-the-rat what would that look like? I've used VCR (cassettes?), I think, in a package where we did API things, but someone else set it up.

lwasser commented 1 month ago

@sneakers-the-rat this is working really well. There is one other issue: when I run update-review-teams, it used to populate the GitHub data from each user's GitHub profile, specifically name, given we always have the GitHub username. But now it doesn't seem to be populating that.

Given this is a big refactor, should we handle that separately? Or is it a small enough thing to fix in this PR (along with the other tiny thing, the label name, which I commented on)?


Thank you for this. I refactored this code from a much bigger mess to pydantic last year and learned a TON, but I knew it was still really hard to maintain and debug. My brain isn't always great at knowing how best to refactor things, so I really appreciate this PR.

The test-issue idea in the _data folder is so smart. We can add other examples of issues to process in the future as things come up.

lwasser commented 1 month ago

@all-contributors please add @sneakers-the-rat for code, review

allcontributors[bot] commented 1 month ago

@lwasser

I've put up a pull request to add @sneakers-the-rat! :tada:

sneakers-the-rat commented 1 month ago

There is one other issue: when I run update-review-teams, it used to populate the GitHub data from each user's GitHub profile, specifically name, given we always have the GitHub username. But now it doesn't seem to be populating that. ... Given this is a big refactor, should we handle that separately? Or is it a small enough thing to fix in this PR (along with the other tiny thing, the label name, which I commented on)?

Hmm, mysterious. Seems worth fixing; let me see.

sneakers-the-rat commented 1 month ago

@lwasser here try this - https://github.com/pyOpenSci/pyosMeta/pull/174/commits/8ff2ac35d7715f04375aff76fef3551e10439ebb

Looks like all_active_maintainers was special-cased in the body of the block that handles the username, instead of using the role as a generic index. That block was also just duplicated from the single-user case, so I consolidated them into a single function. It could definitely use some more work, because there's some stuff in there that I'm not sure what it's up to, but hopefully that fixes the problem?
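Roughly the shape of that consolidation (function and field names here are illustrative, not the actual pyosMeta code):

```python
# Hypothetical sketch: one helper backfills missing profile fields (like
# name) from GitHub data for any role, instead of special-casing
# all_active_maintainers inside the username-handling block.
def update_user_from_github(user: dict, gh_profiles: dict) -> dict:
    """Fill in missing fields on a contributor entry from their GitHub profile."""
    profile = gh_profiles.get(user.get("github_username", ""), {})
    for field in ("name", "organization", "website"):
        if not user.get(field):
            user[field] = profile.get(field, "")
    return user


def update_review_teams(review: dict, gh_profiles: dict) -> dict:
    """Apply the same helper to every role, single users and lists alike."""
    for role, value in review.items():
        if isinstance(value, list):
            review[role] = [update_user_from_github(u, gh_profiles) for u in value]
        elif isinstance(value, dict):
            review[role] = update_user_from_github(value, gh_profiles)
    return review
```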

lwasser commented 1 month ago

@lwasser here try this - 8ff2ac3

Looks like all_active_maintainers was special-cased in the body of the block that handles the username, instead of using the role as a generic index. That block was also just duplicated from the single-user case, so I consolidated them into a single function. It could definitely use some more work, because there's some stuff in there that I'm not sure what it's up to, but hopefully that fixes the problem?

Amazing, thank you. Yes, that CLI script is really, really hacky; I fought with how to create it in an easy-to-maintain and simple way. This is all running locally now. I'm going to merge, and I'm also going to bump your permissions here, Jonny.

The one item I'll open an issue about is the data-code-generator piece; I just want to document how that was done. Thank you so, so much!