Refactoring Jubilee - Githubissues

ssarrafan commented 2 years ago

Strategic refactoring tasks will be added here.

elais commented 2 years ago

there are several opportunities for enhancement of the existing code base that were raised both in the review and outside of it; particularly around code coverage and unit testing. This is my proposed task list for things to do. It seems like a lot but most of it is moving files around.

Code Coverage

[ ] Add Coverage.py to the project
[ ] Add pre-commit hook to lint and run coverage tool on changed files

Data Access Layer

[ ] Create top level infrastructure module

Databases

[ ] Create infrastructure/db module for DB access
[ ] Create concrete mongodb access module

Task Management

[ ] Create infrastructure/tasks modules task management access
[ ] Create concrete dagster access module
[ ] Move operations and graphs to dagster access module

CRUD

[ ] Create dedicated CRUD module
[ ] Create CRUD submodules for operations on our logical domains

API

[ ] Version new endpoints

api/endpoints/metadata.py

[ ] Remove direct dependency on mongo
[ ] Remove direct dependency on dagster
[ ] Remove direct dependency on starlette
[ ] Move CRUD operations to module

api/endpoints/runs.py

[ ] Remove direct dependency on mongo
[ ] Remove direct dependency on starlette
[ ] Move CRUD operations to module

dwinston commented 2 years ago

I want to set an objective function for all of our nmdc-runtime refactoring work, where, for every proposed total change (i.e. reviewable pull request), we seek to maximize the change in this function. Here's my proposed function:

Good (numerator stuff):

Percent Lines of Code (LOC) covered by a test. Sublinear growth here, i.e. diminishing returns on "getting to 100%". Coverage.py will be fine here.
Percent "public" units, i.e. non-underscored module importables - functions, classes, constants, variables/objects, covered by tutorial/how-to-guide/explanation (so excluding reference) docs. Maximizing runtime consumers' ability to understand functionality without have to dive into the codebase. Prior art on measuring this here.
Percent LOC unaffected by state (i.e. avoiding getting values from or calling methods on long-living references). Pure-functional code is easier to reason about (e.g. via a simple substitution mode of execution) and thus more maintainable. My strategy for measuring this would be to designate certain (sub)packages/modules as purely functional.

Bad (denominator stuff):

total LOC

@elais @shreddd what do you think?

elais commented 2 years ago

TL;DR Let's shoot for 80%

I think the function is a complex way of laying out a few practical rules for dealing with coverage and testing more broadly. What I would propose, unfortunately in English, is the following:

80% code coverage as a goal for all new code added, while adding unit tests to legacy (code made before tests, not obsolete or useless code) functions as the need arises due to discovered bugs or breaking changes.
Focus tests on interfaces (which I believe is what you call public units) i.e priority is given to exported functions. On the flipside this usually gets means coding interfaces first.
Abstract away state like database connections, task runner actions, etc.

As far as percent lines of code goes, generally we should see the percent lines of code go up as we simplify the interfaces we code to. For instance if we have

funcFoo(val):
    result = funcBar(val)
    return funcBaz(result)

A single test on funcFoo() will include funcBar() and funcBaz() in its coverage results, so the total LOC covered by this test will count the total lines of code for all three functions combined except branches and exceptions. To test each branch and exception would require separate unit tests but they can all be written in the context of funcFoo(). This is all to say that % coverage will grow sublinearly but having a static goal here of 80% at the outset will generally cover the happy path and the last 20% can be handled as bugs and regressions are discovered. Rather than a function we can just ask "is 80% of this procedure's... functionality covered?".

This ties indirectly into my desire to abstract out our data access layers. If we don't use pymongo directly we will be able to inject structures that represent the db connector and spit out fake data. This is fine for unit tests on CRUD operations (which will be the vast majority of our operations) because generally we want to know from a CRUD function is if given value A does it call all the things we expect it to call and do they manipulate the results in the way we expect them to. Testing live databases and if they accept and return the expected results is another class of tests that usually even if we ignore them entirely we can still hit 80% fairly easily by using the interface we have to those databases in multiple functions.

dwinston commented 2 years ago

Also, I propose an application/framework/library split, i.e. a disaggregation of repo code into three pip-publishable packages:

nmdc_runtime/app/: The application. Defines interactions with the environment and effects on the data model. Plugs itself into the framework. Plugs library elements into each other. If published as a package (optional), published with calendar versioning.

nmdc_runtime/frame/: The framework. A set of patterns. Facilitates reuse of design. The app imports framework code for inversion of control, so that the framework will manage calls. Also plugs library elements into each other. Published as a package with semantic versioning.

nmdc_runtime/lib/: The library. A set of functions and classes. Facilitates reuse of code. Published as a package with semantic versioning.

I would like to see the largest fraction of repo code in lib and the smallest fraction in app.

dwinston commented 2 years ago

I do like the idea of setting a minimum bar (and I like 80%) on test coverage across all new "public" importable units. I don't want new stuff added with poor coverage but the overall number goes up because of significant improvements to legacy code.

dwinston commented 2 years ago

I agree about abstracting data access layers. This will push code down the stack from app to frame and from frame to lib -- it will increase the fraction of code that is state-free / that we inject state/config into at the app level.

elais commented 2 years ago

I do like the idea of setting a minimum bar (and I like 80%) on test coverage across all new "public" importable units. I don't want new stuff added with poor coverage but the overall number goes up because of significant improvements to legacy code.

I forgot to agree with this in my last post. Testing "public" importable units is usually good enough, because the private unit invocation inside of the public unit function call will mark that private unit as covered by the coverage tool. At least it should, I've yet to encounter a coverage tool that acted differently.

Also, I propose an application/framework/library split, i.e. a disaggregation of repo code into three pip-publishable packages

I'm fine with this but in a former life I saw these folders usually named app, api, and infra/db, where the app folder contained the crud operations, infra contained repositories and database implementations, and api contained simply the api. In practice the api module would require from the app module and the app module would require repositories from the infra module. But this is fine, separation of concerns is a positive.

dwinston commented 2 years ago

where the app folder contained the crud operations, infra contained repositories and database implementations, and api contained simply the api. In practice the api module would require from the app module and the app module would require repositories from the infra module.

Mapping your (api,app,infra) to (api,crud,data), and positioning all as application-specific concerns, how about the following structure, where the naming yields the right alphabetical order from top/surface layer to bottom/core layer:

nmdc_runtime/app/api
nmdc_runtime/app/crud
nmdc_runtime/app/data
nmdc_runtime/frame
nmdc_runtime/lib

elais commented 2 years ago

This looks good but what is the purpose of the framework module here? Could you give a concrete example of the pattern that should go there?

elais commented 2 years ago

Another thing to be wary of is changing the folder structure will break the nmdc_runtime namespace. Does PyCharm have a way of refactoring all files within the project to reflect new folder structures with minimal effort? I can do it with some shell magic but if there's something dedicated in your IDE I'd say go ahead and create the structure sooner rather than later.

dwinston commented 2 years ago

This looks good but what is the purpose of the framework module here? Could you give a concrete example of the pattern that should go there?

Anything stateful that is not specific to the deployed (including env concerns, e.g. for dev/staging/prod) NMDC Runtime application app should go in the framework package nmdc_runtime.frame. This is so that NMDC Sites other than the Runtime site can reuse patterns for (meta)data workflow orchestration.

Example: for the following (from your proposal above),

Task Management

[ ] Create infrastructure/tasks modules task management access

[ ] Create concrete dagster access module

[ ] Move operations and graphs to dagster access module

the first two should go in frame, whereas the last item should go in app (at least, the ops/graphs that are tailored to the NMDC Runtime site -- more "general" ops should go in frame).

dwinston commented 2 years ago

Another thing to be wary of is changing the folder structure will break the nmdc_runtime namespace. Does PyCharm have a way of refactoring all files within the project to reflect new folder structures with minimal effort? I can do it with some shell magic but if there's something dedicated in your IDE I'd say go ahead and create the structure sooner rather than later.

Yes, PyCharm has "Refactor > Move File" functionality that speaks Python and so can update imports across the repo when moving a module. I can take care of this.

elais commented 2 years ago

Ok this makes a lot of sense I’m 100% behind it. On Jul 8, 2022, 06:20 -0600, Donny Winston @.***>, wrote:

This looks good but what is the purpose of the framework module here? Could you give a concrete example of the pattern that should go there? Anything stateful that is not specific to the deployed (including env concerns, e.g. for dev/staging/prod) NMDC Runtime application app should go in the framework package nmdc_runtime.frame. This is so that NMDC Sites other than the Runtime site can reuse patterns for (meta)data workflow orchestration. Example: for the following (from your proposal above), Task Management

• Create infrastructure/tasks modules task management access • Create concrete dagster access module • Move operations and graphs to dagster access module

the first two should go in frame, whereas the last item should go in app (at least, the ops/graphs that are tailored to the NMDC Runtime site -- more "general" ops should go in frame). — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

ssarrafan commented 2 years ago

Good morning @elais. Tomorrow is the last day of this sprint. Will this issue be rolling into next month? Would you like to break this issue up into smaller issues? Thanks for your help.

elais commented 2 years ago

We will be able to close this tomorrow. I won't start the merge until next week (I'm not feeling well) but the branch will have all the code pushed. On Jul 28, 2022, 10:30 -0600, ssarrafan @.***>, wrote:

Good morning @elais. Tomorrow is the last day of this sprint. Will this issue be rolling into next month? Would you like to break this issue up into smaller issues? Thanks for your help. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

ssarrafan commented 2 years ago

We will be able to close this tomorrow. I won't start the merge until next week (I'm not feeling well) but the branch will have all the code pushed. … On Jul 28, 2022, 10:30 -0600, ssarrafan @.>, wrote: Good morning @elais. Tomorrow is the last day of this sprint. Will this issue be rolling into next month? Would you like to break this issue up into smaller issues? Thanks for your help. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.>

Great news! If you don't get around to closing it tomorrow I'll move it to August so you can do the merge next week. Thanks.

elais commented 2 years ago

Merged refactor.

microbiomedata / nmdc-runtime

Refactoring Jubilee #122

Code Coverage

Data Access Layer

Databases

Task Management

CRUD

API

api/endpoints/metadata.py

api/endpoints/runs.py

Task Management