oakestra / oakestra

A Lightweight Hierarchical Orchestration Framework for Edge Computing
https://oakestra.io
Apache License 2.0

Introduce & enforce Versioning for APIs #333

Open Malyuk-A opened 1 month ago

Malyuk-A commented 1 month ago

Short

The Oakestra APIs should have strong versioning to avoid unexpected breaking changes and to make the APIs easier to extend and modify.

Proposal

Rather self-explanatory. We need to define comprehensive, documented rules for versioning and enforce them across our codebases.

Context: @giobart and I discussed the current state of our codebase and he mentioned that this is a major thing we should work on. @melkodary @TheDarkPyotr

Impact

Every component that exposes APIs, Documentation

Development time

Depends on how we split the effort. Including documentation, initial discussions for best practices to follow, etc. This can take a couple of weeks. The implementation itself is rather quick. (Figure out once, apply everywhere)

Status

Looking for discussion/feedback, best practices, and how to split the workload.

Checklist

TheDarkPyotr commented 1 month ago

Absolutely agree that addressing this issue is critical for further feature development.

I think a starting point can be defining a standard for:

  1. API Versioning: The least burdensome approaches are 1) URI versioning or 2) header versioning. In any case, a versioning policy should be defined based on:
    1. Major Versions: changes that could disrupt system-wide functionality
    2. Minor Versions: changes affecting specific component(s) or endpoints
    3. Patch Versions: fixes and minor enhancements that should not disrupt functionality. A full {major}.{minor}.{patch} scheme may be a burden, depending on how often the APIs change
  2. Standardize Error Handling: Define and document a uniform error response structure, including:
    1. Error Code: unique code identifying the error type
    2. Human-Readable Message: clear description of the error
    3. Context Information: any relevant context that helps determine the cause and reflects the internal status of the component (e.g., IP address, host configurations, SLA information, actual status vs. expected status, etc.)
    4. Error Propagation: define the extent to which an error may propagate internally before causing feature breaks or system-wide errors (this point applies both to APIs and to internal function errors). Also, if a call to an endpoint fails, the error should be visible and understandable on the involved components/endpoints on both sides (e.g., a /api/calculate/deploy/ failure reported by both the cluster scheduler and the system-manager components that handle the request)
  3. Standardize Response Format: Ensure responses are concise and include only the necessary data: avoid bloating JSON payloads by returning precisely the data required by the endpoint to minimize post-processing/extraction (e.g., in some components, the incoming JSON is extremely large but only partially used)
  4. Testing: functional testing on endpoints, even if it covers only one or a few "simple" cases, is crucial for ensuring the expected data format flowing into each component. If backward compatibility is maintained, running these tests against previous API versions should not be a problem.
  5. Documentation: Maintain a single, up-to-date source of truth that describes the guidelines for adding and versioning APIs, detailing the response format and error codes
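As a rough illustration of points 1 and 2 above, URI versioning and a uniform error envelope could be sketched as follows. This is a minimal, framework-agnostic Python sketch; the version string, error code, and field names are hypothetical examples, not the actual Oakestra format:

```python
from typing import Optional

# Hypothetical version constant; with URI versioning, bumping the major
# version means exposing a new /api/v2/... prefix alongside the old one.
API_VERSION = "v1"

def versioned_path(endpoint: str, version: str = API_VERSION) -> str:
    """Prefix an endpoint with its API version (URI versioning)."""
    return f"/api/{version}/{endpoint.strip('/')}"

def error_response(code: str, message: str,
                   context: Optional[dict] = None) -> dict:
    """Uniform error envelope: unique code, human-readable message,
    and optional context info (host, SLA, actual vs. expected status)."""
    return {
        "error": {
            "code": code,
            "message": message,
            "context": context or {},
        }
    }

# Example: the deploy failure case mentioned above, as both the cluster
# scheduler and the system-manager could report it.
err = error_response(
    "E_DEPLOY_FAILED",                      # hypothetical error code
    "deployment rejected by the scheduler",
    {"endpoint": versioned_path("calculate/deploy")},
)
```

The point of a shared helper like this is that every component emits the same envelope, so clients and tests can rely on `error.code` regardless of which component produced the failure.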

I think a way to proceed can be:

  1. review and update the current documentation, starting from clear high-level system functionalities down to each component's workflow
  2. discuss whether the proposed standard can be applied to each component one by one, or whether components should be refactored to eliminate bad practices and to understand and solve the causes of inconsistent behavior (as highlighted by #331 )

On my side, I can start reviewing the root components and applying the described proposal. Of course, this will require a bit of coordination with whoever does the same on the cluster-level side.

Any ideas/standardization proposal/way to approach the issue is more than welcome! 👍


Malyuk-A commented 1 month ago

That is a fantastic comment @TheDarkPyotr ! To be fair I think Giovanni & I were only talking about API versioning. Your comment covers a suite of necessary steps that include and go beyond versioning. I really appreciate your effort!

I totally agree with your assessment. I would use your comment as a list to generate further tickets instead of having everything in one place.

To add some more ideas to your list:

TheDarkPyotr commented 1 month ago

Thank you @Malyuk-A! 🙂 I totally agree that the proposal goes a bit beyond API versioning. Still, I think that establishing a broader "approach" (even if it is unlikely to be implemented in its entirety from the start) can clarify the path to follow for future components (in the long run, yielding a small-scale document similar to this).

Yours are really great suggestions! 🔥

Absolutely agree that time and resources are limited, so it's important to make these improvements manageable. Focusing on action-oriented points and starting simple, maybe we can consider using:

Malyuk-A commented 1 month ago

Yeah, I have become quite a big fan of Pydantic. I am using it in my FlOps extension, and it makes life a lot easier and clearer. E.g., I do not need a convoluted SLA parser setup or DB initialization process: I simply take the received user data, instantiate the Pydantic object I need (e.g., FLOpsProject.model_validate(request_data)), and later use it to automatically create my nested tables in MongoDB. Everything works out of the box, and arbitrary/custom further checks can easily be added to the class via Pydantic. So I truly understand what you are talking about. That said, just migrating Oakestra to Pydantic would already be a behemoth of a task. Let's mention this in today's (29.05.24) maintainers' meeting and decide with the team which action points we want to work on.
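For readers unfamiliar with the pattern: the `model_validate` flow described above looks roughly like this. This is a minimal sketch with hypothetical field names, not the actual FLOpsProject model (Pydantic v2 API assumed):

```python
from pydantic import BaseModel, field_validator

class FLOpsProjectSketch(BaseModel):
    """Illustrative stand-in for FLOpsProject; fields are made up."""
    customer_id: str
    verbose: bool = False  # defaults apply when the key is absent

    @field_validator("customer_id")
    @classmethod
    def non_empty(cls, value: str) -> str:
        # Custom checks hook straight into validation, replacing
        # hand-written SLA-parser-style checks.
        if not value:
            raise ValueError("customer_id must not be empty")
        return value

# Incoming user data (e.g., a parsed JSON request body) is validated
# and turned into a typed object in one call; bad data raises
# a ValidationError instead of propagating silently.
request_data = {"customer_id": "abc", "verbose": True}
project = FLOpsProjectSketch.model_validate(request_data)
```

Because the resulting object is fully typed and validated, it can be dumped (e.g., via `project.model_dump()`) straight into a document store such as MongoDB, which is the "nested tables out of the box" effect described above.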