hotosm / fmtm

Field Mapping Tasking Manager - coordinated field mapping.
https://fmtm.hotosm.org/
GNU Affero General Public License v3.0
43 stars 44 forks source link

Redesign project/task/history schemas and relationships to better align with event-driven programming #1610

Open spwoodcock opened 3 months ago

spwoodcock commented 3 months ago

Problem

The key unit of change in tasking managers is the task status / events.

Currently we have:

DB Redesign

Related to other issues in this milestone: https://github.com/hotosm/fmtm/milestone/39

task_events table

CREATE TABLE public.task_events (
    event_id UUID,
    project_id integer,
    task_id integer,
    user_id integer,
    state TaskState,
    comment text,
    created_at timestamp without time zone DEFAULT timezone('utc'::text, now()),
    PRIMARY KEY (event_id),
    FOREIGN KEY (project_id) REFERENCES public.projects (project_id),
    FOREIGN KEY (project_id, task_id) REFERENCES public.tasks (project_id, task_id)
    FOREIGN KEY (user_id) REFERENCES public.users (user_id),
);
-- See details on TaskState enum below

Note I made the event_id a UUID so it is a universal ID, across multiple databases / environments, and is more compatible with other database systems. It should probably be generated in the Python code before creating the DB object?

Note 2: adding the project_id may seem redundant, as task_id is unique, but it allows for more effective indexing. Tasks are divided up by project.

Better indexing

-- Create a better composite index for public.tasks
CREATE INDEX tasks_composite ON public.tasks USING btree (task_id, project_id);

-- Existing indexes from task_history
CREATE INDEX idx_task_event_composite ON public.task_events USING btree (task_id, project_id);
CREATE INDEX idx_task_event_project_id_user_id ON public.task_events USING btree (user_id, project_id);
CREATE INDEX idx_task_event_project_id ON public.task_events USING btree (project_id);
CREATE INDEX idx_task_event_user_id ON public.task_events USING btree (user_id);

-- Add additional index for event date
CREATE INDEX idx_task_event_created_at ON task_events USING btree (created_at);

Refactor enums

Events via the API:

class EventType(str, Enum):
    """Events that can be used via the API to update a state

    Specify the event type for ``POST`` to:
    ``/project/{pid}/event``
    or for Entities endpoints too.

    Possible values are:

    - ``map`` -- Set to *locked for mapping*, i.e. mapping in progress.
    - ``finish`` -- Set to *unlocked to validate*, i.e. is mapped.
    - ``validate`` -- Request recent task ready to be validate.
    - ``good`` -- Set the state to *unlocked done*.
    - ``bad`` -- Set the state *unlocked to map* again, to be mapped once again.
    - ``split`` -- Set the state *unlocked done* then generate additional subdivided task areas.
    - ``assign`` -- For a requester user to assign a task to another user. Set the state *locked for mapping* passing in the required user id.
    - ``comment`` -- Keep the state the same, but simply add a comment.

    Note that ``task_id`` must be specified in the endpoint too.
    """
    MAP = "map"
    FINISH = "finish"
    VALIDATE= "validate"
    GOOD = "good"
    BAD = "bad"
    SPLIT = "split"
    ASSIGN= "assign"
    COMMENT = "comment"

and the available states of an object in the db:

class State(int, Enum):
    """The state of an object (task or entity).

    The state can be:
    - ``unlocked to map``
    - ``locked for mapping``
    - ``unlocked to validate``
    - ``locked for validation``
    - ``unlocked done``
    """
    UNLOCKED_TO_MAP = 0
    LOCKED_FOR_MAPPING = 1
    UNLOCKED_TO_VALIDATE = 2
    LOCKED_FOR_VALIDATION = 3
    UNLOCKED_DONE = 4

Updating state via the API

Models

class NewEvent(BaseModel):
    """DTO for ``POST`` request to ``/project/{pid}/events`` endpoint.

    task_id must be set, as this always relates to a task.
    The EventType enum provides available options.
    """
    event: EventType
    task_id: conint(ge=1)
    requested_user_id: Optional[conint(ge=1)]
class TaskState(BaseModel):
    """DTO for ``GET`` request to ``/project/{pid}/task/history``.

    And also for ``/project/{pid}/task/{tid}/history``.
    """
    task_id: conint(ge=1)
    user_id: conint(ge=1)
    profile_img: str
    state: State
    comment: str
    date: datetime

Endpoints

Note if using a local-first paradigm, this logic could be implemented on the frontend instead.

Endpoint to update task state via events:

# modify this to use case/when!

@app.post("/project/{project_id}/event")
async def new_event(project_id: conint(ge=1), user_id=Depends(login_required), detail: NewEvent):
    if detail.event == EventType.MAP:
        updated = await _db_api().lock_for_mapping(project_id, user_id, detail.task_id)
    elif detail.event == EventType.FINISH:
        updated = await _db_api().finish(project_id, user_id, detail.task_id)
    elif detail.event == EventType.VALIDATE:
        updated = await _db_api().lock_for_validation(project_id, user_id, detail.task_id)
    elif detail.event == EventType.GOOD:
        updated = await _db_api().good(project_id, user_id, detail.task_id)
    elif detail.event == EventType.BAD:
        updated = await _db_api().bad(project_id, user_id, detail.task_id)
    elif detail.event == EventType.SPLIT:
        updated = await _db_api().split(project_id, user_id, detail.task_id)
    elif detail.event == EventType.ASSIGN:
        updated = await _db_api().assign(project_id, detail.requested_user_id, detail.task_id)
    elif detail.event == EventType.COMMENT:
        updated = await _db_api().comment(project_id, user_id, detail.task_id)
    if not updated:
        raise fastapi.HTTPException(
            status_code=400,
            detail=f"Failed to update task ({detail.task_id}) with event ({detail.event})")
    # 201 response as event is created
    return HTTPResponse(status=201)

Database logic

async def finish(pid, uid, tid) -> None:
    r = await db.sq(
        """
        WITH last AS (
            SELECT *
            FROM public.task_events
            WHERE project_id=$1 AND task_id=$2
            ORDER BY event_id DESC
            LIMIT 1),
        locked AS (
            SELECT *
            FROM last
            WHERE user_id=$3 AND state='LOCKED_FOR_MAPPING'
        )
        INSERT INTO public.task_events(project_id, task_id, user_id, event, comment)
        SELECT $1, $2, $3, 'UNLOCKED_TO_VALIDATE', 'Note: Mapping finished',
        FROM last
        WHERE user_id=$3
        RETURNING project_id, task_id, user_id
        """,
        project_id,
        task_id,
        user_id)
    assert 1 == len(r)
    assert r[0]["project_id"] == pid
    assert r[0]["task_id"] == tid
    assert r[0]["user_id"] == uid
    return None

In this flow, the task_events are a centralised data store where each event is stored as a sequence of immutable records, also called a log. Each new event is appended to the end of the list. This is the single source of truth for the task state, which can be accessed from multiple places.

manjitapandey commented 3 months ago

@spwoodcock this would really be a great to improve FMTM from TM cross learning session. Liked your idea of partitioning of task history table

manjitapandey commented 2 months ago

@spwoodcock are we still on for partitioning or we can have further discussion to reach to some strong decision of whether to partition the tables or archiving the very old data would be preferable.

spwoodcock commented 2 months ago

For TM or FMTM?

No to partitioning for both I think.

But we could archive old data for TM 👍 FMTM doesn't have any old data to archive yet!

We could maybe consider an archive ability in the UI, but I think a better option is to have export/import functionality, then archive the exported zip somewhere.

nrjadkry commented 2 months ago

@spwoodcock Do we fill a row for each task in the task_events table with the default state when we create tasks? Or else how does the query above above that you have in finish work? My guess is we pre-fill a row for each task in the task_event table when creating the tasks.

spwoodcock commented 2 months ago

I don't think we want to do that as it adds unnecessary clutter to table.

If there is no record then we know the state is ready to map.

The logic isn't fully fleshed out above, as you are right the map function will have different sql to the finish function.

The map sql should check if there is no record, or if the latest entry is UNLOCKED_TO_MAP (I.e. it was reset and a record was made)

nrjadkry commented 2 months ago

I was looking into this code by qeef https://git.sr.ht/~qeef/dd-hot-tm/tree/master/item/hot_tm_proposal/database/actions_history.py

In the create_project function , he has filled actions table with unlocked to map state for each task at the beginning. So, I was wondering if we would want to follow the same pattern

spwoodcock commented 2 months ago

I think adding that data is redundant personally.

It can probably work by adding an EXISTS clause to the logic or something instead.

But if that doesn't work then adding entries for each task could work for now

spwoodcock commented 2 months ago

@nrjadkry do you think logic like this could work?

WITH last AS (
    SELECT *
    FROM task_events
    WHERE project_id = $1 AND task_id = $2
    ORDER BY aid DESC
    LIMIT 1
),
released AS (
    SELECT COUNT(*) = 0 AS no_record
    FROM task_events
    WHERE project_id = $1 AND task_id = $2 AND event = 'RELEASED_FOR_MAPPING'
)
INSERT INTO task_events (project_id, task_id, event, comment, user_id)
SELECT $1, $2, 'LOCKED_FOR_MAPPING', 'Note: xxx', $3
FROM last, released
WHERE (last.event = 'RELEASED_FOR_MAPPING' OR released.no_record = true)
RETURNING project_id, task_id, user_id;
nrjadkry commented 2 months ago

@spwoodcock This actually works. Just had to make a small change in the query here at the last joining last and released.

            FROM last
            RIGHT JOIN released ON true
            WHERE (last.state = :unlocked_to_map_state OR released.no_record = true);