usdigitalresponse / usdr-gost

USDR-hosted grants management tools
http://grants.usdigitalresponse.org
Apache License 2.0

[BUG]: `markGrantAsViewed()` fails when grant was already viewed by a user in the same agency #2104

Open TylerHendrickson opened 11 months ago

TylerHendrickson commented 11 months ago

Why is this issue important?

An error (surfaced in Datadog) causes requests to mark a grant as "viewed" to fail with a 500 status code.

Impacts of this issue:

  1. We are unable to track any statistics for grant views beyond the fact that a grant was viewed at least once by some user belonging to a particular agency.
  2. While not a direct impact, a related problem worth noting (because it affects the implementation plan) is that when a user is deleted, the deletion cascades to all grants_viewed table entries associated with that user. Because "viewed" tracking is effectively limited to per-agency granularity, an agency loses its "viewed" state for every grant that was first viewed by the deleted user.
    • In other words, when a user is deleted, the grants_viewed table entries for the user's agency are deleted for any grant that the deleted user was the first in the agency to view.

Current State

When a grant is viewed more than once by users in the same agency, the API request to track that the grant was viewed fails. The failure occurs in the following sequence:

  1. A user opens the grant details modal (either when searching, viewing assigned grants, or viewing interested grants), and the grant is registered as "viewed" for that user+agency combination.
  2. Either of the following happens (which can occur when searching, viewing assigned grants, and/or viewing interested grants):
    1. The same user views the same grant again.
    2. A different user belonging to the same agency views the grant.
  3. Indirectly related, as mentioned in the previous section: when a user is deleted, any grants for which they were the first viewer in their agency also lose their "viewed" status for the entire agency.

The cause of the issue is the composite primary key constraint on the grants_viewed table, which consists of the (agency_id, grant_id) combination, coupled with the markGrantAsViewed() function's behavior of indiscriminately inserting records without any fallback path when a primary key conflict occurs. Furthermore, the primary key fails to account for the possibility that multiple users within the same agency will view a grant, a legitimate use case that will always conflict with the primary key constraint.
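
For illustration, a minimal sketch of how the conflict surfaces (this approximates the insert behavior described above rather than the exact code in packages/server/src/db/index.js; the IDs and knexfile path are made up):

    // Approximation of the current behavior; grants_viewed has PRIMARY KEY (agency_id, grant_id).
    const knex = require('knex')(require('./knexfile')); // hypothetical knexfile location

    async function demonstrateConflict() {
        // First view by user 1 in agency 10 inserts a row and succeeds.
        await knex('grants_viewed').insert({ agency_id: 10, grant_id: 123456, user_id: 1 });

        // A second view of the same grant by a different user in the same agency conflicts
        // on (agency_id, grant_id) and throws something like:
        //   duplicate key value violates unique constraint "grants_viewed_pkey"
        // which the API surfaces as a 500.
        await knex('grants_viewed').insert({ agency_id: 10, grant_id: 123456, user_id: 2 });
    }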

Expected State

  1. Each time a grant is viewed, a record tracking views for the grant is created or updated.
  2. Deleting a user does not cause data loss for whether any other user in the same agency has viewed a particular grant.

Implementation Plan

  1. Create a series of new Knex migrations that update the composite primary key on the grants_viewed table to include the user_id column. In order to do this, the existing grants_viewed_pkey index needs to first be dropped, then recreated, with a temporary unique index to preserve the existing constraint until the new primary key is in place. The following example demonstrates these operations in raw SQL; they should be implemented accordingly using Knex (see the sketch after this list). Note: each step shown below should be executed within a separate transaction so that the migration can run without requiring downtime:
    -- 1. Create a temporary unique index CONCURRENTLY to preserve existing behavior
    create unique index concurrently idx_grants_viewed_grant_id_agency_id_unique on grants_viewed (grant_id, agency_id);
    -- 2. Drop the existing primary key
    alter table grants_viewed drop constraint grants_viewed_pkey;
    -- 3. Create a temporary new unique index for initializing desired new behavior as an index
    create unique index concurrently idx_grants_viewed_grant_id_agency_id_user_id_unique on grants_viewed (grant_id, agency_id, user_id);
    -- 4. Create the new primary key from the prepared index; this requires a table lock, but should be fast since the index is already built
    alter table grants_viewed add constraint grants_viewed_pkey primary key using index idx_grants_viewed_grant_id_agency_id_user_id_unique;
    -- 5. Drop the remaining temporary index CONCURRENTLY (the user_id index is now owned by the primary key constraint)
    drop index concurrently idx_grants_viewed_grant_id_agency_id_unique;
  2. Update the markGrantAsViewed() function so that when an insert operation encounters a conflict on the composite primary key columns, the updated_at value for the existing row is updated instead, e.g.
    const result = await knex(TABLES.grants_viewed)
        .insert({ agency_id: agencyId, grant_id: grantId, user_id: userId })
        .onConflict(['grant_id', 'agency_id', 'user_id'])
        .merge({ updated_at: 'now' })
        .where({ 'grants_viewed.agency_id': agencyId, 'grants_viewed.grant_id': grantId, 'grants_viewed.user_id': userId });
  3. Add/update unit tests to cover this scenario and help prevent future regressions.
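
As a reference for step 1, one possible Knex translation of those SQL statements (sketch only; the file name is illustrative, and the migration's wrapping transaction is disabled because CREATE/DROP INDEX CONCURRENTLY cannot run inside a transaction block, which also gives each statement its own implicit transaction):

    // migrations/20240101000000_update_grants_viewed_pkey.js (illustrative name)
    exports.config = { transaction: false };

    exports.up = async (knex) => {
        // 1. Preserve the existing uniqueness guarantee while the old primary key is dropped.
        await knex.raw('CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_unique ON grants_viewed (grant_id, agency_id)');
        // 2. Drop the existing primary key.
        await knex.raw('ALTER TABLE grants_viewed DROP CONSTRAINT grants_viewed_pkey');
        // 3. Build the index that will back the new primary key.
        await knex.raw('CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_user_id_unique ON grants_viewed (grant_id, agency_id, user_id)');
        // 4. Promote that index to the primary key (brief table lock; also marks user_id NOT NULL if it is not already).
        await knex.raw('ALTER TABLE grants_viewed ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY USING INDEX idx_grants_viewed_grant_id_agency_id_user_id_unique');
        // 5. Drop the remaining temporary index; the user_id index is now owned by the primary key constraint.
        await knex.raw('DROP INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_unique');
    };

    exports.down = async (knex) => {
        // Restore the original (agency_id, grant_id) primary key. Note: this rollback fails if
        // multiple users in an agency have viewed the same grant by the time it runs.
        await knex.raw('CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_agency_id_grant_id_unique ON grants_viewed (agency_id, grant_id)');
        await knex.raw('ALTER TABLE grants_viewed DROP CONSTRAINT grants_viewed_pkey');
        await knex.raw('ALTER TABLE grants_viewed ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY USING INDEX idx_grants_viewed_agency_id_grant_id_unique');
    };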

Relevant Code Snippets

No response

jeffsmohan commented 5 months ago

@TylerHendrickson after reading through the plan here, I had a couple clarification questions on the implementation path. Would you mind weighing in on these?

  1. What's the background on using a combination of fields as primary key instead of a more standard auto-incrementing id field? If we had a standard id integer sequence as the primary key, alongside a unique constraint index on the fields we want, it would make things like this easier to migrate. Is this an opportunity to migrate to a sequence id field as primary key for the table?

  2. If we wanted to maintain the existing primary key approach (based on the other id keys), I'd love to understand the purpose of all the steps you laid out. Maybe as a way to spur the conversation, I'll lay out what I think is a simpler approach to achieve the same outcome, and you can tell me if there are steps or angles I'm missing.

    CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_user_id_unique 
     ON grants_viewed (grant_id, agency_id, user_id);
    ALTER TABLE grants_viewed 
     DROP CONSTRAINT grants_viewed_pkey,
     ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY USING INDEX idx_grants_viewed_grant_id_agency_id_user_id_unique;

    This could all be run as a single knex migration, as the first command would run concurrently until complete, then the second command switches the primary key over in a single atomic command. Fwiw, I ran this locally on my dev setup, and it resulted in the table state we're looking for.

  3. I believe all columns that make up a primary key need to be non-null, but user_id is null right now. My understanding from the docs (and from running the above on my devbox, it checks out) is that the ADD CONSTRAINT automatically converts the fields to non-null. This should be the desired behavior, right? (And we'll need to double-check that the field doesn't contain any nulls.)

  4. In addition to the tasks you listed (write the migration, and update the markGrantAsViewed function) I believe we'll need to also update query logic in places to accommodate the potential for multiple rows returned per grant/agency combo. For example, I think the logic here could return duplicate viewedByAgency records if we don't update it: https://github.com/usdigitalresponse/usdr-gost/blob/d6f0e0e5d1bd8549624467c477b776d412b44328/packages/server/src/db/index.js#L1015-L1020. I'm happy to look through all the queries that touch the grants_viewed table, but wanted to make sure you concur here.
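
    For example, something with this general shape (a sketch, not the actual query at that link; grantIds is a placeholder) would need a distinct/grouping step once multiple rows per grant/agency pair are possible:

        // Hypothetical "viewed by agency" lookup; without DISTINCT, two users from the same
        // agency viewing the same grant would now yield duplicate (grant_id, agency_id) rows.
        const viewedByAgencies = await knex('grants_viewed')
            .join('agencies', 'agencies.id', 'grants_viewed.agency_id')
            .whereIn('grants_viewed.grant_id', grantIds) // grantIds: placeholder list of grant IDs
            .distinct('grants_viewed.grant_id', 'grants_viewed.agency_id', 'agencies.name as agency_name');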

TylerHendrickson commented 5 months ago

@jeffsmohan Responses to your questions!

  1. What's the background on using a combination of fields as primary key instead of a more standard auto-incrementing id field? If we had a standard id integer sequence as the primary key, alongside a unique constraint index on the fields we want, it would make things like this easier to migrate. Is this an opportunity to migrate to a sequence id field as primary key for the table?

I'm not really sure about the rationale behind using the compound index on (agency_id, grant_id) as a PK rather than a separate unique index/constraint; my guess is that this is an arbitrary implementation detail chosen by the author who originally introduced that table.

I don't think I'm grasping what might be enabled or made easier by switching to a serial PK, but I certainly have no opposition to doing so :)

  2. If we wanted to maintain the existing primary key approach (based on the other id keys), I'd love to understand the purpose of all the steps you laid out. Maybe as a way to spur the conversation, I'll lay out what I think is a simpler approach to achieve the same outcome, and you can tell me if there are steps or angles I'm missing.

    CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_user_id_unique 
     ON grants_viewed (grant_id, agency_id, user_id);
    ALTER TABLE grants_viewed 
     DROP CONSTRAINT grants_viewed_pkey,
     ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY USING INDEX idx_grants_viewed_grant_id_agency_id_user_id_unique;

    This could all be run as a single knex migration, as the first command would run concurrently until complete, then the second command switches the primary key over in a single atomic command. Fwiw, I ran this locally on my dev setup, and it resulted in the table state we're looking for.

Agreed that your implementation is much more straightforward! It's been a minute since I thought about the implementation details, but iirc the series of migrations I laid out was intended to minimize the impact of acquiring a full-table lock when the transaction is committed. The alter table ... add constraint ... primary key operation will always incur a full-table lock, but doing it in a transaction that's isolated from building the index ensures that setting the new PK is as close to a config-only change as possible – it doesn't require any scans or index-building at the moment of applying the new PK, since the indexes are already fully up-to-date (and they also ensure that no illegal data was inserted along the way).

I believe that creating the new index and setting the table's PK to that new index in a single transaction will exclusively lock the table while Postgres scans it to build the new index and ensure no preexisting rows are in conflict. Granted, our traffic patterns are low enough that this probably won't cause a noticeable impact for end-users, so maybe the incremental approach isn't worth the effort (admittedly, this is probably just a force of habit in how I tend to plan out migrations that could be disruptive).

  3. I believe all columns that make up a primary key need to be non-null, but user_id is null right now. My understanding from the docs (and from running the above on my devbox, it checks out) is that the ADD CONSTRAINT automatically converts the fields to non-null. This should be the desired behavior, right? (And we'll need to double-check that the field doesn't contain any nulls.)

That's a good call-out. I just checked and can confirm that (at least right now) there are no null user_id values in the grants_viewed table in either Staging or Production.

  4. In addition to the tasks you listed (write the migration, and update the markGrantAsViewed function) I believe we'll need to also update query logic in places to accommodate the potential for multiple rows returned per grant/agency combo. For example, I think the logic here could return duplicate viewedByAgency records if we don't update it: https://github.com/usdigitalresponse/usdr-gost/blob/d6f0e0e5d1bd8549624467c477b776d412b44328/packages/server/src/db/index.js#L1015-L1020. I'm happy to look through all the queries that touch the grants_viewed table, but wanted to make sure you concur here.

Good catch, agreed!

jeffsmohan commented 5 months ago

I don't think I'm grasping what might be enabled or made easier by switching to a serial PK, but I certainly have no opposition to doing so :)

Sounds good, I'm not sure if it's worth migrating at this point, but I wanted to understand as best I could any context before we messed with the primary key here. As for why I've generally seen "every table gets an auto-incrementing id column for PK" as the generic advice (even when the table has a "natural" compound PK in its own data):

  1. PKs are particularly difficult to alter (as seen in this implementation)
  2. FKs to compound PKs are especially awkward (it requires multiple columns in the referencing table, plus a pretty verbose query syntax, not to mention additional pain if we already had a FK reference that needed to migrate alongside this PK migration to add a field)
  3. We could conceivably get in trouble if something we thought was a unique identifier turns out not to be (e.g., while unlikely, if grants.gov one day publishes a second grant with the same grant ID, we wouldn't handle that correctly)

Anyway, not necessarily worth correcting in this issue. Like I said, just wanted to make sure I wasn't missing some relevant context.

I believe that creating the new index and setting the table's PK to that new index in a single transaction will exclusively lock the table while it Postgres scans it to build the new index and ensure no preexisting rows are in conflict.

I may have been unclear in my proposal. I would want to run the two psql commands (CREATE UNIQUE INDEX CONCURRENTLY ... and ALTER TABLE ...) in two separate transactions. I believe we could even still structure this as a single knex migration (by default, knex migrations are wrapped in a transaction, but there's an option to disable that).

  1. The first sql statement creates the index concurrently in the background, without locking the table.
  2. The second sql statement switches over to the newly created index for the PK in an atomic, locking transaction, but as you say, it should be very quick since the index is already built.
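
Concretely, what I'm picturing is a single migration file shaped roughly like this (a sketch, relying on knex's per-migration config.transaction = false option):

    // Single knex migration; the wrapping transaction is disabled so that
    // CREATE UNIQUE INDEX CONCURRENTLY is allowed to run.
    exports.config = { transaction: false };

    exports.up = async (knex) => {
        // Statement 1: builds the index in the background without locking the table.
        await knex.raw('CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_user_id_unique ON grants_viewed (grant_id, agency_id, user_id)');
        // Statement 2: atomically swaps the primary key over to the prebuilt index.
        await knex.raw('ALTER TABLE grants_viewed DROP CONSTRAINT grants_viewed_pkey, ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY USING INDEX idx_grants_viewed_grant_id_agency_id_user_id_unique');
    };

    exports.down = async () => { /* rollback omitted from this sketch */ };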

Or am I missing something about that interaction that would cause a longer-lasting full table lock? (Regardless, I do think we're agreed that with the scale of data we're talking about, I don't expect that table lock for the index to last more than a few seconds, but it's good practice to get this right if we can.)

@TylerHendrickson what do you think?

TylerHendrickson commented 5 months ago

@jeffsmohan TL;DR Consider me on-board with the proposals outlined in https://github.com/usdigitalresponse/usdr-gost/issues/2104#issuecomment-2052349607.

Sounds good, I'm not sure if it's worth migrating at this point, but I wanted to understand as best I could any context before we messed with the primary key here. As for why I've generally seen "every table gets an auto-incrementing id column for PK" as the generic advice (even when the table has a "natural" compound PK in its own data): ...

The reasons you provided seem compelling enough to me that I think any work that involves updating this table's schema definition should probably include migrating to a surrogate PK instead of continuing to rely on the compound PK.

I may have been unclear in my proposal. I would want to run the two psql commands (CREATE UNIQUE INDEX CONCURRENTLY ... and ALTER TABLE ...) in two separate transactions. I believe we could even still structure this as a single knex migration (by default, knex migrations are wrapped in a transaction, but there's an option to disable that).

  1. The first sql statement creates the index concurrently in the background, without locking the table.
  2. The second sql statement switches over to the newly created index for the PK in an atomic, locking transaction, but as you say, it should be very quick since the index is already built.

Or am I missing something about that interaction that would cause a longer-lasting full table lock? (Regardless, I do think we're agreed that with the scale of data we're talking about, I don't expect that table lock for the index to last more than a few seconds, but it's good practice to get this right if we can.)

Thanks for clarifying this. As long as the two DDL statements can be run in separate transactions, I think that should be fine – given that the second statement ALTER TABLE ... DROP CONSTRAINT grants_viewed_pkey, ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY ... can be run atomically, I think that effectively guards against corruption due to data changes in between schema migrations, which is what I was looking to prevent using the "hobbled" migration strategy I outlined originally. If it can be done in one fell swoop (and it seems it can), then I agree that's the way to go!
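
To sketch how the surrogate-PK piece could fold into that same migration (untested; the surrogate column and constraint names are illustrative):

    exports.config = { transaction: false };

    exports.up = async (knex) => {
        // Build the per-user uniqueness guarantee in the background, without blocking writes.
        await knex.raw('CREATE UNIQUE INDEX CONCURRENTLY idx_grants_viewed_grant_id_agency_id_user_id_unique ON grants_viewed (grant_id, agency_id, user_id)');
        // Add the surrogate key; BIGSERIAL back-fills existing rows and is NOT NULL by default
        // (this rewrites the table, which should be acceptable at this table's size).
        await knex.raw('ALTER TABLE grants_viewed ADD COLUMN id BIGSERIAL');
        // Swap the primary key to the surrogate column and keep the composite uniqueness as a
        // named constraint backed by the already-built index.
        await knex.raw(`ALTER TABLE grants_viewed
            DROP CONSTRAINT grants_viewed_pkey,
            ADD CONSTRAINT grants_viewed_pkey PRIMARY KEY (id),
            ADD CONSTRAINT uq_grants_viewed_grant_agency_user UNIQUE USING INDEX idx_grants_viewed_grant_id_agency_id_user_id_unique`);
    };

    exports.down = async () => { /* rollback omitted from this sketch */ };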

ClaireValdivia commented 4 months ago

not sure how to test this one in staging...

jeffsmohan commented 4 months ago

@ClaireValdivia Yeah, this one would be tricky to fully test on staging. (You'd have to log in as multiple users from the same organization, then view the same grant, then ensure that didn't trigger an error.) Personally, I think it's sufficient QA testing to:

  1. Verify there's no regression in behavior (i.e., a user can still view and interact with a grant page successfully)
  2. Keep an eye on any error reports from DataDog, and we should no longer expect to see the errors that spurred this PR