portagenetwork / roadmap

Developed by the the Alliance in collaboration with University of Alberta, DMP Assistant a data management planning tool, forking the DMP Roadmap codebase
MIT License
6 stars 1 forks source link

Address Existing and Prevent Future Duplicate (plan_id, question_id) Entries Within Answers Table #800

Open aaronskiba opened 1 week ago

aaronskiba commented 1 week ago

Please complete the following fields as applicable:

What version of the DMPRoadmap code are you running? (e.g. v2.2.0)

Expected behaviour: The following code within app/models/answer.rb is meant to prevent duplicate (plan_id, question_id) entries from existing within the answers table.

  validates :question, presence: { message: PRESENCE_MESSAGE },
                       uniqueness: { message: UNIQUENESS_MESSAGE,
                                     scope: :plan_id }

Actual behaviour: The following SQL query was performed on a recent database dump from the production environment. It reveals that there are in fact duplicate (plan_id, question_id) entries existing within the answers table.

SELECT COUNT(*)
FROM answers
JOIN (
    SELECT plan_id, question_id
    FROM answers
    GROUP BY plan_id, question_id
    HAVING COUNT(*) > 1
) AS duplicates
ON answers.plan_id = duplicates.plan_id AND answers.question_id = duplicates.question_id;
----------------------------
22

Diagnosis

SELECT COUNT(id) 
FROM answers;
---------------------------
67327

There are 11 (22/2) answers with "duplicates" in the answers table. So only 0.00016% (11/67327) of the answers entries are affected.

This table illustrates how close the created_at values are between the answers and their duplicates (max = 0.014105 seconds).

id duplicate_id plan_id question_id created_at duplicate_created_at time_difference_seconds
26035 26036 6914 67418 2021-04-02 19:54:36 2021-04-02 19:54:36 0
34772 34773 8424 68834 2021-12-17 23:01:46 2021-12-17 23:01:46 0
41114 41115 9347 69714 2022-06-04 02:52:42 2022-06-04 02:52:42 0
41124 41125 9347 69723 2022-06-04 04:46:48 2022-06-04 04:46:48 0
49172 49173 10912 68316 2023-03-22 15:26:56.835211 2023-03-22 15:26:56.842863 0.007652
49711 49712 11005 74820 2023-04-03 07:10:32.197305 2023-04-03 07:10:32.205013 0.007708
52035 52036 11366 68127 2023-05-23 19:21:09.952598 2023-05-23 19:21:09.964106 0.011508
54484 54485 11785 72244 2023-08-21 14:01:49.753142 2023-08-21 14:01:49.767247 0.014105
55605 55606 12007 69720 2023-09-20 00:20:59.242693 2023-09-20 00:20:59.24843 0.005737
60793 60794 12632 75680 2023-11-27 03:02:59.964372 2023-11-27 03:02:59.970709 0.006337
67389 67390 13558 72638 2024-05-08 23:18:26.696402 2024-05-08 23:18:26.700717 0.004315

We can see there is a very small percentage of duplicates, and the created_at difference is very small between the answers and their corresponding duplicates. Given all of this, it seems quite possible that we're sometimes encountering a race condition which causes the question uniqueness validation check to sometimes "fail". When two records are being saved nearly simultaneously, the validation check might pass for both before either is committed to the database, leading to the insertion of duplicate entries.

This seems even more likely when we factor in the 'auto-save' feature being used for answers. If a manual save was executed and an auto-save was triggered at nearly the same moment, then the aforementioned scenario could occur.

Perhaps a migration should be created to enforce the uniqueness constraint at the db level? (Along with the migration, we would also need a script to delete one out of every pair of duplicate answers currently existing within the answers table.)

aaronskiba commented 1 week ago

There are several bugs that stem from these duplicate entries:

  1. User is unable to "Copy" a plan when it has these duplicate answers. (ActiveRecord::RecordInvalid: Validation failed: Question must be unique is raised)
  2. User is unable to Edit an answer when it has a duplicate (ActiveRecord::RecordInvalid Exception: Validation failed: Question must be unique)
  3. The corresponding Plan.section will count the answer plus it's duplicate when outputting how many questions have been answered Screenshot from 2024-06-07 13-54-17