Change database model for job data

smathot commented 7 months ago

Right now job data is stored such that one entry corresponds to one job and one participant. The actual data is stored as a JSON in a text field. This is problematic for two reasons:

If a job is repeated often, this results in very large JSON text blob that contains the combined results from all repetitions. This results in the database crashing, out-of-sort memory errors, and other symptoms that we've discussed elsewhere. (#152 , #150, #92 )
If an experiment is removed or changed, or if a participant is removed, the data is removed as well. This is undesirable because the data should remain available even after such changes.

To fix this, the database model should be refactored such that multiple runs of the same job are stored as separate entries, and that these entries remain accessible when linked experiments / participants are removed.

dschreij commented 7 months ago

@smathot @nclaidiere I finally found some time and started working on this. I think the new proposed way of storing and retrieving data will be an improvement overall. There is only one caveat:

and that these entries remain accessible when linked experiments / participants are removed.

I need to relationally link the data to at least one of these database records, and I have chosen studies. This implies though that if a study is deleted the stored data will go with it. For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.

@nclaidiere I remember you sent me this a while ago but I cannot find it anymore, but can you send me an export or dump of your current database? I want to convert the old data tabel structure to the new one in a migration, but I want to see if this works well with the production data, to be really sure nothing will break.

smathot commented 7 months ago

@dschreij Thanks for picking this up. Can you hold this thought for a few days though? @nclaidiere @kristian-lange and I are having a meeting next Wednesday to discuss the next steps for OpenMonkeyMind, and this one of the things that's on the agenda. I'll get back to you!

Loic-Bonnier commented 7 months ago

Hello, I'm one of the IT in the Laboratory of @nclaidiere. I did some investigation, and relative to some experiment's design, sometimes there is an UPDATE's query in the table "job_states". I understand this table has all the job and sometimes there are one, two or prehaps three json data in the same row. From my point view this the problem we are looking for. I will try to correct that with the rewrite plugin of mysql but it's not in the Docker image use by omm-server.

dschreij commented 7 months ago

Hi @Loic-Bonnier, you are right in your assessment. I have already started with decoupling the job data with the job_states as you can see in the commits that are attached to this issue. Feel free to inspect them and improve them.

The docker image of omm-server only contains the production build of the package to keep the image size lean, so it is likely libraries only used for development are missing.

kristian-lange commented 4 months ago

@smathot I'm looking at the database schema and stumbled over certain tables that have only one column: dtypes, jobstatuses, _accesspermission, _usertypes, participationstatuses. They all have the same columns: an id and a name. I fail to grasp the purpose of them - why not just incorporate them into the table where they have relation to?

E.g. Instead of having two tables for the job status

one could have just one table

and this would reduce the complexity of our schema, makes it easier to handle and develop. Or do I miss some purpose of them?

dschreij commented 4 months ago

Allow me to chime in. This was a deliberate design choice (and a discussion that pops up once in a while again). But I am from the camp that believes that enums are evil 👀. See this article that explains nicely why http://komlenic.com/244/8-reasons-why-mysqls-enum-data-type-is-evil/ Especially point 4 has often bitten me, as has point 2 when you need to change something to the member list. Met vriendelijke groet / with kind regards

Daniël Schreij

On Fri, Jul 12, 2024, 14:54 Kristian Lange @.***> wrote:

@smathot https://github.com/smathot I'm looking at the database schema and stumbled over certain tables that have only one column: dtypes, jobstatuses, access_permission, user_types, participationstatuses. They all have the same columns: an id and a name. I fail to grasp the purpose of them - why not just incorporate them into the table where they have relation to?

E.g. Instead of having two tables for the job status

image.png (view on web) https://github.com/user-attachments/assets/a6f38f99-eb52-472f-9456-980e9f2ed802

one could have just one table

image.png (view on web) https://github.com/user-attachments/assets/aa94d2a8-381c-41a2-b0d2-fe79fd125eb0

and this would reduce the complexity of our schema, makes it easier to handle and develop. Or do I miss some purpose of them?

— Reply to this email directly, view it on GitHub https://github.com/open-cogsci/omm-server/issues/155#issuecomment-2225518134, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFIHFE455DYN4HVFYQAYMLZL7GZLAVCNFSM6AAAAABF3EBPOOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRVGUYTQMJTGQ . You are receiving this because you were mentioned.Message ID: @.***>

kristian-lange commented 4 months ago

@dschreij Hi Daniël, nice you chimed in :). I wasn't aware of this apparent war, people on one side supporting enum types and people on the other side supporting look-up tables. I belong to a third type of people, admittedly a minory, that argue one could just put your status/type data in a simple varchar or tinyint and keep it simple without an extra table or enum.

Advantages

simple schema: no enum type, no extra look-up table
simple queries: no table joins
easy to read (everyone understands "pending" - not everyone "1")

Disadvantages

varchar uses (a bit) more disk space
people need to be disciplined and actually only using the appropriate values
renaming of such status/type data means to go through every row in the table and update the value

For a small project (like OMM) I'm happy to live with the disadvantages and enjoy the advantages.

E.g.

I think this

await ptcp.jobs()
        .where('study_id', study.id)
        .whereInPivot('status', ['pending', 'started'])
        .withPivot(['status'])
        .orderBy('pivot_status', 'desc')
        .orderBy('position', 'asc')
        .firstOrFail()

is easier to read than this

await ptcp.jobs()
        .where('study_id', study.id)
        .whereInPivot('status_id', [1, 2])
        .withPivot(['status_id'])
        .with('variables.dtype')
        .orderBy('pivot_status_id', 'desc')
        .orderBy('position', 'asc')
        .firstOrFail()

.

smathot commented 4 months ago

The main disadvantage of using a reference table that I can see is that this may require an additional join and thus reduce performance. But this may not actually happen in practice, because we are generally not interested in the text description of these fields, and thus we are likely not actually joining the reference tables in our queries. (Or am I wrong about that? If we are regularly joining the reference tables, then that's a performance issue.)

@kristian-lange You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table. The varchar approach I don't like because it makes the table bigger and potentially less performant.

In other words: @kristian-lange, restructure as you see fit while considering that:

There is no need to change the database structure unless there is a clear reason for it. The main reason would be to reduce the number of operations and joins to increase performance.
We don't want to use varchar columns for enum-like data.
There may not be a fundamental difference between your tinyint approach and @dschreij 's reference tables.

Is that helpful?

kristian-lange commented 4 months ago

I'm still trying to get familiar with the database schema and where it could potentially be improved to solve some problems we actually face. For me this discussion about the look-up table / reference table is more like an exercise before tackling the real problems. I agree with @smathot that there is no need to change the schema regarding this.

You mention using a simple tinyint instead. This is actually no different from using a reference table, except that in that case the tinyint also serves as an index to the reference table.

I actually meant only an tinyint and no lookup table. The meaning of the different int values would have been coded in the application. It's similar to what we are doing currently. We seldom actually look up the meaning in the look-up table and we do the database queries with the id, e.g. .whereInPivot('status_id', [1, 2]).

kristian-lange commented 4 months ago

Hi @dschreij, could you please help me to understand why you have chosen a certain way. You wrote:

I need to relationally link the data to at least one of these database records, and I have chosen studies.

And from your commit (#4accc39) I can see that you planed a table _studydata that both links the tables studies and participants.

For me the natural choice to attach the table that contains the job's result data would be to link it to the _jobstates table where the job's result data are currently stored (in field data of type json). Why did you choose to link it to studies and participants.

For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.

Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data? What do you mean with 'embed' and 'data that is stored for the job'? Do you mean the jobs table (But this table does not have a data field)?

And I have two partially related questions:

Why does the table _jobstates not have an id? Was it a deliberate choice? What were the reasons?
Why is the _participantid sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. _jobstates or participations)?

dschreij commented 4 months ago

Hi Kristian, Why did you choose to link it to studies and participants. The main reason to no longer attach job results to the job_states table is that there is a (new) requirement that the data should be retained if the job table is wiped. Currently, if one wants to upload a new job table file, the old jobs are deleted first, along with their data, before the new job table is composed from the uploaded file. To mitigate deleting all result data, I had to decouple it from jobs and tie it to the study instead. Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data? Yes, if you view a participant record you will see one can specify extra data in yaml that pertain to the participant, like gender, age, etc. These need to be stored with each trial/job result, so they are exported for each job data row when the data is saved as excel.

Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons?

This is a pivot table, right? Those never really need ids as they have a composite primary key consisting of the participant ID and job ID if I remember correctly. Any combination of these are always unique. You will rarely use ids from a pivot table to retrieve data, as they are only used to be able to join 2 many-many related tables.

Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)

I can't clearly recall, but I think in the session table, the link to the participant is made using its identifier string, and not its ID. This is because the omm client needed to be able to manipulate the session data and using the identifier is easier that using the id for that case. I should have chosen a better name for this participant_id in the session table, so this is my bad

On Sun, Jul 14, 2024, 23:31 Kristian Lange @.**@.>> wrote:

Hi @dschreijhttps://github.com/dschreij, could you please help me to understand why you have chosen a certain way. You wrote:

I need to relationally link the data to at least one of these database records, and I have chosen studies.

And from your commit (#4accc39) I can see that you planed a table study_data that both links the tables studies and participants.

image.png (view on web)https://github.com/user-attachments/assets/87abd619-5506-4d7f-aa73-1bdb977fc152

For me the natural choice to attach the table that contains the job's result data would be to link it to the job_states table where the job's result data are currently stored (in field data of type json). Why did you choose to link it to studies and participants.

For participants, I have chosen to embed their data in the data that is stored for the job, so participants can be safely deleted but their data will remain.

Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data? What do you mean with 'embed' and 'data that is stored for the job'? Do you mean the jobs table (But this table does not have a data field)?

And I have two partially related questions:

Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons?
Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)?

— Reply to this email directly, view it on GitHubhttps://github.com/open-cogsci/omm-server/issues/155#issuecomment-2227488731, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAFIHFB2WYPQXMQUBR6TRWTZMLU3XAVCNFSM6AAAAABF3EBPOOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRXGQ4DQNZTGE. You are receiving this because you were mentioned.

kristian-lange commented 4 months ago

Just to add some formatting to Daniel's answer ;).

Hi Kristian,

Why did you choose to link it to studies and participants.

The main reason to no longer attach job results to the job_states table is that there is a (new) requirement that the data should be retained if the job table is wiped. Currently, if one wants to upload a new job table file, the old jobs are deleted first, along with their data, before the new job table is composed from the uploaded file. To mitigate deleting all result data, I had to decouple it from jobs and tie it to the study instead.

Can you please elaborate on this? Are their data for participants that have to be stored additionally? Are those data different from the job's result data?

Yes, if you view a participant record you will see one can specify extra data in yaml that pertain to the participant, like gender, age, etc. These need to be stored with each trial/job result, so they are exported for each job data row when the data is saved as excel.

Why does the table job_states not have an id? Was it a deliberate choice? What were the reasons? This is a pivot table, right?

Those never really need ids as they have a composite primary key consisting of the participant ID and job ID if I remember correctly. Any combination of these are always unique. You will rarely use ids from a pivot table to retrieve data, as they are only used to be able to join 2 many-many related tables.

Why is the participant_id sometimes stored as a varchar (e.g. table session) and sometimes as an int (e.g. job_states or participations)

I can't clearly recall, but I think in the session table, the link to the participant is made using its identifier string, and not its ID. This is because the omm client needed to be able to manipulate the session data and using the identifier is easier that using the id for that case. I should have chosen a better name for this participant_id in the session table, so this is my bad

smathot commented 1 week ago

@kristian-lange Can this issue be closed now? Or are there other database changes that still need to be implemented?

kristian-lange commented 1 week ago

@smathot We still have to merge the branch database_seeder into master. I made a PR for this: https://github.com/open-cogsci/omm-server/pull/186

open-cogsci / omm-server

Change database model for job data #155