pepkit / pepdbagent

Database for storing sample metadata
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Partial PEP updates #59

Open nleroy917 opened 1 year ago

nleroy917 commented 1 year ago

To support the PATCH requests and partial updating of PEPs on pephub, it would be convenient to have a .update_pep() function. I can also just execute raw SQL if needed.

khoroshevskyi commented 1 year ago

Ideally, we should have partial updates of all the value in the column. So for the structure that we have right now, we can't change columns e.g. annotation.

Some useful information: "Ideally, you don't use JSON documents for structured, regular data that you want to manipulate inside a relational database. Use a normalized relational design instead." https://dba.stackexchange.com/questions/115825/jsonb-with-indexing-vs-hstore/115849#115849

In this case my proposition is to move keys: last_update, number_of_samples to a new columns. Otherwise we shouldn't change this values separately.

About changing the PEPs, I think we should do it using complete peppy object

p.s. partial pep updates is ready, but I think it is really a bad idea to change json values separately.

nsheff commented 1 year ago

but I think it is really a bad idea to change json values separately.

I don't think there's any problem to update JSON values separately. Can you explain why you are reluctant?

nleroy917 commented 1 year ago

Would it be easier to supply and API that is more something like this:

values_to_update = {
  "description": "This is a new PEP.",
  "is_private": False,
  "sample_table": "sample_name,file\nfrog1,frog1.txt\nfrog2,frog2.txt"
 }

for key, value in values_to_update.items():
    pepdbagent.update_project("nleroy917/basic:default", key, value)

There needs to be some logic as to whether or not the key-value pairs belong to the project or the annotation... that can be tricky I understand its not just a simple update statement.

khoroshevskyi commented 1 year ago

About project dict, it's fine that all the values (config, sample, subsample) are in one place, because it's one project. But if we are talking about annotation, e.g. last_update, It's better to create new column.

I found information that: "From a performance perspective, it hardly matters whether you change a single piece of data inside a JSON object or all of it: a new version of the row has to be written."

Additionally, if we want to update json, the SQL is more complicated, especially if we want to update few columns, or few values from one column.
Sample sql code:

UPDATE projects 
    SET digest = %s, private = %s, anno_info = jsonb_set(anno_info, '{description}', '%s', '{number_of_samples}', '%s')
    WHERE namespace = %s and name = %s and tag = %s;

Ok, we also can send separate requests to update every key individually. But in that case, it will be inefficient.


p.s. I already wrote function to update values partially, but there is still some errors...:

def update_item(self,
                update_dict: dict,
                namespace: str,
                name: str,
                tag: str) -> UploadResponse

update_dict – dict with update key->values. Dict structure: { name: Optional[str] namespace: Optional[str] tag: Optional[str] digest: Optional[str] project_value: dict private: Optional[bool] anno_info: dict }
namespace – project namespace
name – project name
tag – project tag

https://github.com/pepkit/pepdbagent/blob/6658ad4b518538611ae44bd5f50bc237a028dcd3/pepdbagent/pepdbagent.py#L307-L406

nleroy917 commented 1 year ago

@Khoroshevskyi this is great, and I can integrate this with pephub as it is. Wondering if there is another method for updating things like description or the project config file or the sample tables?

khoroshevskyi commented 5 months ago

In v0.7.0 I added sample update/add/delete functionality. So this is partial solution for this issue