pepkit / pepdbagent

Database for storing sample metadata
BSD 2-Clause "Simplified" License

Additional functionalities/columns/tables #2

Closed khoroshevskyi closed 1 year ago

khoroshevskyi commented 2 years ago

Please comment on this issue, or edit it, if you have any comments or suggestions about which columns or tables should be added to the database, or which functionality should be added to pepagent. @nsheff @nleroy917

nleroy917 commented 2 years ago

For pephub, the only columns that are necessary are the namespace and project_id. All else is stored in the object representation of the PEP, I believe.

khoroshevskyi commented 2 years ago

I just added a namespace column. The database now has: id, project_name, project_value, description, namespace, n_samples_project

For pephub, the only columns that are necessary are the namespace and project_id. All else is stored in the object representation of the PEP, I believe.

nsheff commented 2 years ago

maybe version? do you want to allow pephub to host different versions/tags of the same project?

nleroy917 commented 2 years ago

Even the description/n_samples_project columns might not be necessary... That will most likely be held in the actual PEP object.

nsheff commented 2 years ago

The database now has: id, project_name, project_value, description, namespace, n_samples_project

probably good to order the columns logically (put namespace before project_name)

nsheff commented 2 years ago

Even the description/n_samples_project columns might not be necessary... That should be held in the actual PEP object.

Might you want some of this information stored in the database for easier access, without having to parse or read the whole project, though?

nsheff commented 2 years ago

What I might do is just have a JSON column where we would put any cached annotation information, like the number of samples. That way, we can be flexible and add additional information there if it becomes relevant, without having to change the table schema.

khoroshevskyi commented 2 years ago

Even the description/n_samples_project columns might not be necessary... That will most likely be held in the actual PEP object.

Might you want some of this information stored in the database for easier access, without having to parse or read the whole project, though?

You can find this information in the JSON PEP object, but if we store the description and number of samples separately, the data will be easier to access.

khoroshevskyi commented 2 years ago

What I might do is just have a JSON column where we would put any cached annotation information, like the number of samples. That way, we can be flexible and add additional information there if it becomes relevant, without having to change the table schema.

I think it's a good idea! So the new column will be anno_info.
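
A rough sketch of what such a table could look like, with the annotation cached at insert time and read back without parsing the project. Everything here is an assumption for illustration: SQLite stands in for the real database, the helper names `add_project` and `get_annotation` are hypothetical, and the toy PEP is a plain dict with assumed `description` and `samples` keys rather than a real peppy object.

```python
import json
import sqlite3

# Hypothetical schema: SQLite stands in for the real database, and the
# column order puts namespace before project_name as suggested above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        namespace TEXT NOT NULL,
        project_name TEXT NOT NULL,
        project_value TEXT NOT NULL,  -- the whole PEP serialized as JSON
        anno_info TEXT NOT NULL       -- cached annotations serialized as JSON
    )
    """
)


def add_project(namespace, project_name, pep):
    """Store a PEP and cache a few derived values alongside it."""
    # The keys "description" and "samples" are assumed for this toy PEP dict.
    anno_info = {
        "description": pep.get("description", ""),
        "n_samples": len(pep.get("samples", [])),
    }
    conn.execute(
        "INSERT INTO projects (namespace, project_name, project_value, anno_info) "
        "VALUES (?, ?, ?, ?)",
        (namespace, project_name, json.dumps(pep), json.dumps(anno_info)),
    )


def get_annotation(namespace, project_name):
    """Read only the cached annotations; project_value is never parsed."""
    row = conn.execute(
        "SELECT anno_info FROM projects WHERE namespace = ? AND project_name = ?",
        (namespace, project_name),
    ).fetchone()
    return json.loads(row[0]) if row else None


toy_pep = {
    "description": "demo project",
    "samples": [{"sample_name": "a"}, {"sample_name": "b"}],
}
add_project("nfcore", "demoRNA", toy_pep)
annotation = get_annotation("nfcore", "demoRNA")
```

New cached fields can be added to the `anno_info` dict later without altering the table definition, which is the flexibility argued for above.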

nleroy917 commented 2 years ago

Might you want some of this information stored in the database for easier access, without having to parse or read the whole project, though?

That was my gut reaction, but I thought it might be redundant if it's already there in the PEP, e.g.:

# native python
p = projectDB.get_project("nfcore/demoRNA")
n_samples = len(p.samples)

# within pepdb
p = projectDB.get_project("nfcore/demoRNA")
n_samples = p.n_samples # this isn't native in the peppy API

I guess my thought was that a database solves the speed issue, since we are no longer reading from disk and loading .yaml files, so indexing basic info and stats on a repository of PEPs is no longer an issue.

khoroshevskyi commented 2 years ago

Might you want some of this information stored in the database for easier access, without having to parse or read the whole project, though?

That was my gut reaction, but I thought it might be redundant if it's already there in the PEP, e.g.:

# native python
p = projectDB.get_project("nfcore/demoRNA")
n_samples = len(p.samples)

# within pepdb
p = projectDB.get_project("nfcore/demoRNA")
n_samples = p.n_samples # this isn't native in the peppy API

I guess my thought was that a database solves the speed issue, since we are no longer reading from disk and loading .yaml files, so indexing basic info and stats on a repository of PEPs is no longer an issue.

But then we would have to instantiate the peppy object. We can avoid that unnecessary processing by just storing this information in the database.

nleroy917 commented 2 years ago

Maybe we could have another table. Table 1 stores registry path (#3) and PEP as JSON. Table 2 stores registry path and then annotated info - n_samples, description, etc.

khoroshevskyi commented 2 years ago

Maybe we could have another table. Table 1 stores registry path (#3) and PEP as JSON. Table 2 stores registry path and then annotated info - n_samples, description, etc.

Doesn't that just add complexity to the database? Each row would point to exactly one row in the second table, so I don't see the benefit (if we are talking about tables).
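
To make the one-to-one point concrete, here is a rough sketch of the two-table layout. SQLite stands in for the real database, and the table and column names are made up for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE projects (
        registry_path TEXT PRIMARY KEY,
        project_value TEXT NOT NULL
    );
    -- The primary key on registry_path allows at most one annotation row
    -- per project, so the relationship is strictly one-to-one.
    CREATE TABLE annotations (
        registry_path TEXT PRIMARY KEY REFERENCES projects (registry_path),
        n_samples INTEGER,
        description TEXT
    );
    """
)
conn.execute("INSERT INTO projects VALUES (?, ?)", ("nfcore/demoRNA", "{}"))
conn.execute(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    ("nfcore/demoRNA", 2, "demo project"),
)

# Any listing that needs the annotations now requires a join, even though
# each project matches exactly one annotation row.
rows = conn.execute(
    """
    SELECT p.registry_path, a.n_samples, a.description
    FROM projects AS p
    JOIN annotations AS a ON a.registry_path = p.registry_path
    """
).fetchall()
```

With a single table the same listing would be a plain `SELECT` with no join, which is the complexity being questioned here.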

nleroy917 commented 2 years ago

Doesn't that just add complexity to the database? Each row would point to exactly one row in the second table, so I don't see the benefit (if we are talking about tables).

That's true, and a good point. I just feel that if we add another column with an annotation object, we are beginning to duplicate the already-created PEP JSON object.

nsheff commented 2 years ago

That's true, and a good point. I just feel that if we add another column with an annotation object, we are beginning to duplicate the already-created PEP JSON object.

The difference is that this annotation object is for database use. It's not part of the PEP.

Another example of something you might store in such an object is the number of times a particular PEP has been requested. So, it's kind of metadata about the PEP :exploding_head:. To me it's separate from the PEP, and another column seems like a good choice.
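
As a sketch of that request-count idea: the counter lives in the annotation column and the stored PEP is never touched. This assumes SQLite as a stand-in database, and the `n_requests` key and `record_request` helper are hypothetical names, not part of any existing API.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects ("
    "registry_path TEXT PRIMARY KEY, project_value TEXT, anno_info TEXT)"
)
conn.execute(
    "INSERT INTO projects VALUES (?, ?, ?)",
    ("nfcore/demoRNA", json.dumps({"samples": []}), json.dumps({"n_samples": 0})),
)


def record_request(registry_path):
    """Increment a hypothetical n_requests counter stored in anno_info."""
    row = conn.execute(
        "SELECT anno_info FROM projects WHERE registry_path = ?",
        (registry_path,),
    ).fetchone()
    if row is None:
        raise KeyError(registry_path)
    anno_info = json.loads(row[0])
    anno_info["n_requests"] = anno_info.get("n_requests", 0) + 1
    # Only anno_info is rewritten; project_value stays exactly as stored.
    conn.execute(
        "UPDATE projects SET anno_info = ? WHERE registry_path = ?",
        (json.dumps(anno_info), registry_path),
    )
    return anno_info["n_requests"]


record_request("nfcore/demoRNA")
count = record_request("nfcore/demoRNA")
stored_pep = conn.execute("SELECT project_value FROM projects").fetchone()[0]
```

The read-modify-write here is fine for a single connection; with concurrent writers it would need a transaction or an in-database JSON update.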

khoroshevskyi commented 1 year ago

Seems like everything was discussed.