Closed khoroshevskyi closed 1 year ago
For pephub, the only columns that are necessary are the namespace and project_id. All else is stored in the object representation of the PEP, I believe.
I just added a namespace column. The database now has: id, project_name, project_value, description, namespace, n_samples_project
maybe version? do you want to allow pephub to host different versions/tags of the same project?
Even the description/n_samples_project columns might not be necessary... That will most likely be held in the actual PEP object.
probably good to order the columns logically (put namespace before project_name)
Might you want some of this information stored in the database for easier access, without having to parse or read the whole project, though?
What I might do is just have a JSON column, where we would put any cached annotation information, like the number of samples. That way, we can be flexible and add additional information there if it becomes relevant, without having to change the table schema.
You can find this information in the JSON PEP object, but if we store the description and number of samples separately, it will be easier to access this data.
I think the JSON column is a good idea! So the new column will be anno_info
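A minimal sketch of what a table with such a JSON column could look like. It uses an in-memory SQLite database only so the example is self-contained; the actual database backend, the `add_project` and `get_annotation` helpers, and the shape of the PEP dictionary (a `samples` list) are assumptions for illustration, not the pepagent implementation. The column names come from the list above plus the proposed anno_info.

```python
import json
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        namespace TEXT NOT NULL,
        project_name TEXT NOT NULL,
        project_value TEXT NOT NULL,  -- the whole PEP, serialized as JSON
        anno_info TEXT NOT NULL,      -- cached annotations, serialized as JSON
        UNIQUE (namespace, project_name)
    )
    """
)


def add_project(namespace, project_name, pep, description=""):
    """Hypothetical helper: store a PEP plus annotations derived from it."""
    anno_info = {"n_samples": len(pep["samples"]), "description": description}
    conn.execute(
        "INSERT INTO projects (namespace, project_name, project_value, anno_info) "
        "VALUES (?, ?, ?, ?)",
        (namespace, project_name, json.dumps(pep), json.dumps(anno_info)),
    )


def get_annotation(namespace, project_name):
    """Hypothetical helper: read only the cached annotations, not the PEP."""
    row = conn.execute(
        "SELECT anno_info FROM projects WHERE namespace = ? AND project_name = ?",
        (namespace, project_name),
    ).fetchone()
    return json.loads(row[0]) if row else None


# Assumed PEP shape: a dict with a "samples" list.
demo_pep = {"name": "demoRNA", "samples": [{"sample_name": "s1"}, {"sample_name": "s2"}]}
add_project("nfcore", "demoRNA", demo_pep, description="demo project")
annotation = get_annotation("nfcore", "demoRNA")
```

New keys can be added to the anno_info dictionary later without altering the table, which is the flexibility argued for above.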
Storing it in the database was my gut reaction too, but I thought it might be redundant if it's already there in the PEP, e.g.:
```python
# native python
p = projectDB.get_project("nfcore/demoRNA")
n_samples = len(p.samples)

# within pepdb
p = projectDB.get_project("nfcore/demoRNA")
n_samples = p.n_samples  # this isn't native in the peppy API
```
I guess my thought was that a database solves the speed issue, since we are no longer reading from disk and loading up .yaml files, so indexing basic info and stats on a repository of PEPs is no longer an issue.
But then we would have to build the peppy object. We can get rid of that unnecessary processing by just storing this information in the db.
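To make the trade-off concrete, here is a rough sketch of the two access paths for a single stored row. The row layout (`project_value` holding the PEP as JSON, `anno_info` holding cached annotations) and both helper functions are hypothetical; plain `json` parsing stands in for constructing a peppy object.

```python
import json

# Hypothetical stored row: the full PEP and the cached annotations, both as JSON text.
row = {
    "project_value": json.dumps(
        {"samples": [{"sample_name": "s1"}, {"sample_name": "s2"}, {"sample_name": "s3"}]}
    ),
    "anno_info": json.dumps({"n_samples": 3}),
}


def n_samples_from_pep(row):
    """Parse the whole project just to count its samples."""
    pep = json.loads(row["project_value"])
    return len(pep["samples"])


def n_samples_from_cache(row):
    """Read the precomputed value; the project itself is never parsed."""
    return json.loads(row["anno_info"])["n_samples"]
```

Both paths return the same number as long as the cached value is refreshed whenever the PEP changes; that synchronization is the cost of the shortcut.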
Maybe we could have another table. Table 1 stores the registry path (#3) and the PEP as JSON. Table 2 stores the registry path and then annotated info: n_samples, description, etc.
Doesn't it just add complexity to the db? Each row would point to exactly one row in the second table. I don't see the benefit of it (if we are talking about tables).
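For comparison, the two-table layout could be sketched roughly as follows. The table names `peps` and `annotations` are made up for illustration, and SQLite is used only to keep the example runnable. Because both tables share the registry path as their primary key, the relationship is one-to-one, and every lookup of annotated info alongside the PEP needs a join.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Table 1: registry path and the PEP as JSON
    CREATE TABLE peps (
        registry_path TEXT PRIMARY KEY,
        pep TEXT NOT NULL
    );
    -- Table 2: registry path and annotated info (one row per PEP)
    CREATE TABLE annotations (
        registry_path TEXT PRIMARY KEY REFERENCES peps (registry_path),
        n_samples INTEGER NOT NULL,
        description TEXT
    );
    """
)

pep = {"samples": [{"sample_name": "s1"}]}  # assumed PEP shape
conn.execute("INSERT INTO peps VALUES (?, ?)", ("nfcore/demoRNA", json.dumps(pep)))
conn.execute(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    ("nfcore/demoRNA", len(pep["samples"]), "demo project"),
)

# Reading the PEP's path together with its annotations requires a join.
joined = conn.execute(
    "SELECT p.registry_path, a.n_samples, a.description "
    "FROM peps AS p JOIN annotations AS a ON a.registry_path = p.registry_path"
).fetchone()
```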
That's true, and a good point. I just feel like if we have another column with an annotation object, we are beginning to duplicate the already created PEP json object.
The difference is that this annotation object is for database use. It's not part of the PEP.
Another example of something you might store in such an object is the number of times a particular PEP has been requested. So, it's kind of metadata about the PEP :exploding_head:. To me it's separate from the PEP, and another column seems like a good choice.
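As a rough illustration of database-side metadata that does not belong in the PEP, the sketch below bumps a request counter inside the annotation column each time a project is fetched. The `n_requests` key, the `get_project` helper, and the single-table layout are all hypothetical, and SQLite is used only so the example runs on its own.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects ("
    "registry_path TEXT PRIMARY KEY, "
    "project_value TEXT NOT NULL, "
    "anno_info TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO projects VALUES (?, ?, ?)",
    ("nfcore/demoRNA", json.dumps({"samples": []}), json.dumps({"n_samples": 0})),
)


def get_project(registry_path):
    """Return the stored PEP and count the request in anno_info."""
    row = conn.execute(
        "SELECT project_value, anno_info FROM projects WHERE registry_path = ?",
        (registry_path,),
    ).fetchone()
    if row is None:
        return None
    pep, anno = json.loads(row[0]), json.loads(row[1])
    # Only the annotation changes; the PEP is returned untouched.
    anno["n_requests"] = anno.get("n_requests", 0) + 1
    conn.execute(
        "UPDATE projects SET anno_info = ? WHERE registry_path = ?",
        (json.dumps(anno), registry_path),
    )
    return pep


first = get_project("nfcore/demoRNA")
second = get_project("nfcore/demoRNA")
stored_anno = json.loads(
    conn.execute(
        "SELECT anno_info FROM projects WHERE registry_path = ?", ("nfcore/demoRNA",)
    ).fetchone()[0]
)
```

The read-modify-write update here is not safe under concurrent requests; a real service would need an atomic increment or a transaction around it.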
Seems like everything was discussed.
Please comment on this issue, or edit it, if you have any comments or suggestions about which columns or tables should be added to the db, or which functionality should be added to pepagent. @nsheff @nleroy917