pepkit / pepdbagent

Database for storing sample metadata
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Code simplification #40

Closed khoroshevskyi closed 1 year ago

khoroshevskyi commented 1 year ago

In this PR, I deleted methods that are not used in pephub. In my oppinion, some of this methods are usefull and and were written by @nleroy917 request.

-- Could you please check if this fuctions can be deleted, and won't be used?

I haven't deleted get_project_annotation method, as this is one way to retrieve annotation from project annotation column.

nsheff commented 1 year ago

In my oppinion, some of this methods are usefull and and were written by @nleroy917 request.

So are you saying you'd prefer not to delete them, though you're doing that in this PR? Which ones specifically do you want to keep?

Is there any way for you to maintain the functionality while still simplifying the code, by reducing redundancy? In other words, can you re-use some code to keep these?

For example, get_projects_annotation_by_namespace and get_projects_annotation_by_namespace_tag should probably be merged into 1 function, with a tag parameter defaulting to None

khoroshevskyi commented 1 year ago

For example, get_projects_annotation_by_namespace and get_projects_annotation_by_namespace_tag should probably be merged into 1 function, with a tag parameter defaulting to None

As far, as we are not using this function, I think we can delete them. I changed my mind about one function, I decided to leave get_namespace_annotation, as this method is usefull for pephub (or will be) I think now ready to be merged.

nsheff commented 1 year ago

I'm fine with keeping functions that you think are useful, even if they are not currently being used by pephub -- we may use this outside of pephub (the point of making it modular), so useful functions should be retained...

but just don't make those useful functions redundant, and be judicious about what you define as useful. does that make sense?

are there any others you think are worth retaining?

khoroshevskyi commented 1 year ago

I don't think any other important function left. If we will need them in the future, we can rewrite them and fitting to specific pephub needs.

nsheff commented 1 year ago

this branch is several commits behind the master branch, can you pull those in?

khoroshevskyi commented 1 year ago

this branch is several commits behind the master branch, can you pull those in?

done

nsheff commented 1 year ago

can you clarify the parameters in the constructor of the Connection object?

eg, what is dsn ? https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L35

maybe write a docstring there

nsheff commented 1 year ago

don't mix camel case and underscores: https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L45

maybe use pg_connection ?

nsheff commented 1 year ago

What will upload_project do with a duplicate project?

Write in the docstring here: https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L75

rafalstepien commented 1 year ago

don't mix camel case and undercores:

https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L45

maybe use pg_connection ?

Also you can completely get rid of using dsn and just ask to provide port, host, login, password, database and create the connection string inside the class so that user does not need to worry about, like that:

class Connection:
    def __init__(self, host, port, login, password, database):
        self.connection_string = f"postgresql://{login}:{password}@{host}:{port}/{database}"
nsheff commented 1 year ago

Can you clarify this docstring? https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L96

Do you mean: "If true, the project will be updated"

khoroshevskyi commented 1 year ago

After this we should change pephub, as it uses dsn only ... @nleroy917 ?

khoroshevskyi commented 1 year ago

Can you clarify this docstring?

https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L96

Do you mean: "If true, the project will be updated"

It means, if there is project with the same key, then that project will be updated

nsheff commented 1 year ago

Can you clarify why there is update_project and upload_project, when upload_project has an update=True flag?

Also they have the same docstring. This is a confusing way to structure this.

khoroshevskyi commented 1 year ago

What will upload_project do with a duplicate project?

Write in the docstring here:

https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L75

Duplicated projects won't be uploaded. But if update is set true, then that project will be updated

nsheff commented 1 year ago

Simplify this docstring: https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L326-L327

specify what this function does do (not just what it doesn't do): https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L380 -- what eaxctly is it getting?

nsheff commented 1 year ago

don't mix camel case and undercores: https://github.com/pepkit/pepdbagent/blob/b1cfe46f66a97d3415e6452be23736909742dcdb/pepdbagent/pepdbagent.py#L45

maybe use pg_connection ?

Also you can completely get rid of using dsn and just ask to provide port, host, login, password, database and create the connection string inside the class so that user does not need to worry about, like that:

class Connection:
    def __init__(self, host, port, login, password, database):
        self.connection_string = f"postgresql://{login}:{password}@{host}:{port}/{database}"

I think it's fine to provide this as an alternative connection method, but it needs to be clear in the docstring

khoroshevskyi commented 1 year ago

Can you clarify why there is update_project and upload_project, when upload_project has an update=True flag?

Also they have the same docstring. This is a confusing way to structure this.

update_project is responsible only for updating projects. upload_project is responsible for uploading project. If update=True argument is set then: if the project, that user wants to upload already exists, upload_project will use update_project and project with the same registry path will be updated.

nleroy917 commented 1 year ago

RE: dsn or connection string, I think sometimes it's advantageous to specify the driver SQLAlchemy uses underneath through the directive/scheme.

Im fine removing it. I believe pephub just uses the host/user/pass method.

nsheff commented 1 year ago

Looking good, just a few more comments:

Can you please define "Annotations" maybe in this docstring: https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepannot.py#L24

I don't really understand what this class is for and I can't figure it out from the code

Rename to just 'commit' or 'commit_to_database'? https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L75

Is the user supposed to use upload_project(update=True) or update_project? Instead of making 2 user-facing ways to do the same thing, maybe change update_project to _update_project to show it's an internal function?

Maybe rename to get_project_by_registry_path ? https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L264

similarly, here: https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L510 and here: https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L606

I don't think you need the intermediate variables here, just use the indexed array directly: https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L624-L628

and here: https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L522-L524

can you update this for digest creation: https://github.com/pepkit/pepdbagent/blob/97baff4ad380cbd5d996e77d050c0d22ce6f7996/pepdbagent/pepdbagent.py#L682

to use this:

json.dumps(
data,
separators=(',', ':'),
ensure_ascii=False,
allow_nan=False,
sort_keys=True,
) 
khoroshevskyi commented 1 year ago

I believe, that I fixed all issues that you mentioned.

Two comments were not addressed:

  1. update_project. In my opinion both upload project (update=True) an update_project have to be in pepdbagent. Why? Because there can be 3 update cases: a. project doesn't exist in database, and user wants to upload it, and if exists we want to update it. (use upload_project) b. project doesn't exist in database, and user wants to update it, only if it exists. (use update_project) c. upload project if it doesn't exist (use upload_project)

The a. option is very helpful for constant update of the projects. e.g. using geofinder and geofetch.

  1. Intermediate variables. They are used in the code, so they are still in the code
rafalstepien commented 1 year ago

I don't understand what's the return of get_namespaces_info_by_list. Is it supposed to return list of all namespaces? If so, why you must pass a list? Why this list can be none?

khoroshevskyi commented 1 year ago

I don't understand what's the return of get_namespaces_info_by_list. Is it supposed to return list of all namespaces? If so, why you must pass a list? Why this list can be none?

This method is returning information about namespaces in the list. If the list is None, then it will return list of informations about all namespaces ([namespace1_info, namespace2_info,...]) In my oppinion it should be separated into two methods. what do you think?

rafalstepien commented 1 year ago

Imo either the docstring should be updated to describe the function behavior because it's confusing, or the function should be splitted to get_all_namespaces, get_namespace_info, and get_namespaces_info_by_list

nsheff commented 1 year ago
1. update_project. In my opinion both `upload project` (update=True) an `update_project` have to be in pepdbagent. Why? Because there can be 3 update cases:
   a. project doesn't exist in database, and user wants to upload it, and if exists we want to update it. (use upload_project)
   b. project doesn't exist in database, and user wants to update it, only if it exists. (use update_project)
   c. upload project if it doesn't exist (use upload_project)

This should be: add_project(overwrite=True, update_only=False)

a: add_project(overwrite=True, update_only=False) b: add_project(update_only=True) c: add_project(overwrite=False)

nsheff commented 1 year ago

Also, see: Postgres: ON CONFLICT DO UPDATE for your insert statement, to avoid 2 separate sql statements.