volkamerlab / opencadd

A Python library for structural cheminformatics
https://opencadd.readthedocs.io
MIT License
91 stars 18 forks source link

Add opencadd.databases.klifs module #39

Closed dominiquesydow closed 4 years ago

dominiquesydow commented 4 years ago

Description

Refactor klifs module with object-oriented API for remote and local requests, using remote and local session similar to requests sessions as proposed by @jaimergp in here.

New module structure:

opencadd/
    databases/
        klifs/
            api.py  # Defines the API for local and remote sessions.
            local.py  # Defines the API for local queries.
            remote.py  # Defines the API for remote queries.
            core.py  # Defines the parent classes used in the local and remote modules.
            utils.py  # Defines utility functions.
            schema.py  # Defines the schema for class method return values.
            parser.py  # Defines parsing functions (mol2/pdb to DataFrame/RDKit)

Todos

Questions

Status

dominiquesydow commented 4 years ago

Hi @jaimergp,

Could you please review the object-oriented API I am suggesting here?

A few months ago, we talked about requests-like sessions - this is how I would implement the idea.

dominiquesydow commented 4 years ago

Please wait with the review - I will move the code from the notebook to an actual module.

dominiquesydow commented 4 years ago

@jaimergp, now the API is ready for review :)

dominiquesydow commented 4 years ago

Good morning @jaimergp,

I have a quick question on best practices.

In opencadd.databases.klifs, we need to streamline column names for all the different remote and local tables (it is a wild mix of names).

I defined the renaming here at the top of the utils module: https://github.com/volkamerlab/opencadd/blob/databases_klifs_api/opencadd/databases/klifs_new/utils.py

Is it ok to define the mapping (two large dictionaries) in the utils module or should this live in e.g. a json file (if so, where would that json file go)?

jaimergp commented 4 years ago

The mapping can (and I'd even say should) live on a Python file, so your first idea is spot on :) Since this is a static definition, it might be even worth its own module, like opencadd.databases.klifs.definitions or .schema.

dominiquesydow commented 4 years ago

Nice! I like schema a lot! Thank you!

codecov-commenter commented 4 years ago

Codecov Report

Merging #39 into master will increase coverage by 32.17%. The diff coverage is 93.26%.

dominiquesydow commented 4 years ago

Hi @jaimergp,

This PR is almost ready to be merged - with one question open (see PR description).

The tutorial shows more or less all the functionalities of this module: https://github.com/volkamerlab/opencadd/blob/databases_klifs_api/docs/tutorials/databases_klifs.ipynb

Should we do a full code review here before merging to master? How do you propose to do this since the PR is pretty big?

jaimergp commented 4 years ago

I cannot use file directly in my unit test files (using currently name instead). file does not return absolute paths to my data, what am I missing? Maybe missing `init.py' files?

Can you link to specific instances where this is happening? Sorry, I am catching up on all notifications and there's a looot to cover :)

dominiquesydow commented 4 years ago

@jaimergp, I figured it out - I think :)

pytest is always run from the package top level directory, here opencadd/. That means whenever I use __name__ in file paths (to load test data), I need to set the path from that package top level directory to the directory with my test files.

This will work:

PATH_TEST_DATA = (
    Path(__name__).parent / "opencadd" / "tests" / "databases" / "data" / "KLIFS_download"
)

This will not (since I am omitting a few subdirectories between the package top level directory and the test data directory):

PATH_TEST_DATA = (
    Path(__name__).parent / "data" / "KLIFS_download"
)

However this will work, since __file__ really gets the path from the unit test file, not the package top level directory:

PATH_TEST_DATA = (
    Path(__file__).parent / "data" / "KLIFS_download"
)
dominiquesydow commented 4 years ago

Let me know if you want a different opencadd/opencadd/tests directory structure.

At the moment:

tests/
    databases/
        data/
            KLIFS_download/
        test_klifs_submodule.py
    structures/
        test_superposition_submodule.py

We also could move data/ to tests/ like this?

tests/
    data/
        klifs/  # Use here module names?
    databases/
        test_klifs_submodule.py
    structures/
        test_superposition_submodule.py
jaimergp commented 4 years ago

I prefer a single, top-level tests.data directory! Easier to manage. Inside that directory we can stick to the package hierarchy or just a flat list if the amount of files is not huge. I usually default to start simple if possible (flat list here).

dominiquesydow commented 4 years ago

I prefer a single, top-level tests.data directory! Easier to manage. Inside that directory we can stick to the package hierarchy or just a flat list if the amount of files is not huge. I usually default to start simple if possible (flat list here).

Perfect!

We now have the following:

tests/
    data/
        klifs/  # Contains the KLIFS files in the folder structure as set in the KLIFS download
            HUMAN/
            MOUSE/
            KLIFS_export.csv
            KLIFS_metadata.csv
            overview.csv
    databases/
        test_klifs_submodule.py
    structures/
        test_superposition_submodule.py
dominiquesydow commented 4 years ago

Merge this branch into add_io_klifs_subpockets branch, see PR #44.