plinder-org / plinder

Protein Ligand INteraction Dataset and Evaluation Resource
https://plinder.sh
Apache License 2.0

Add Python API tutorial notebook #30

Closed yusuf1759 closed 2 months ago

yusuf1759 commented 2 months ago

This PR converts api.md to api.ipynb and adds more context to the tutorials.

yusuf1759 commented 2 months ago

Thanks for authoring this. I think there are still a few remaining issues:

  • Could you remove the parallel api.md?
  • There are some occurrences where 'PLINDER' is not written in upper case, although it does not refer to the Python package.
  • Some markdown formatting is missing: for example, some paths are currently not rendered in monospace and the note does not use the :::{note} directive.
  • One cell is failing.
  • The Overview section is empty. I think it would be good to introduce the user to the 'idea' behind the public API, for example how it is split into subpackages.

@padix-key which of the cells is failing? They all seem to be passing on my end.

padix-key commented 2 months ago

The error appears at this code cell:

from plinder.core import get_plindex
annotation_df = get_plindex()
annotation_df.head()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[3], line 2
      1 from plinder.core import get_plindex
----> 2 annotation_df = get_plindex()
      3 annotation_df.head()

File ~/Documents/coding/plinder/src/plinder/core/utils/dec.py:23, in timeit.<locals>.wrapped(*args, **kwargs)
     21 result = None
     22 try:
---> 23     result = func(*args, **kwargs)
     24     log.info(f"runtime succeeded: {time() - ts:.2f}s")
     25 except Exception:

File ~/Documents/coding/plinder/src/plinder/core/index/utils.py:48, in get_plindex(cfg)
     46 cfg = cfg or get_config()
     47 suffix = f"{cfg.data.index}/{cfg.data.index_file}"
---> 48 index = cpl.get_plinder_path(rel=suffix)
     49 LOG.info(f"reading {index}")
     50 _PLINDEX = pd.read_parquet(index)

File ~/Documents/coding/plinder/src/plinder/core/utils/cpl.py:130, in get_plinder_path(rel, download)
    128 cfg = get_config()
    129 root = _get_fsroot(cfg)
--> 130 client = GSClient(local_cache_dir=root)
    131 remote = cfg.data.plinder_remote
    132 if rel:

File ~/conda/envs/plinder/lib/python3.10/site-packages/cloudpathlib/gs/gsclient.py:101, in GSClient.__init__(self, application_credentials, credentials, project, storage_client, file_cache_mode, local_cache_dir, content_type_method, download_chunks_concurrently_kwargs)
     99 else:
    100     try:
--> 101         self.client = StorageClient()
    102     except DefaultCredentialsError:
    103         self.client = StorageClient.create_anonymous_client()

File ~/conda/envs/plinder/lib/python3.10/site-packages/google/cloud/storage/client.py:227, in Client.__init__(self, project, credentials, _http, client_info, client_options, use_auth_w_custom_endpoint, extra_headers)
    224             no_project = True
    225             project = "<none>"
--> 227 super(Client, self).__init__(
    228     project=project,
    229     credentials=credentials,
    230     client_options=client_options,
    231     _http=_http,
    232 )
    234 # Validate that the universe domain of the credentials matches the
    235 # universe domain of the client.
    236 if self._credentials.universe_domain != self.universe_domain:

File ~/conda/envs/plinder/lib/python3.10/site-packages/google/cloud/client/__init__.py:320, in ClientWithProject.__init__(self, project, credentials, client_options, _http)
    319 def __init__(self, project=None, credentials=None, client_options=None, _http=None):
--> 320     _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
    321     Client.__init__(
    322         self, credentials=credentials, client_options=client_options, _http=_http
    323     )

File ~/conda/envs/plinder/lib/python3.10/site-packages/google/cloud/client/__init__.py:271, in _ClientProjectMixin.__init__(self, project, credentials)
    268     project = self._determine_default(project)
    270 if project is None:
--> 271     raise EnvironmentError(
    272         "Project was not passed and could not be "
    273         "determined from the environment."
    274     )
    276 if isinstance(project, bytes):
    277     project = project.decode("utf-8")

OSError: Project was not passed and could not be determined from the environment.

For me the error appears in both cases, when I execute the notebook directly and when I execute it via sphinx-build.

yusuf1759 commented 2 months ago

This looks like a cloud credential issue. @tjduigna should be able to help. Ideally this shouldn't happen since the bucket is public. If you run gcloud config set project vantai-analysis it should be fine, but external users shouldn't have to do that.
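One way to read the traceback: the fallback to an anonymous client in GSClient only catches DefaultCredentialsError, while the error actually raised is an EnvironmentError (an alias of OSError), so the fallback is never reached. The sketch below reproduces that control flow with stand-in functions only. `make_authenticated_client`, `make_anonymous_client`, and `build_client` are hypothetical names for illustration and are not part of cloudpathlib or google-cloud-storage.

```python
class DefaultCredentialsError(Exception):
    """Stand-in for google.auth.exceptions.DefaultCredentialsError."""


def make_authenticated_client():
    # Hypothetical stand-in for StorageClient(): mimics the failure in the
    # traceback, where no project can be determined from the environment.
    raise EnvironmentError(
        "Project was not passed and could not be determined from the environment."
    )


def make_anonymous_client():
    # Hypothetical stand-in for StorageClient.create_anonymous_client().
    return "anonymous-client"


def build_client():
    # Same shape as the try/except shown in the traceback.
    try:
        return make_authenticated_client()
    except DefaultCredentialsError:
        # Only reached for missing credentials, not for a missing project.
        return make_anonymous_client()


try:
    client = build_client()
except OSError as err:
    # EnvironmentError is an alias of OSError in Python 3, so it lands here.
    outcome = f"uncaught by fallback: {err}"
else:
    outcome = f"fallback used: {client}"

print(outcome)
```

Under this reading, anything that lets the project lookup succeed avoids the error, which is consistent with the gcloud config suggestion above.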

yusuf1759 commented 2 months ago

Adding a dummy project with os.environ["GCLOUD_PROJECT"] = "my-project" seems to fix the issue.
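A minimal sketch of that workaround, assuming it runs before the first call that creates the storage client. `ensure_gcloud_project` is a hypothetical helper written for this example, not a PLINDER function. It leaves an existing value untouched so that a real project configuration is not overwritten.

```python
import os


def ensure_gcloud_project(placeholder: str = "my-project") -> str:
    """Set GCLOUD_PROJECT to a placeholder unless it is already defined."""
    # setdefault keeps any value that is already present in the environment.
    return os.environ.setdefault("GCLOUD_PROJECT", placeholder)


project = ensure_gcloud_project()
print(f"GCLOUD_PROJECT={project}")

# With the variable in place, the failing cell from this thread can be retried:
# from plinder.core import get_plindex
# annotation_df = get_plindex()
```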

padix-key commented 2 months ago

This cell now works for me. However, now another cell fails:

from plinder.core.scores import query_links
query_links()
---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[13], line 2
      1 from plinder.core.scores import query_links
----> 2 query_links()

File ~/Documents/coding/plinder/src/plinder/core/utils/dec.py:23, in timeit.<locals>.wrapped(*args, **kwargs)
     21 result = None
     22 try:
---> 23     result = func(*args, **kwargs)
     24     log.info(f"runtime succeeded: {time() - ts:.2f}s")
     25 except Exception:

File ~/Documents/coding/plinder/src/plinder/core/scores/links.py:52, in query_links(columns, filters)
     43 query = make_query(
     44     dataset=dataset,
     45     filters=filters,
   (...)
     49     include_filename=True,
     50 )
     51 assert query is not None
---> 52 df = sql(query).to_df()
     53 df["kind"] = df["filename"].apply(lambda x: Path(x).stem.split("_links")[0])
     54 return df

File ~/conda/envs/plinder/lib/python3.10/site-packages/duckdb/__init__.py:457, in sql(query, **kwargs)
    455 else:
    456     conn = duckdb.connect(":default:")
--> 457 return conn.sql(query, **kwargs)

IOException: IO Error: No files found that match the pattern "/Users/kunzmann/.local/share/plinder/2024-04/tutorial/links/*.parquet"

padix-key commented 2 months ago

Now the notebook works :+1:

padix-key commented 2 months ago

Could you also cover the data loader in a final section?

yusuf1759 commented 2 months ago

We are holding off on this for now.

padix-key commented 2 months ago

I reformatted some parts of the tutorial in my latest commit. I think two sections could be more descriptive:

  • The PlinderSystem is introduced, but the user gets no information on what can be done with it.
  • I think it does not become clear enough what the table returned by query_links() represents.

Note that you can reference functions and classes with the Sphinx roles {func}`some_func()` and {class}`SomeClass`, respectively. In the rendered docs, these become helpful links that point to the respective page in the API reference. In addition, I found some headings which were simply rendered as bold with *<some heading>*. If a Markdown heading (i.e. one or multiple #, depending on hierarchy) is used instead, the output is rendered more nicely and the section appears in the sidebar.

yusuf1759 commented 2 months ago

Updated to reflect these changes.

padix-key commented 2 months ago

The section about PlinderSystem mentions a System and an Entry class. However, these are not part of the public API. Or will they become part of it?

padix-key commented 2 months ago

And load_systems() is not part of it either, at least currently. Should we include this function in the public API?

padix-key commented 2 months ago

I pushed a clean-up commit. From my side, only the questions regarding the API exposed to the user need to be decided before merging.

github-actions[bot] commented 2 months ago

Coverage report

File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
src/plinder/core/utils/cpl.py
Project Total

This report was generated by python-coverage-comment-action

yusuf1759 commented 2 months ago

And load_systems() is not part of it either, at least currently. Should we include this function in the public API?

I removed this.

yusuf1759 commented 2 months ago

I pushed another clean-up commit that addresses all the issues highlighted here. @Ninjani @padix-key Let me know if I missed anything.