opencdms-components / opencdms-rclimatic

opencdms-rclimatic is an OpenCDMS component for working with the R `cdms.products` library
https://www.opencdms.org
MIT License
0 stars 3 forks source link

Adding RInstatClimatic Processes to `opencdms-process` #25

Open isedwards opened 2 years ago

isedwards commented 2 years ago

@faysal-ishtiaq recently updated the Climatol windrose example in this repository to work with OGC API - Processes (using pygeoapi)

@dannyparsons @lloyddewit @ChrisMarsh82 - we can ask initial questions here in this issue:

  1. Should Python wrapper functions validate the parameter values? Validation will be done in the R functions. Is validation at the Python level also needed?
  2. Should unit tests be written for the Python wrapper functions? Unit tests will be written in the R package. Is unit testing at the Python level also needed?
  3. Is the format for how Python wrapper functions should call and interact with R well defined?
  4. What aspects of the OGC Web Process Service standards do we need to be aware of when writing these wrappers?
isedwards commented 2 years ago

Based on my current understanding (but it would be good to hear @faysal-ishtiaq's thoughts on these):

  1. The Python wrapper functions should specify the required arguments with type annotations. Validation metadata/documentation must be available (e.g. informing the user of a valid range of values) but the actually validation can be performed in the R code as long as the outcome of validation can be passed back to Python to report to the user appropriately, see this related discussion.
  2. If RInstatClimatic Python wrapper functions exist in this repository then we should be able to execute corresponding tests here - but I see no reason why the R tests couldn't be called from Python
  3. See the Climatol windrose example and current README (which needs improving). Feel free to ask lots of questions.
  4. Hopefully the updated Climatol example goes a long way to answering this question. We should be able to fully answer this through testing that everything works as a user would expect.

Note:

isedwards commented 2 years ago

Additional questions

  1. Should we be developing a template for the wrapper functions to clarify things such as standardized comment headers, type hints, parameter validation, exception handling, coding standards etc.?
  2. How and where are date-time columns requested/produced?
    • Will data obtained from the Data API always include a date-time column in the same format with the same name?
    • Can other date-like columns be requested from the Data API? e.g. year, month, day of the year? We suggest that the R functions can optionally create them if not provided - as long as a date column is given.

Answers

  1. By "template" I assume you mean a documented example of a typical wrapper function showing the expected approach? We certainly need this. In addition, the intention is to have base classes that will provide the structure (along with useful docstrings). Below are some relevant comments from a recent discussion in the opencdms-process repository:

    I think we're going to end up with generic classes like class Process() and class ImageProcess(Process) in pyopencdms, e.g. from opencdms.process import ImageProcess.

    We may then have specific classes like Windrose(opencdms.process.ImageProcess) located in opencdms_process in a climatol module (from opencdms_process.climatol import Windrose).

  2. Yes, the Data API will not provide separate/customisable Y, M and day columns - but these could still be available to processes through "process chaining" where the output of the "date_components" process is then passed to the next process.

faysal-ishtiaq commented 2 years ago

@isedwards

Regarding your first two question and current thoughts, I agree with you 100%. As long as, we can see where our tests or validation has gone wrong, we can do it either in Python or R, it really doesn't matter.

Regarding point 3, as of the current code, windrose process will not run because of this line https://github.com/opencdms/opencdms-process/blob/c6ad39f1555869cda44be86708124c1cf450ba2b/opencdms_process/process/climatol/windrose_generator.py#L140, but I kept it intetionally for your comment on this situation. I personally think that the parameters should be well defined.

Regarding question and thought 4, I think we should keep the flexibilty to switch between data source, i.e. csv file or db engine. Regarding R code in this repository, I think any amount of R code is acceptable as long as we don't paste third party R code here and our embedded R code passes all the tests.

I am going through your last comment, please allow me a few minutes.

faysal-ishtiaq commented 2 years ago

@isedwards

Additional questions

  1. Should we be developing a template for the wrapper functions to clarify things such as standardized comment headers, type hints, parameter validation, exception handling, coding standards etc.?

I think that we should document as much as we can. When a developer finds a repository with poor/difficult to understand documentation, they will replace it with some other library as long as it is well documented and they don't have to sacrifice much of the flexibility.

  1. How and where are date-time columns requested/produced?

    • Will data obtained from the Data API always include a date-time column in the same format with the same name?
    • Can other date-like columns be requested from the Data API? e.g. year, month, day of the year? We suggest that the R functions can optionally create them if not provided - as long as a date column is given.

I have written some scripts to clean data or to apply expected format to datetime columns. I guess we can start with documenting those.

Answers

  1. By "template" I assume you mean a documented example of a typical wrapper function showing the expected approach? We certainly need this. In addition, the intention is to have base classes that will provide the structure (along with useful docstrings). Below are some relevant comments from a recent discussion in the opencdms-process repository:

    I think we're going to end up with generic classes like class Process() and class ImageProcess(Process) in pyopencdms, e.g. from opencdms.process import ImageProcess.

I have written a generic dataprocessor class here https://github.com/opencdms/opencdms-process/blob/c6ad39f1555869cda44be86708124c1cf450ba2b/opencdms_process/process/climatol/windrose_generator.py#L94

I haven't put a lot of thought into it, but the way I wrote it, it felt convenient at the moment. Highly appreciate your feedback here.

We may then have specific classes like Windrose(opencdms.process.ImageProcess) located in _opencdmsprocess in a climatol module (from opencdms_process.climatol import Windrose).

  1. Yes, the Data API will not provide separate/customisable Y, M and day columns - but these could still be available to processes through "process chaining" where the output of the "date_components" process is then passed to the next process.

We can document the current datetime format so that people can use it to convert the string to datetime.datetime and then they can manipulate it the way they want.

lloyddewit commented 2 years ago

How and where are date-time columns requested/produced?

  • Will data obtained from the Data API always include a date-time column in the same format with the same name?
  • Can other date-like columns be requested from the Data API? e.g. year, month, day of the year? We suggest that the R functions can optionally create them if not provided - as long as a date column is given.

Answers

Yes, the Data API will not provide separate/customisable Y, M and day columns - but these could still be available to processes through "process chaining" where the output of the "date_components" process is then passed to the next process.

@isedwards I assume that the 'yes' refers to the first part of the question and the Data API will always include a date-time column in the same format with the same name. If we want date-time in a different format for our R processing, then we'll need to convert the data-time info from the Data API format to the format we require. Is that correct?

Do we already know what the Data API date-time format will look like?

Thanks

isedwards commented 2 years ago

@lloyddewit - I believe the API returns ISO 8601 dates. Yes, you will need to convert these to the format that you need

@faysal-ishtiaq - are we close to having a test API end point for Climsoft at api.opencdms.org? (apologies if it's already there and I've forgotten the location - we should improve the service so that it's easier to navigate to the required data)

isedwards commented 2 years ago

Apologies if that was the wrong answer... I've been thinking about the web app all day.

If you are making a call from the Python API the data will be a Python datetime object.

isedwards commented 2 years ago

@faysal-ishtiaq I just reread your comment above. Is there a reason why the Climsoft date isn't currently returned as a Python datetime object? obsDatetime has DATETIME type in the database. Is it returned as text by SQLAlchemy? If so, could you raise an issue in the pyopencdms repository to discuss this further?

faysal-ishtiaq commented 2 years ago

@faysal-ishtiaq - are we close to having a test API end point for Climsoft at api.opencdms.org? (apologies if it's already there and I've forgotten the location - we should improve the service so that it's easier to navigate to the required data)

I implemented it and added some basic tests and it was working. Later Shaibu made some changes and I am not sure about it's current state. I tried to go to https://api.opencdms.org/climsoft but it's returning 404. Probably Shaibu can tell us if there is any modification needed.

faysal-ishtiaq commented 2 years ago

@faysal-ishtiaq I just reread your comment above. Is there a reason why the Climsoft date isn't currently returned as a Python datetime object? obsDatetime has DATETIME type in the database. Is it returned as text by SQLAlchemy? If so, could you raise an issue in the pyopencdms repository to discuss this further?

I saw different formats on different tables, so I left them in their respective format. I am raising an issue.

dannyparsons commented 2 years ago

@isedwards The R-Instat R package (now called cdms.products) has a testing infrastructure (https://github.com/IDEMSInternational/cdms.products/tree/main/tests) using the testthat R package, which is now the standard way of implementing tests for an R package.

Note that the default way of structuring an R package means that tests are not included when the package is built and hence not available to call on an installed package without also having the package source code. They are generally seen as a tool for package development as the package build will fail if any tests fail.

However, there is a way to include tests in the package build and I don't see any reason not to do this. So we can do this for cdms.products so that tests can be run via Python by simply calling testthat::test_package("cdms.products") in R.

I'm just raising this point as in general most R packages do not do this and so their unit tests are generally not accessible. While this is only needed for R packages developed specifically for OpenCDMS this won't be a problem as we can ensure that tests are available and avoid duplication of test writing.

Of course it is also possible to write tests externally to an R package with the testthat package and just store them in .R files so we could always write tests for a package if they are not available or the tests do not serve the specific needs. This may mean some duplication of tests doing it this way, so for our own R packages, I think it makes sense to keep the tests internal and make them accessible externally.

lloyddewit commented 2 years ago

@isedwards @dannyparsons I support @dannyparsons 's suggestion that the Python test calls testthat::test_package("cdms.products") in R. I understand that this will return a summary of the test results to Python (example here). The Python test could then check the test results using a normal assert. The R tests would ensure that the R package works as expected. The Python test layer would also need some additional tests to ensure that the data conversions between Python and R are also working as expected. Together these 2 types of tests should fullfill our testing requirements. The advantage of Python calling the R tests is that it removes test duplication.

In summary, I like @dannyparsons 's suggestion. When he has an (initial) implementation of testthat::test_package, then I'm happy to try and get it working in the Python unit tests.

isedwards commented 2 years ago

Thank you @dannyparsons @lloyddewit - this sounds perfect to me.

In general, when OpenCDMS is using an established third-party package, it won't be necessary to run the tests here. However, with this being the first substantial set of processes that are available it is ideal that we're able to demonstrate the full set of working tests from within the opencdms-process repository.