microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

CI/CD should include tests for endpoints used in example notebooks #456

Open kheal opened 5 months ago

kheal commented 5 months ago

We want to make sure the example notebooks in this repo: https://github.com/microbiomedata/notebook_hackathons do not break when changes are pushed to the NMDC-runtime API.

The following endpoints are used (with example tests).
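
Since the endpoint list itself didn't survive here, the following is a hedged sketch of the kind of smoke test meant: a small pytest-style check against one endpoint. The base URL, the `/nmdcschema/{collection}` path, the `biosample_set` collection name, and the `resources` response key are assumptions based on the public NMDC runtime API; swap in the exact endpoints the notebooks call.

```python
# Hedged sketch of an endpoint smoke test for CI, not the thread's
# actual test list. BASE_URL and the /nmdcschema/{collection} path
# are assumptions about the public NMDC runtime API.
BASE_URL = "https://api.microbiomedata.org"


def collection_url(collection: str, max_page_size: int = 10) -> str:
    """Build a metadata-collection request URL like the notebooks use."""
    return f"{BASE_URL}/nmdcschema/{collection}?max_page_size={max_page_size}"


def test_biosample_set_endpoint():
    """Smoke test: the endpoint should answer 200 with a 'resources' list."""
    import requests  # imported lazily so collecting this file needs no extras

    resp = requests.get(collection_url("biosample_set"), timeout=30)
    assert resp.status_code == 200
    assert "resources" in resp.json()
```

A CI job could collect tests like this with pytest so any breaking API change fails the build before the notebooks do.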

kheal commented 5 months ago

@brynnz22 - can you also add an example of the endpoint you used to download the TSVs of the taxonomic information? I couldn't figure out an easy way to access the URL for this step. The URL I'm talking about is in code chunk 30 in this notebook.

brynnz22 commented 5 months ago

@kheal that URL was just taken from the metadata retrieved using the metadata collection endpoint that you already mentioned.

kheal commented 5 months ago

Great - so the three endpoints I point to above will cover the API calls we've used in the notebook, correct? @brynnz22

brynnz22 commented 5 months ago

Yep! That should be right.

PeopleMakeCulture commented 5 months ago

Related to #301

PeopleMakeCulture commented 5 months ago

@kheal Is this the example notebook you're referring to in your first comment? https://github.com/microbiomedata/notebook_hackathons/tree/main/taxonomic_dist_by_soil_layer

Could you share the relevant code chunks in this notebook?

kheal commented 5 months ago

@PeopleMakeCulture - that is one of the notebooks.

These two also use the runtime API: https://github.com/microbiomedata/notebook_hackathons/tree/main/NEON_soil_metadata and https://github.com/microbiomedata/notebook_hackathons/tree/main/bioscales_biogeochemical_metadata (in both the R and python versions, for a total of 5 notebooks).

Do you want/need me to point to each chunk in each notebook (5 notebooks total) that pings the API?

PeopleMakeCulture commented 5 months ago

@kheal Gotcha. The notebook links should be enough. Thanks!

dwinston commented 5 months ago

@kheal @brynnz22 are the notebooks all "quick"? We could potentially just run them all to make sure they don't error, with e.g. papermill:

import papermill as pm

nb_filenames = ["example.ipynb"]  # notebooks to execute (placeholder list)

for nb_filename in nb_filenames:
    try:
        # Execute the notebook, writing the executed copy to a new file.
        pm.execute_notebook(
            nb_filename,
            'output_' + nb_filename,
            parameters=dict(parameter_name='value')
        )
    except pm.exceptions.PapermillExecutionError as e:
        print("An error occurred during execution:", e)
        # Custom error handling or cleanup code here
        raise  # re-raise so pytest/CI marks the run as failed

kheal commented 5 months ago

@dwinston

Unfortunately no. This notebook takes a couple of hours (in part because there is not an easy API route to go from biosample ids to data objects, see #355).

The GET requests I have at the top of this thread are representative examples and should be sufficient as tests to make sure the endpoints are still good.

shreddd commented 4 months ago

We should discuss potential notebook testing options as well, and what makes the most sense. I have some folks in my group who have experience with other Jupyter testing tools like nbmake (https://github.com/treebeardtech/nbmake).
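
For reference, nbmake runs as a pytest plugin, so a CI step for the quick notebooks could look roughly like this. This is an illustrative GitHub Actions fragment, not this repo's actual workflow; the notebook directory paths are assumptions taken from the notebook_hackathons repo layout mentioned above.

```yaml
# Illustrative CI step (assumed GitHub Actions syntax), not the repo's workflow.
# nbmake is a pytest plugin, so notebooks are collected like test files.
- name: Test example notebooks
  run: |
    pip install pytest nbmake
    # Paths are assumptions; point these at the checked-out notebooks.
    pytest --nbmake NEON_soil_metadata/ bioscales_biogeochemical_metadata/
```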

Also fine with papermill, but the typical papermill use case I've seen is centered around running a notebook job in parallel, parameterized across different inputs. Just want to make sure we are using the right tool for the job.

I've also seen this (from the same people at Netflix who made papermill) - https://github.com/nteract/testbook

kheal commented 4 months ago

I should have noted that the other four notebooks in these locations, https://github.com/microbiomedata/notebook_hackathons/tree/main/NEON_soil_metadata and https://github.com/microbiomedata/notebook_hackathons/tree/main/bioscales_biogeochemical_metadata, are pretty quick, and it'd be great to have those tested in the CI/CD.