lsst-epo / citizen-science-notebooks

A collection Jupyter notebooks that can be used to associate Rubin Science Platform data to a Zooniverse citizen science project.
3 stars 1 forks source link

Function-ize the manifest file creation and move it to SDK notebook #30

Closed ericdrosas87 closed 1 year ago

ericdrosas87 commented 1 year ago

Analysis and Design

Changelog

March 27th, 2023 - Added function example showing a class method that can be used to arbitrarily write the CSV manifest file to the RSP Notebook Aspect filesystem

What this enhancement will do:

Enable scientists to create the manifest file programmatically by calling a function that will live in the SDK notebook, and eventually wherever the SDK notebook functionality will eventually live.

Why this enhancement is needed:

Currently, a functional version of the notebook cell that creates the manifest file exists in the Citizen_Science_Testing notebook, but common functionality should exist in the SDK notebook. It is also desirable that this functionality be flexible enough to allow any columns to be appended to the manifest file to enable a broad range of Zooniverse project functionality. A few columns are also required for the RSP Data Exporter service that lives in the EDC to function successfully. As such, a "ManifestFile" class will be created in the SDK notebook that will have functions for appending columns, removing columns, and prepending the mandatory columns required by the RSP Data Exporter service.

How this enhancement will work:

Create manifest file instance:

manifest = ManifestFile("subject set name")

Append new a row to manifest file:

row = {
    "column_1" : "value_1",
    "column_2" : "value_2",
    "column_3" : "value_3"
}

manifest.append(row)

Append new rows to manifest file:

rows = [{
    "column_1" : "value_1",
    "column_2" : "value_2",
    "column_3" : "value_3"
},
{
    "column_1" : "value_4",
    "column_2" : "value_5",
    "column_3" : "value_6"
}]
manifest.append(rows)

Drop column:

manifest.drop_column("column_name")

Log/print manifest file:

manifest.print()

Arbitrarily write CSV to filesystem:

manifest.write_to_filesystem("./some/path")

Output to dictionary:

manifest_dict = manifest.to_dict()

Passing the manifest file object to the EDC:

send_data(batch_directory, manifest)

Requirements

  1. The following fields will be required, but will NOT need to be set by the scientist:
    1. edc_ver_id
  2. The following fields will be required, but WILL need to be set by the scientist:
    1. objectId (e.g.: 112233)
    2. filename (e.g.: "astro_cutout_123.png")
  3. Any additional fields that need to be added can be done so by simply including them in the original dict or array of dicts.
  4. All other Zooniverse manifest file syntax will abided:
    1. https://help.zooniverse.org/getting-started/example/#details-subject-sets-and-manifest-details-aka-what-is-a-manifest

Notes

The onus of abiding by data rights is on the scientist, not the notebook(s).

Desired resolution/approval date

March 28th 2023

beckynevin commented 1 year ago

For the requirements above, in #2 you mention that objID and filename will need to be set by the scientist. I was hoping that we could have a template or examples of how to do this (and also how to do #3) so that scientists don't have to manually do this themselves. Is this the idea and I'm misinterpreting?

beckynevin commented 1 year ago

I'm also wondering about the form of this new functionality within the SDK notebook; to me it seems like the SDK notebook is full of functions that run behind the scenes, whereas this seems like a function we might want some examples or more forward facing tutorials of (I'm thinking in the more advanced notebook tutorial). Would this be better served by a function within the utils.py file? Or because it's related to backend stuff do you want to keep it in SDK?

clareh commented 1 year ago

I think we should further discuss the utils vs SDK. My understanding is that '# 1' is automatically generated and should never be edited, so Im not sure if putting it in the utils is right?? Also, I think Sree might be best placed to answer, but I believe we need more flexibility that just adding columns. For a flip book style one, you need to group the images. I hope Sree can add the format for clarity here!

ericdrosas87 commented 1 year ago

Just a heads up, when you use the number/hashtag sign in a Github Issue it automatically links the issue matching that ID number, so perhaps wrapping #1 in backticks(`) will be more organizationally appropriate?

ericdrosas87 commented 1 year ago

I'm also wondering about the form of this new functionality within the SDK notebook; to me it seems like the SDK notebook is full of functions that run behind the scenes, whereas this seems like a function we might want some examples or more forward facing tutorials of (I'm thinking in the more advanced notebook tutorial). Would this be better served by a function within the utils.py file? Or because it's related to backend stuff do you want to keep it in SDK?

I do think that the SDK notebook is correct place for defining the class and its methods, while the non-SDK notebooks can show how to use the class and its methods with documentation markdown cells. The idea being that there should be no reason the scientists should need to modify the class directly.

"Function-ize" may not be the most semantically correct word if we're abiding by the Python nomenclature, but it sounded better than "callable-ize" or "class-ize".

ericdrosas87 commented 1 year ago

For the requirements above, in https://github.com/lsst-epo/citizen-science-notebooks/pull/2 you mention that objID and filename will need to be set by the scientist. I was hoping that we could have a template or examples of how to do this (and also how to do https://github.com/lsst-epo/citizen-science-notebooks/pull/3) so that scientists don't have to manually do this themselves. Is this the idea and I'm misinterpreting?

Actually, that gets me to thinking that there probably won't be filenames for tabular data, so perhaps that column won't be required, or will only be required if the scientist is curating image data.

Clare and I discussed how we wanted to make the manifest file creation as flexible as possible, so I can add a note about how the ManifestFile class will throw warnings or errors if required columns are missing and how to add them. But "adding the columns" would just be ensuring that the dict/array of dicts contains keys for the required columns. I see this as being example cells in a non-SDK notebooks with accompanying markdown documentation. Does that sound sufficient?

ericdrosas87 commented 1 year ago

My understanding is that https://github.com/lsst-epo/citizen-science-notebooks/pull/1 is automatically generated and should never be edited, so Im not sure if putting it in the utils is right??

Correct, the edc_ver_id is only included here for awareness, but the scientists should never need to interact with it directly. I included it here should you all see anything related to the edc_ver_id, which should never happen unless there's a bug introduced with some future enhancement, then to let me know.

clareh commented 1 year ago

For the requirements above, in #2 you mention that objID and filename will need to be set by the scientist. I was hoping that we could have a template or examples of how to do this (and also how to do #3) so that scientists don't have to manually do this themselves. Is this the idea and I'm misinterpreting?

Actually, that gets me to thinking that there probably won't be filenames for tabular data, so perhaps that column won't be required, or will only be required if the scientist is curating image data.

Clare and I discussed how we wanted to make the manifest file creation as flexible as possible, so I can add a note about how the ManifestFile class will throw warnings or errors if required columns are missing and how to add them. But "adding the columns" would just be ensuring that the dict/array of dicts contains keys for the required columns. I see this as being example cells in a non-SDK notebooks with accompanying markdown documentation. Does that sound sufficient?

I think this does sound sufficient. I will try to find any exceptions that would not fit into this frame work and let you know in the requirements doc

ericdrosas87 commented 1 year ago

I added a changelog section to the top of the A&D which describes the addition of a class method to arbitrarily write the CSV to the filesystem.

ericdrosas87 commented 1 year ago

I had a call with @clareh this morning and we ironed out a far simpler solution. Disregard the initial design on this post and I will post a new one either later today or tomorrow morning.

The summary of the new approach is:

Stay tuned for another Issue creation, after which I will close this one out. Feel free to ask any questions you may have!

ericdrosas87 commented 1 year ago

Closing this out in favor of the A&D from #34