Audit report error on variable_stars branch

beckynevin commented 8 months ago

Describe the bug We're getting an error with the final cell of variable stars notebook, when the flipbook is sent to Zooniverse. The flipbook still appears on Zooniverse, but we get an audit error.

To Reproduce Steps to reproduce the behavior, written in imperative mood:

Go to variable_stars branch
Run all cells of variable_stars_becky.ipynb notebook
The final cell will trigger the error: cutout_dir = batch_dir+"images/" subject_set_name = "test_flipbook" cit_sci_pipeline.send_image_data(subject_set_name, cutout_dir)
See error

Expected behavior Run step 5 without audit error ('5. Notifying the Rubin EPO Data Center of the new data, which will finish processing of the data and notify Zooniverse')

Actual behavior

'1. Checking batch status' '2. Zipping up all the astro cutouts - this can take a few minutes with large data sets, but unlikely more than 10 minutes.' '3. Uploading the citizen science data' '4. Creating a new Zooniverse subject set' '5. Notifying the Rubin EPO Data Center of the new data, which will finish processing of the data and notify Zooniverse' '** Additional information:' root WARNING: Some audit records were not inserted! '6. Sending the manifest URL to Zooniverse' '** Information: subject_set.id: 117463; manifest: https://storage.googleapis.com/citizen-science-data-public/a1d9966d-8917-45f5-a316-836a173b1f2c/manifest.csv' '7. Transfer process complete, but further processing is required on the Zooniverse platform and you will receive an email at beckynevin@gmail.com'

ericdrosas87 commented 8 months ago

The audit report works by tracking object IDs for image data already sent to the Zooniverse via the citSci data pipeline and then comparing the data actively being sent to Zooniverse to see if there are any matches.

With a non-flipbook manifest file, as in the case of the Citizen_Science_Testing notebook, ensuring that each manifest file row represents a single image and its associated object ID is a straightforward process:

Screen Shot 2023-11-10 at 11 42 46 AM

With a flipbook manifest file there are multiple images per row represented as multiple columns (comma separated values), and the object IDs of each image need to be added as columns as well. For example:

Screen Shot 2023-11-10 at 11 43 39 AM

As-is, the flipbook manifest file is created by adding columns for image filenames with the following column naming convention:

location:image_N where N is a number that increments from 0 to N number of images.

My proposal is to following the convention illustrated in the second screenshot above with the following column naming convention:

objectId:image_N where image_N is the image column (as noted directly above) that the object ID is associated with.

Currently, the functionality for creating the manifest file is within the variable_stars notebook and will need to move to utils.py. This new function could be named make_manifest_with_flipbook_images to follow the convention of the two existing manifest creation functions already in utils.py.

The change required would be something like this(see: ## <-- CODE CHANGE HERE):

To summarize, the changes needed are:

Migrate existing manifest file creation logic from the variable_stars notebook to utils.py
Add the code to get the objectID from the sorted_sources object and append as a new column accompanying each image
Test to ensure the manifest file is created successfully

I will still to need to make the code changes to the backend service so that it can handle this format of manifest file, so if this sounds good with you @beckynevin let me know and I can plan that work.

And of course if you have any questions feel free to let me know!

ericdrosas87 commented 8 months ago

I'll be starting the work on my side today: modifying the backend service to handle the new flipbook manifest format

beckynevin commented 8 months ago

Hi @ericdrosas87 for number 1 can you point to the section of code in this notebook that you're referring to?

I can start working on number 2 - previously I have been avoiding dealing with objectID because the forced source tables I believe only have visit IDs, diaObjectIds, and ccdIDs. This could explain why we're now getting the audit error because previously the code ran an additional query to get object ID.

Correct me if I'm wrong, it sounds like the Zooniverse requires object ID to do this comparison with stuff that has already been sent?

ericdrosas87 commented 8 months ago

can you point to the section of code in this notebook that you're referring to?

Sure it's the cell that starts with the following code:

# main directory
batch_dir = './variable_stars_output/'

previously the code ran an additional query to get object ID.

You removed the code because it was unnecessary at the time or performed poorly/slowly?

Correct me if I'm wrong, it sounds like the Zooniverse requires object ID to do this comparison with stuff that has already been sent?

Zooniverse just requires some kind of unique ID, but it doesn't necessarily need to be the object ID. But that is a great question for @eatyourgreens

eatyourgreens commented 8 months ago

I'm really sorry. I don't work for Zooniverse any more, but I'll ping them to take a look at this.

ericdrosas87 commented 8 months ago

I'm really sorry. I don't work for Zooniverse any more, but I'll ping them to take a look at this.

Ah thank you for doing that and all your help! And I wish you well on the next steps of your career!

trouille commented 8 months ago

Thanks Jim for pinging in the Zooniverse channel. Just wanted to post a quick follow up here that we've got this thread on our radar and will follow up next week if/as needed.

It looks like from the thread you already have the answer you need (i.e., Eric's response: Zooniverse just requires some kind of unique ID, but it doesn't necessarily need to be the object ID. ), but post here if there are additional questions or if you need confirmation of that response.

ericdrosas87 commented 8 months ago

Thank you for confirming @trouille!

ericdrosas87 commented 8 months ago

@beckynevin I've made all changes I can to the backend pipeline service without the object ID query notebook changes that are forthcoming on your side. Please let me know when the notebook has been updated and I can pick this work back up!

beckynevin commented 8 months ago

Our last citsci meeting made me wonder if it's truly necessary to send object ID, which makes me wonder if I should be modifying the notebook to query object ID or just stick to sending the current diaobjectID.

One reason I'm wondering about this is that in our last meeting we were discussing that this might not be a thing that's unique for linking objects between data releases.

beckynevin commented 8 months ago

Following up on the above, it might actually be impossible to associate diaObjectIDs with ObjectIDs:

There is no direct DIASource-to-Object match: in general, a time-domain object is not necessarily the same astrophysical object as a static-sky object, even if the two are positionally coincident (eg. an asteroid overlapping a galaxy). Therefore, adopted data model emphasizes that having a DIASource be positionally coincident with an Object does not imply it is physically related to it. Absent other information, the least presumptuous data model relationship is one of positional association, not physical identity.

So now I'm thing that if we're dealing with any of the prompt-processing based catalog, we won't be able to use object ID to tag stuff being sent to Zooniverse.

beckynevin commented 8 months ago

@ericdrosas87 I have one follow-up question about your above post, and I think this is related to trying to gain an understanding of how the audit report works.

Your proposal for how to change the manifest is the following:

My proposal is to following the convention illustrated in the second screenshot above with the following column naming convention: objectId:image_N where image_N is the image column (as noted directly above) that the object ID is associated with.

Are we getting the audit report error because we're using diaObjectID? In other words, would we still get this error even if we were not sending multiple images for each object? This is related to the details of how the IDs are being cross-matched by the audit reporter -

Or do you think we're getting the error because we currently have the manifest saving the entire png name of the image instead of just the ID, which is the wrong format for it to do it's cross-matching?

ericdrosas87 commented 8 months ago

The audit report is failing because the backend service (rsp-data-exporter) does not know how to handle the variable stars format of manifest file. It expects one image and one object ID per row in the manifest file.

As-is, the backend service will only look for a column in the manifest file that exactly matches the name objectId - or in the case of a variable stars notebook, it would need to be in the format I proposed: objectId:image_1, objectId:image_2, etc.

I have a modified version of the backend service running in a separate environment that I can test out once the changes are made to the variable star notebook.

ericdrosas87 commented 8 months ago

So, as far as the ordering of things go:

I implemented the handling of the variable stars format of manifest file to the development version of the backend service
@beckynevin makes the changes to the variable star notebook's manifest file creation function such that it matches the above proposed format (or some other agreed upon format)
I will then test out the variable star notebook with the development version of the backend service and do any necessary iterative development on it
I will deploy the development version of the backend service to the shared environment that everyone can access for further testing and usage

beckynevin commented 8 months ago

Thanks @ericdrosas87 for clarifying - I have made and pushed the changes to the variable stars notebook branch, you should be good to go to modify diaObjectId to objectId on the fly for your testing.

beckynevin commented 5 months ago

@ericdrosas87 just checking in on this issue. I'm hoping to submit a PR for this branch sometime in the next month and want to get all of these features / issues squared away.

lsst-epo / citizen-science-notebooks

Audit report error on variable_stars branch #83