lsst-epo / citizen-science-notebooks

A collection of Jupyter notebooks that can be used to associate Rubin Science Platform data with a Zooniverse citizen science project.

How to send large data in batches? #68

Open jsv1206 opened 11 months ago

jsv1206 commented 11 months ago

Right now, the code runs the query once and sends all the images from the query in one subject set.

If one query returns, say, 10,000 images, the code should probably split them into batches of, say, 100 and send each batch as a separate subject set. This could likely be done in a loop within the existing script.
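The batching step suggested above can be sketched with a small helper; the loop below is a hypothetical illustration (`query_results` and `send_subject_set` are assumed names, not code from this repo):

```python
def batched(items, batch_size=100):
    """Yield successive slices of `items` no longer than `batch_size`.

    Each slice would become its own Zooniverse subject set, instead of
    sending all images from one query as a single subject set.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage inside the existing notebook loop:
# for i, batch in enumerate(batched(query_results, 100)):
#     send_subject_set(batch, name=f"batch-{i:03d}")  # assumed helper
```

A generator keeps memory use flat even for a 10,000-image query, since only one batch is materialized at a time.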

The other issue/bug is that we get an error about having an active branch in the Zooniverse project, which prevents us from sending a new subject set. We also need to test whether this works when we want to send large data in batches as different subject sets.

ericdrosas87 commented 11 months ago

Both the 10,000 limit on a single batch and the check that prevents sending a new batch while the user already has an active subject set are intentional, based on Zooniverse's desired UX policy (from past conversations with Chris). If that has changed, @clareh, please confirm with Zooniverse that they are okay with the pipeline allowing this kind of functionality, and I can modify the pipeline code accordingly.

clareh commented 11 months ago

I believe the 10,000 limit is intended to keep PIs engaged in the project (rather than a PI simply sending one huge dataset and then not keeping an eye on the project). This limit can be changed on the Zooniverse side if a PI asks. We should ensure that nothing on our side prevents a larger dataset from being sent (if there is a reason to do so). However, I think this is separate from Sree's suggestion? I think it would definitely be worth supporting the option of sending "smaller chunks" of data to one subject set. I will discuss the "one subject set only" rule with Chris a bit further, but I do think it needs to be relaxed for several reasons:

ericdrosas87 commented 11 months ago

> We should ensure that there is nothing on our side that prevents a larger dataset from being sent (if there is a reason to do so).

I have a boolean column in the database called `excess_data_exception` that is meant to be used in just this way. There would need to be a small dev change on our side to the RSP Data Exporter service to make use of it, though.

> However, I think this is separate from Sree's suggestion? I think it would definitely be worth supporting the option of sending "smaller chunks" of data to one subject set.

That functionality doesn't exist on the Zooniverse platform, or does it? I agree it sounds useful, but I think it would require development work on their side.

jsv1206 commented 11 months ago

I think it would be great to have all the data under one "project" in Zooniverse, even if there are multiple subject sets within that project. The data can then be shown to the public by selecting all the subject sets under the existing project.

I was also considering scenarios where the PI runs the data query from RSP once and gets all the data in one go. The current tutorial notebook doesn't have functionality to split that data into chunks (for different subject sets) to send to Zooniverse (ideally into one project). The PI can decide how much data to send in each subject set (currently capped at 10,000), unless they want to send more.

@clareh I don't think we need to send data to one existing subject set, right? Especially since that functionality doesn't exist on the Zooniverse side. We can probably work with multiple subject sets.
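For what it's worth, multiple subject sets under one project is something the panoptes-client library already supports. The sketch below is not this repo's pipeline code, just one way it could look; the project id, credentials, image paths, and function names are all placeholders:

```python
def subject_set_name(base: str, index: int) -> str:
    """Deterministic display name for the i-th batch, e.g. 'rubin-data-batch-003'."""
    return f"{base}-batch-{index:03d}"


def upload_batches(project_id, username, password, batches, base_name="rubin-data"):
    """Create one subject set per batch, all linked to the same project."""
    # Imported inside the function so the sketch reads without panoptes-client installed.
    from panoptes_client import Panoptes, Project, Subject, SubjectSet

    Panoptes.connect(username=username, password=password)
    project = Project.find(project_id)

    for i, batch in enumerate(batches):
        subject_set = SubjectSet()
        subject_set.links.project = project
        subject_set.display_name = subject_set_name(base_name, i)
        subject_set.save()

        subjects = []
        for image_path in batch:
            subject = Subject()
            subject.links.project = project
            subject.add_location(image_path)
            subject.save()
            subjects.append(subject)

        # Link the whole batch to its subject set in one call.
        subject_set.add(subjects)
```

Whether the "one active subject set" restriction on the Zooniverse side would reject the second `subject_set.save()` here is exactly what would need testing.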