lsst-epo / citizen-science-notebooks

A collection of Jupyter notebooks that can be used to associate Rubin Science Platform data with a Zooniverse citizen science project.

Citizen science notebook creates ~Mb size .zip files in the directory #66

Open beckynevin opened 11 months ago

beckynevin commented 11 months ago

One of the tenets of the Rubin notebooks (https://rtn-045.lsst.io/) is to delete files after you create them. I wonder if we need to incorporate a cell into this notebook that finds and deletes these .zip files. @jsv1206 and I recommend adding a 'cleanup' cell that calls an external function to find and remove the .zip files after we send the data (right before the retrieve cell).

I've noticed after many runs of this notebook, I have quite a few .zip files hanging out that are cluttering things up. I'd imagine that if a bunch of users are all running this notebook multiple times it could unnecessarily take space on the cloud.

Let us know if you have strong preferences about how to deal with this @bnord @ericdrosas87 @clareh

ericdrosas87 commented 11 months ago

@beckynevin are you referring to section 7.5? If so, it's unclear to me whether that section refers to in-memory objects or filesystem objects (files). I'm unsure because of the phrase "memory usage" and the reference to Python's del statement, which unbinds names and removes items such as dict keys rather than files on disk.
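To make the in-memory versus filesystem distinction concrete, here is a minimal sketch using only the Python standard library. The variable names and the file name example.zip are hypothetical and are not taken from the notebook or from RTN-045.

```python
import os
import tempfile

# In-memory: `del` unbinds a name. The object can be freed once nothing
# else refers to it.
cutouts = [0] * 1000
del cutouts
name_is_gone = False
try:
    cutouts
except NameError:
    name_is_gone = True

# `del` can also remove a key from a dict.
manifest = {"a.zip": 1, "b.zip": 2}
del manifest["a.zip"]

# Filesystem: deleting the variable that holds a path does not touch the file.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "example.zip")  # hypothetical file name
    with open(zip_path, "wb") as f:
        f.write(b"placeholder")
    saved_path = zip_path
    del zip_path                                  # only the name is removed
    still_on_disk = os.path.exists(saved_path)
    os.remove(saved_path)                         # this is what deletes the file
    removed_from_disk = not os.path.exists(saved_path)
```

So a del-based cleanup addresses notebook memory, while removing leftover .zip files needs an explicit filesystem call such as os.remove.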

If RTN-045 is referring to files in the filesystem then I am curious if there is a distinction between the RSP Notebook user's home directory's space usage and a shared directory's space usage.

beckynevin commented 11 months ago

I am also talking about section 3.4. Having looked through some of the other tutorial notebooks, I see that they use del like so:

fig = plt.figure(figsize=(6, 6))

xvals = [calexp_corners_ra[0], calexp_corners_ra[1], calexp_corners_ra[2], \
         calexp_corners_ra[3], calexp_corners_ra[0]]
yvals = [calexp_corners_dec[0], calexp_corners_dec[1], calexp_corners_dec[2], \
         calexp_corners_dec[3], calexp_corners_dec[0]]
plt.plot(xvals, yvals, ls='solid', color='grey', label='visit detector')
del xvals, yvals

for r, ref in enumerate(set(registry.queryDatasets("deepCoadd", dataId=dataId))):
    deepCoadd_dataId = ref.dataId
    str_tract_patch = '(' + str(ref.dataId['tract']) + ', ' + str(ref.dataId['patch'])+')'
    deepCoadd_wcs = butler.get('deepCoadd.wcs', dataId=deepCoadd_dataId)
    deepCoadd_bbox = butler.get('deepCoadd.bbox', dataId=deepCoadd_dataId)
    deepCoadd_corners_ra, deepCoadd_corners_dec = get_corners_radec(deepCoadd_wcs, deepCoadd_bbox)
    xvals = [deepCoadd_corners_ra[0], deepCoadd_corners_ra[1], deepCoadd_corners_ra[2], \
             deepCoadd_corners_ra[3], deepCoadd_corners_ra[0]]
    yvals = [deepCoadd_corners_dec[0], deepCoadd_corners_dec[1], deepCoadd_corners_dec[2], \
             deepCoadd_corners_dec[3], deepCoadd_corners_dec[0]]
    plt.plot(xvals, yvals, ls='solid', lw=1, label=str_tract_patch)
    del xvals, yvals
    del deepCoadd_dataId, deepCoadd_wcs, deepCoadd_bbox
    del deepCoadd_corners_ra, deepCoadd_corners_dec

plt.xlabel('RA')
plt.ylabel('Dec')
plt.legend(loc='upper left', ncol=3)
plt.show()

beckynevin commented 11 months ago

I guess what I'm proposing might be a separate thing entirely because I'm proposing something within the notebook that will delete the .zip files in the main directory.

ericdrosas87 commented 11 months ago

> I guess what I'm proposing might be a separate thing entirely because I'm proposing something within the notebook that will delete the .zip files in the main directory.

We should probably be mindful of both the disk space taken up by files that the citSci notebooks create and the memory usage within the notebook itself.

Notebook memory - because the project is charged on a CPU-usage basis, we should be mindful of the costs associated with errant memory usage.

Filesystem space usage - because Data Management has a strong preference for Notebook users not using their home directory as long-term storage.
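As a rough way to see how much space the leftover archives take, the sketch below counts the .zip files directly inside a directory and sums their sizes. The function name zip_usage is hypothetical, and the choice to look only at the top level of the directory (no recursion) is an assumption.

```python
from pathlib import Path


def zip_usage(directory):
    """Return (count, total_bytes) for .zip files directly inside `directory`."""
    zips = [p for p in Path(directory).glob("*.zip") if p.is_file()]
    return len(zips), sum(p.stat().st_size for p in zips)
```

For example, zip_usage(Path.home()) would report on the top level of the user's home directory.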

However, that latter point is in tension with the idea that's been discussed of curating a large amount of data once and sending multiple batches from it to Zooniverse over time. I think DM would be amenable to an exception for citSci users storing data in their home directory for long periods of time, should we decide to pursue that strategy.

beckynevin commented 11 months ago

Okay, so to your last point, the DM team might be okay with making an exception for citSci users. This makes sense to me for the cutout/ folder, which will hold a bunch of cutouts, but what about creating a utility that deletes all the extra .zip files after the data has been sent? Is this standard operation for Rubin, or do we just rely on users to delete the leftover .zip files themselves? Here's a screenshot of what I'm talking about: [screenshot]

ericdrosas87 commented 11 months ago

> but what about creating a utility that deletes all the extra .zip files after the data has been sent? Is this standard operation for Rubin, or do we just rely on users to delete the leftover .zip files themselves?

It's certainly possible to have a utility function look for .zip files in the user's home directory and delete them. I would say that's a fairly small level of effort, but I don't actually know whether it's standard operation to do so programmatically.
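A minimal sketch of such a utility might look like the following. The function name remove_zip_files and the dry_run flag are hypothetical, and this version assumes the archives sit at the top level of the target directory and that every .zip file there is safe to delete, which may not hold if users keep other archives in the same place.

```python
from pathlib import Path


def remove_zip_files(directory, dry_run=True):
    """Find .zip files directly inside `directory` and optionally delete them.

    Returns the sorted list of matching paths. With dry_run=True (the
    default) nothing is deleted, so the list can be reviewed first.
    """
    found = sorted(p for p in Path(directory).glob("*.zip") if p.is_file())
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

A cleanup cell could call remove_zip_files(some_dir) to preview what would be removed, then call it again with dry_run=False once the data has been sent. Defaulting to a dry run is a deliberate choice here, since deleting files from a user's directory is hard to undo.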