Data archiving strategy

max-zilla commented 5 years ago

We are nearing our full 1 PB allocation on storage condo. With full-field mosaics running on the RGB plant mask & NRMAC data in addition to what we had before, plus new plot-level imagery products, plus ongoing data capture including VNIR, we won't make it through March without taking steps.

https://opensource.ncsa.illinois.edu/confluence/display/CATS/Clowder+Data+Archiving+Support

I created a Clowder wiki page to handle the Clowder UI side of archiving things - so that we can still include metadata and references to e.g Season 1 data even if the raw files are moved elsewhere.

We need to establish and execute archiving strategy. In general, we have talked about:

focus on raw_data first - if we can retain Level_1+ products, that is sufficient for most pipelines unless fundamental problems with a Level_1 extractor is discovered.
focus on older data, e.g. 2016 - Season 1 and 2 data are of less interest than 4 or 6.
focus on Winter Wheat over sorghum? Hopefully this is less necessary once first two steps are addressed.

I'm working on generating some rough file counts by season and product that will help inform this process.

Season 1

[x] VNIR/SWIR
[x] co2Sensor
[x] cropCircle
[x] EnvironmentLogger
[x] flirIrCamera
[x] lightning
[x] ndviSensor
[x] priSensor
[x] weather
[x] ps2Top
[x] scanner3DTop

Season 2

[x] co2Sensor
[x] cropCircle
[x] EnvironmentLogger
[x] flirIrCamera
[x] lightning
[x] ndviSensor
[x] priSensor
[x] weather
[x] ps2Top
[x] scanner3DTop
[x] stereoTop
[x] SWIR
[x] VNIR 14/17TB complete

max-zilla commented 5 years ago

As discussed in the google doc (https://docs.google.com/document/d/1V4zL4W6JwFJQCYJ40d1mUpnlqLqXjoCA6_z7ozJxMUc/edit)

we will start with raw_data for Seasons 1-3 & 5, then Level_1 after
roughly 500 GB target size for each tar chunk

raw data:

Level1+:

The date boundaries for seasons:

             Season 1 <= 2016-07-14
2016-07-14 < Season 2 <= 2016-12-02
2017-04-05 < Season 3 <= 2017-04-05
2017-09-18 < Season 5 <= 2018-03-31

so e.g. for ps2Top Season 1, just start archiving .../sites/ua-mac/raw_data/ps2Top/2016-02-13 and continue until 500 GB, then start the second chunk until you reach 2016-07-14. The sizes aren't perfectly distributed across dates.

srstevens commented 5 years ago

Just a heads up on the archiving . I hadn't seen how far along this was. JD has been archiving everything in daily tar files (so a wrap up of all sensors for a given day) up to this point. These are all already transferred to Tape. Do we want to repackage everything and delete the old tape archive data as the new archive files are rolled in?

Also as far as checksums, I think it would make even more sense if we could generate those in the pipline before or as files are moved around. Would this be something that could be built into the pipeline? Basically the earlier we generate them, the cheaper they should be overall.

robkooper commented 5 years ago

If possible I'd like to keep these separate, one is a raw backup dump that is not ordered except for daily. The other is an archive per sensor per season so this is easier to retrieve when we need to reprocess a whole season (which most likely we will do per sensor).

If we start to run out of tape space, we can start to remove the daily dump files.

srstevens commented 5 years ago

S1 raw_data: co2Sensor, cropCircle, EnvironmentLogger, flirIrCamera, lightning, ndviSensor, priSensor, weather, ps2Top, scanner3DTop. Have all been packaged and moved to the archive. Some still have checksums in progress. As the checksums complete, the raw data will be removed.

The remaining chunks are being generated and transferred (1 in transit while the next is generating). These will run continuously until S1 is done, then S2 will begin.

max-zilla commented 5 years ago

From Sean on slack:

S1 raw_data for SWIR and VNIR are left. VNIR is ~2/3 done, then SWIR. stereoTop was the last to get packaged and moved. I've got that in the queue to be removed from the condo today and then I will also start removing the directories from VNIR that have been archived I expect S1 raw data to be done by Friday

srstevens commented 5 years ago

There are some gaps in between seasons where some of the sensors still have directories and data. Do we want this archived in any way?

dlebauer commented 5 years ago

Yes, these should be archived but not necessarily published - there are sensor tests and baseline bare soil measurements that may be of use (although there are also bare soil measurements pre emergence). So, these are lower priority but should be kept at least until they can be reviewed.

David LeBauer Director of Data Sciences Arizona Experiment Station THE UNIVERSITY OF ARIZONA

Bioscience Research Labs, 207 1230 N Cherry Ave | Tucson, AZ 85721 Office: 520-621-4381 dlebauer@email.arizona.edu

(sent from my phone - please pardon brevity and typos)

From: srstevens notifications@github.com Sent: Thursday, April 11, 2019 10:47 AM To: terraref/reference-data Cc: Subscribed Subject: Re: [terraref/reference-data] Data archiving strategy (#265)

There are some gaps in between seasons where some of the sensors still have directories and data. Do we want this archived in any way?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/terraref/reference-data/issues/265#issuecomment-482145525, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAcX5xS9d7I1t6RcfMlwkNerDNMGZOz7ks5vf0r0gaJpZM4bBOQv.

max-zilla commented 5 years ago

Can this be closed now?

max-zilla commented 5 years ago

Nevermind, will leave open while the checkboxes are being updated.

max-zilla commented 5 years ago

@srstevens is the archiving listed above complete? If so, we need to decide what to archive next - only 14TB free on storage condo right now.

max-zilla commented 5 years ago

we should find out who is using winter wheat data and potentially archive - in meantime, we start with next season of raw_data and continue archiving.

terraref / reference-data

Data archiving strategy #265