terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data

Batch processing for s4 height, related to the TERRA REF height paper #385

Open ZongyangLi opened 6 years ago

ZongyangLi commented 6 years ago

@JeffWhiteAZ and @rmgarnett are leading an initial TERRA REF height estimation paper, and I am trying to provide reference data for it. The requested datasets are:

  1. Four sets of plot-level height histograms, using the middle 25%, 50%, 75%, and 100% (full plot) of points (a sketch of the histogram computation follows below).
  2. A set of sub-plot histograms using the full plot points.
  3. Sub-plot-level ground level data derived from the sub-plot histograms, using the earliest data we can find and noting the date.

@rmgarnett will use these data for a better estimate of plant height.
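
To make the first item concrete, here is a minimal sketch of how a plot-level height histogram over the middle fraction of a plot's points might be computed. The bin width, height range, and the use of an along-plot coordinate to select the middle 25/50/75% of points are assumptions for illustration, not the extractor's actual implementation.

```python
import numpy as np

def plot_height_histogram(x, z, center_fraction=0.5, bin_width_cm=1.0, max_height_cm=500.0):
    """Histogram of point heights for the middle fraction of a plot.

    x : along-plot coordinate of each point (any consistent units)
    z : height of each point, in cm
    center_fraction : 0.25, 0.5, 0.75, or 1.0 for the full plot
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    # Keep only points whose along-plot position falls in the middle fraction.
    lo, hi = np.quantile(x, [(1 - center_fraction) / 2, (1 + center_fraction) / 2])
    middle = z[(x >= lo) & (x <= hi)]
    # Fixed-width height bins shared across plots so histograms stay comparable.
    bins = np.arange(0.0, max_height_cm + bin_width_cm, bin_width_cm)
    counts, _ = np.histogram(middle, bins=bins)
    return counts, bins

# Example: the four requested histograms for one plot, from fake data.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=10_000)             # position along the plot
z = rng.gamma(shape=2.0, scale=40.0, size=10_000)  # fake heights in cm
for frac in (0.25, 0.5, 0.75, 1.0):
    counts, _ = plot_height_histogram(x, z, center_fraction=frac)
    print(frac, counts.sum())
```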

Since ROGER is closed to us, I need another way to do this kind of one-off analysis. @dlebauer @craig-willis

dlebauer commented 6 years ago

@ZongyangLi there are a few options:

The fastest way to process these data would be to use Globus to transfer them to Comet and run the pipeline as a batch job there.

But this also brings up the question: how will these outputs be different from what is in the beta release? Is there anything about the data in the beta release that makes it invalid? If the code is ready, it may also be an option to deploy the new version and reprocess all of the data.

Ideally - but perhaps not necessarily - we should use the publicly available data in our publications.

ZongyangLi commented 6 years ago

@dlebauer I didn't realize that these outputs would be added to our database. To my understanding, this is an initial dataset intended for further investigation. What I expected is a file system like ROGER, where we can easily access all the data and save outputs back to the file system.

If we would like to deploy the new version, it should be saved as a different 'method' from 'canopy_height'. That's an option. My questions are:

  1. How soon can we get the data we want?
  2. How many tasks can we process at the same time on Nebula with two nodes?
  3. How long would it take to finish an existing extractor, for example flir_full_field and canopy_cover?

I can paste the details of @rmgarnett's proposal here if necessary; they were already in an email CC'ed to a lot of people.

rmgarnett commented 6 years ago

Indeed this is an initial investigation into a methods paper led by @JeffWhiteAZ and @NewcombMaria. It's not yet clear what the conclusion will be.

We can include the data in the database as well. One big question that is not yet resolved is ground subtraction and how it should be handled.
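
For illustration only, one simple way ground subtraction could be handled is a per-subplot ground level taken as a low percentile of an early-season scan; the percentile value and the early-scan assumption below are placeholders, not a decision.

```python
import numpy as np

def estimate_ground_level(z_early, percentile=1.0):
    """Estimate a subplot's ground level as a low percentile of heights
    from the earliest available scan, when most returns are bare soil."""
    return float(np.percentile(z_early, percentile))

def subtract_ground(z, ground_level):
    """Convert raw heights to heights above the estimated ground, clipped at zero."""
    return np.clip(np.asarray(z, dtype=float) - ground_level, 0.0, None)

# Example usage with fake data:
rng = np.random.default_rng(1)
z_early = rng.normal(loc=2.0, scale=0.5, size=5_000)   # early scan, mostly soil
z_later = z_early + rng.gamma(2.0, 20.0, size=5_000)   # later scan with canopy
ground = estimate_ground_level(z_early)
heights = subtract_ground(z_later, ground)
```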

dlebauer commented 6 years ago

@rmgarnett should we figure out the ground subtraction before doing a full season of reprocessing?

@ZongyangLi To answer 1) data should be available now on workbench. Can you work out the method there? For processing the full season the data should be available on Globus by Friday for transfer to Comet. @craig-willis can answer 2 and 3.

When we reprocess the data, we would first purge the old canopy height and replace it with the new data. We could either create a new method or update the old one.

We have a few scenarios in mind: 1) use workbench for development, 2) use nebula for real-time processing of the data stream, as well as reprocessing if it can handle the load, and 3) move data to a cluster for reprocessing. Now that ROGER has reached end of life, we do not have a cluster mounted to our file system.

rmgarnett commented 6 years ago

@dlebauer If we're processing the season anyway there is no additional cost to computing the requested ground statistics.

craig-willis commented 6 years ago

Late update to this ticket. As discussed during the team meeting last week, we're still working out how to support this type of processing after the loss of ROGER. In the short term, the most reliable approach would be to transfer the desired data to an XSEDE resource via Globus. David has a startup allocation on XSEDE Comet that can be used. You'll just need to create an XSEDE account (https://portal.xsede.org/) and send David the username to add to the allocation. Comet uses the SLURM scheduler (as opposed to ROGER's PBS), but they're quite similar.
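
For anyone new to SLURM, a rough sketch of a driver that submits one job per scan day once the data are on Comet is below. The partition name, resource limits, module name, virtualenv path, and the extract_height.py entry point are all placeholders, not the actual allocation settings or pipeline code.

```python
import subprocess
from pathlib import Path

# Placeholder SLURM job template; adjust partition, resources, and paths for Comet.
TEMPLATE = """#!/bin/bash
#SBATCH --job-name=height_{day}
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --time=04:00:00

module load python
source $HOME/venvs/terraref/bin/activate
python extract_height.py --input {in_dir} --output {out_dir}
"""

def submit_day(day, in_dir, out_dir):
    """Write a batch script for one scan day and hand it to sbatch."""
    script_path = Path(f"height_{day}.sbatch")
    script_path.write_text(TEMPLATE.format(day=day, in_dir=in_dir, out_dir=out_dir))
    subprocess.run(["sbatch", str(script_path)], check=True)  # sbatch prints the job id

scratch = Path("/oasis/scratch")  # placeholder scratch location on Comet
for day in ["2017-06-01", "2017-06-02"]:
    submit_day(day, scratch / day, scratch / "output" / day)
```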

ZongyangLi commented 6 years ago

David helped me set up a startup allocation. I am now transferring data to the 'SDSC Data Oasis' endpoint on Globus. I found a directory named after my username there, and I was able to create directories under 'temp_project'. After the transfer finishes, I will start running a job through Comet.

ZongyangLi commented 6 years ago

@craig-willis Do you have any idea how we can install or find all the Python dependencies on Comet, such as Pillow, OpenCV, lmfit, utm, and so on?

dlebauer commented 6 years ago

@ZongyangLi a few options:

  1. Try virtualenv (e.g. pip install --user virtualenv) to install the modules locally. Does this get you what you need?
  2. Open a ticket at XSEDE.org to ask for their recommendation:
    • install the appropriate modules, or ask how to install / use OpenCV
    • ask about pip install virtualenv
    • link to the Docker container that defines the dependencies
  3. I've applied for ECSS support, which can help us look at the overall workflow.
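
If it helps to see what Comet's system Python already provides before going the virtualenv or XSEDE-ticket route, a quick check along these lines can be run on a login node. The list just mirrors the packages named above; cv2 is OpenCV's import name and PIL is Pillow's.

```python
import importlib

# Import name -> package name for the dependencies mentioned above.
REQUIRED = {"PIL": "Pillow", "cv2": "OpenCV", "lmfit": "lmfit", "utm": "utm"}

missing = []
for module_name, package in REQUIRED.items():
    try:
        importlib.import_module(module_name)
    except ImportError:
        missing.append(package)

if missing:
    print("Missing (candidates for a virtualenv install):", ", ".join(missing))
else:
    print("All listed packages are importable.")
```
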
dlebauer commented 6 years ago

@ZongyangLi could you also add Craig Willis as a user on your Comet allocation?

ZongyangLi commented 6 years ago

Sure. @craig-willis could you send me your username on XSEDE?

ZongyangLi commented 6 years ago

@dlebauer @craig-willis The transfer from the Globus terraref endpoint to Comet scratch stopped when it reached 1 TB, which matches my Data Oasis limit.

ZongyangLi commented 6 years ago

Let me try it again. I started two transfers and they both stopped after transferring about 500 GB of data. Maybe there is a 500 GB per-transfer limit.

dlebauer commented 6 years ago

Perhaps try opening a ticket with Comet support?
