Complete Gantry -> Globus pipeline

max-zilla commented 8 years ago

@jdmaloney and I met today to discuss the remaining pieces of the puzzle for getting data from gantry -> globus -> NCSA. We were talking about who should do what and I wanted to start by writing things up. There are two main components:

1 MAC transfer initiator
This will track files coming from the gantry. It should monitor the place on disk where files are being sent, and periodically initiate globus transfers to NCSA with a cron job. This will need some lightweight status tracking to know if a file has been sent to globus yet, whether notification was sent to the NCSA API, etc. so that the same file isn't sent twice and files can be moved or archived after transmission.

When files are ready, this will initiate a globus transfer. This generates a globus ID, which is then sent to the NCSA API along with the initiating globus username for monitoring.

2 NCSA transfer monitor API
This piece will track files coming from globus. There is an API component that allows external users to notify when a new Globus transfer has started. It will monitor the place on disk where globus is writing files, and periodically check with the globus API for transfer status of known IDs. When a transfer is complete, this will notify clowder to index the file and write details to a log file.

The API has a config file where we list the trusted globus users who can submit jobs to it. Those users will send a globus task ID and their username to the API, which adds that task to the list of tasks that are checked each 1-2 seconds for status. If globus status is FAILED, the file is not indexed by clowder but results are still written to the log.

- - - - - - - - - - -

2 is nearly ready. JD and I mostly discussed what 1 would look like:

identifying which files are completely transferred from the gantry and ready to go
way to track at what stage in prep are files coming off the gantry, e.g. a lightweight database/CSV/JSON file tracking whether upload to globus has begun, etc.
Max can write python code to initiate upload to Globus and notify 2's API that transfer has begun.
What else needs to be present?

I'm prepared to help JD with code to smooth out his end of the transfer, but wanted to create this issue so @robkooper and others could weigh in.

max-zilla commented 8 years ago

Thought about this last night, and I think we might be able to use a good chunk of my code for 2 with some modifications to address 1 as well.

2 has methods for tracking the status of file transfers including flags (i.e. IN-PROGRESS, FAILED, SUCCEEDED) and writing those to disk in a logfile every second. A similar process could monitor the directory structure where the gantry files are being stored, use JSON files to store file status along the workflow, etc. It also has hooks to talk to the Globus API as needed - just need a way to submit new tasks. These JSON files would serve the purpose of the tracking database mentioned above.

Remaining question is how we know when files in the gantry directory are ready to go to globus. @jdmaloney mentioned some length of time - say 15 minutes - when, if the file has existed for that long, we might assume the transfer is complete (i.e. from the USB hard drive coming off the field). Not sure if something like inotify could be more graceful in recognizing when that transfer is complete.

robkooper commented 8 years ago

We need to know when the file transfer is finished. We can probably use IN_CLOSE_WRITE of inotify, but the problem is that inotify does not do recursive checks. I ran into this problem with crashplan where it needs to put an inotify on each folder and each file to know when a file is modified, or a new folder/file is added. The result is that the sytem ran out of inotify handles.

One other options is to use a simple find find /home/gantry/data -mmin +15 -type f -print to find all files modified more than 15 minutes ago. Check to see if file is already in you json file, if not then transfer, add the transfer info to your json file (and write to disk). Next check with 2 to see if any transfers are done, for those that are done, move the files to /home/gantry/deleteme.

dlebauer commented 8 years ago

Just to clarify, is (2) being implemented in #62?

robkooper commented 8 years ago

yes

max-zilla commented 8 years ago

I can give an update this week, but I have some big chunks of this done. the code is in my branch here: https://github.com/terraref/computing-pipeline/tree/globus-monitor-api/scripts/globusmonitor

1 MAC transfer initiator gantry_monitor_service.py, config_gantry.json

I have a service that will monitor a folder on disk for new files.
When new files are ready, this will add them to the pending queue. (haven't written logic checks for determining readiness yet - we can talk about how this should work)
Pending queue is used to start a globus task. Right now, all files in queue are sent in one task. I put a TODO to discuss how this should be handled, especially re: metadata. This will depend on how files from gantry are structured - my code supports metadata, but nothing about how we're going to batch things together.
Once (2) API reports transfer complete and Clowder upload done, move the files to a folder where they can be deleted.

2 NCSA Transfer Monitor API setup.sh, globus_monitor_service.py, config.json I patched up the last couple pieces of this, now supports full flow of monitoring directory -> sending Clowder notifications when files are done -> using local file upload with metadata to create Clowder entries and writing log files.

remaining

figuring out how we bundle things together and how we handle metadata. will metadata be at the file level or the dataset level?
will we be sending an entire directory of e.g. 12 files at a time, or more fragmented?
do we rely on metadata to determine creation of datasets/collections, or should this be part of gantry script in some other way? will this result in needing to check whether dataset of a given name exists, in case members of dataset are transferred across multiple globus jobs?

dlebauer commented 8 years ago

I'll answer based on the way the raw data are being written out, which was discussed in terraref/reference-data#2

figuring out how we bundle things together and how we handle metadata. will metadata be at the file level or the dataset level?

metadata will be provided for each 'dataset'. A 'dataset' is defined as a single time point captured by a single sensor (or group of related sensors in the case of, e.g. weather data and stereo cameras). see terraref/reference-data#2

will we be sending an entire directory of e.g. 12 files at a time, or more fragmented?

Lets start with the assumption that we send at least an entire directory at a time. Or, even if we send the directory in parts, that the script would work on the whole directory once it is available. I think the idea is that the lowest level directories (under sensorname/YYYY-MM-DD/YYMMDD_HHMMSS/) are atomic.

do we rely on metadata to determine creation of datasets/collections, or should this be part of gantry script in some other way?

Datasets will be defined as above (there is one metadata.json file per dataset).

Collections and other groupings will have many rules. Lets start by creating

one collection per sensor

one collection per day

Hopefully it will be easy to define and create collections on an as-needed basis.

will this result in needing to check whether dataset of a given name exists, in case members of dataset are transferred across multiple globus jobs?

Is there a reason to split up datasets?

max-zilla commented 8 years ago

That helps, @dlebauer.

I don't think there's a reason to split up datasets unless datasets start to get extremely large which seems unlikely. The remaining question would be how a dataset is recognized as ready to transfer on the gantry side - "once the whole directory is available". One file in a ds could exist for 15+ minutes before another file has completed transfer - we don't want to process the former until the latter is also there. Sounds like for now, we won't even transfer the former until the latter is also there.

Could be recognized when:

when some hard-coded set of files are present, e.g. "2 stereo bin files, 2 VIS files, an NIR file..." and we require all of them to be present. I don't like this one.
the metadata includes some sort of register that lists all the files included - we don't transfer the ds until we have its metadata and we have every file listed in the metadata
other ideas?

dlebauer commented 8 years ago

@max-zilla how about assume folder at time t is finished when when the folder at time t + 1 is created?

robkooper commented 8 years ago

This would not work since that would leave the last folder of the day pending until the next day.

robkooper commented 8 years ago

I think we should allow for files to be added to the dataset. We can find the dataset based on some metadata we store with the dataset. This way we can upload a file and add it to a dataset that was created earlier.

max-zilla commented 8 years ago

I would propose this general process: gantry/sensorname/YYYY-MM-DD/YYMMDD_HHMMSS/file.bin

once a file is ready in gantry, look at folder path to determine sensor and timestamp (YYMMDD_HHMMSS). will be name of dataset.
queue file for transfer in globus. POST transfer details to NCSA monitor API, including dataset name the file should go into.
if we find a JSON file, assume that is metadata corresponding to dataset. don't send json file itself.
in case all we find is a json file without any new data files to go with it, i can add an endpoint to NCSA API that allows one to send metadata for a dataset without a corresponding globus task.
on NCSA side, when we get a finished globus task, loop through each file in the task. if file's dataset doesn't exist yet, create it. then add file to dataset.

This way we don't need to wait for a dataset to be complete before transferring, but we don't create a Clowder dataset twice if the component files get split across two Globus transfers.

dlebauer commented 8 years ago

That sounds good, but the met / environmental data is sent as .json files (terraref/reference-data#26). We could use a rule the file is called metadata.json (or perhaps metadata* is better)

max-zilla commented 8 years ago

Just created a pull request this afternoon with initial code. It'll still need some polish but I think it's close to ready.

max-zilla commented 8 years ago

@robkooper I want to test the services outside my local environment, but Globus Plus is required for transfer between two personal endpoints. You suggested using either Roger or Campus Cluster - I was able to go through the RSA process and get access to Roger's globus endpoint.

I want to be careful in the next step. I can SSH into Roger and git clone the computing-pipeline repo to get my script in my home dir, and create /gantry directory to act as the incoming gantry folder. Questions...

Do you have to use the batch job stuff on Roger no matter what? Is it possible to simply run my python script straight from the command line? Nothing needs to talk to the gantry script from the outside (only vice versa) so we shouldn't need to expose any ports or anything.
if that's possible, what about dependencies? The python requests library for instance. I also thought I could use docker to bundle these things, but docker is apparently not installed on Roger either. What's the proper procedure here? I looked at module avail and module display pythonlibs but there doesn't seem to be a preconfigured module that perfectly fits.

dlebauer commented 8 years ago

Did you try qsub -I for interactive node?

@jdmaloney or @yanliu-chn thoughts?

robkooper commented 8 years ago

@max-zilla use openstack to create a docker image. You should be able to use the existing coreos image we have for the toollauncher (running on nebula) or you can create a new coreos on roger and launch it there.

You can monitor the data going to roger, but you can not transfer it there using your code (unless you get the secrets from @jdmaloney for the transfer user). You can however do the transfer to campus cluster and monitor it to see if everything works as expected.

Keep in mind that roger has 3 pieces

batch/HPC
openstack
hadoop

When you login to roger using SSH you are in the batch/HPC mode, this is not where we will run the gantry scripts, this is where we run the extractors that Yan launches. If you login to openstack on roger, you can launch a VM that is all yours, this is what we want to use. We are not planning on using Hadoop.

jdmaloney commented 8 years ago

@max-zilla I can get you set up to transfer to Roger this afternoon. I think it would be good to do transfer testing on there directly as that's where we're going to be putting the data.

Speaking of data, I've written a temporary parallel transfer tool that is moving data between the gantry and here for now (we're up to 5.4TB of data, and now accumulating multiple TB a day, the gantry seems to be outputting at or close to full capacity).

I have a meeting here at 1PM, but I can try and find you after that Max

max-zilla commented 8 years ago

@jdmaloney I was actually planning on transferring from Roger -> my laptop endpoint as the test so I wouldn't need to write to roger right away, but would be good to set that up either way.

jdmaloney commented 8 years ago

@robkooper @dlebauer I just stopped by and talked to @max-zilla I found a handy bit of information this morning. The program linux uses for setting up the ftp server (vsftpd) has the built in ability to log incoming and outgoing transfers. I've had this logging on since Day 1, and it prints the full path for every transferred file in the log as well as the completion status of the transfer (either c for complete, or i for incomplete). Essentially this will allow us to scrape that log file for all lines ending in "c" and then grabbing the file path off that line to add the "to-be-transferred" list. Since there are no outgoing transfers via ftp, this log file will hold exactly what we need to move. It is located in /var/log/xferlog

max-zilla commented 8 years ago

And the logfile that @jdmaloney describes should mean we don't need to ask LemnaTec to modify their transfer script - the FTP action will appear in the logfile regardless and we can just initiate transfers based on that.

Only question would be, how big will that logfile get, when are things purged from it, best way to avoid duplication of transfer. Do we think about e.g. every week moving the log file to some archive and beginning a new one, or is that already solved for us by vsftpd?

jdmaloney commented 8 years ago

@max-zilla I can set it up with logrotate.d so that every day a new log file is created and the old one is either moved over (newest log file is just xferlog, all others are xferlog-yyyymmdd), or both moved over and compressed to save space. Both are very easy, just a couple lines in a conf file. This way your script only has to consider a day at a time, would that be beneficial?

In terms of size, all transfers up to this point encompass ~1.3 million files (lines in log file), which equates to about 250MB of log file.

max-zilla commented 8 years ago

@jdmaloney do you control the ncsa#roger-gantrytest globus endpoint? I'm running into trouble initiating transfers in python from ncsa#roger - I get a 409 Conflict expired credentials error, as far as I can tell it's because Roger has this as ID provider:

Identity Provider
Type   MyProxy OAuth
Hostoa4mp.ncsa.illinois.edu

...while the globus test endpoint I've successfully used before in a similar way is:

Identity Provider
Type   MyProxy
Hostmyproxy.globusonline.org

Their Oauth documentation talks about web applications routing through a login page to get an oauth token, it isn't discussed on the API side and I don't want to dig into that yet if I don't need to. Thought maybe the gantry endpoint might have different settings (although maybe not, it lists ncsa#roger as Host Endpoint).

Could you grant me access to ncsa#roger-gantrytest if possible, so I could see if it works any differently? Thanks.

jdmaloney commented 8 years ago

@robkooper @max-zilla Just wanted to update the thread. Max and I talked about the above comment.

Also I just finished analyzing the logs from the gantry over the past couple days. Lemnatec is doing a fairly continuous dump of the EnvironmentLogger data, however, all the moving sensor data (the vast majority of data) kicks off around midnight and continues until it is finished dumping that day's data (takes 4-9 hours so far depending on the day) and then stops and waits until the next day.

Thought this might be helpful as we try to figure out how often/when/how to grab data off the server. If Lemnatec is only doing the sensor dump once a day, it may not make sense for us to run continuously.

dlebauer commented 8 years ago

@markus-radermacher-lemnatec at what frequency are you planning to dump sensor data?

dlebauer commented 8 years ago

@jdmaloney where is the log file?

Do you or @markus-radermacher-lemnatec know what sensor(s) are producing the large number of files?

jdmaloney commented 8 years ago

@dlebauer Log file is at /var/log/xferlog on the gantry cache server; below is a snapshot of the sensor data quantities at this time. This is an accumulation of all data gathered so far.

uploads/d065ad2a-5960-4a07-8006-fdde0e0a31cb/Screen Shot 2016-04-05 at 1.44.06 PM.png

jdmaloney commented 8 years ago

@dlebauer Misread your question, file counts are now below as well. File count is actually just shy of 1.1 million

uploads/0ca1817d-0bc0-414a-b258-480362cb1ffe/Screen Shot 2016-04-05 at 2.18.07 PM.png

max-zilla commented 8 years ago

email update from 04/12:

JD and I had discussed Thursday as a possible point to pause the manual transfer from the gantry and try to slide in my monitor code. To that end, I’m trying to test the heck out of it with as many edge cases as possible, but generally the sooner the better to start getting things working at the other end while the data is less critical if we lose a file here and there.

Log file scraping This morning, I continued testing the log file scraping and I think we’re in pretty good shape. The basic logic is:

   Get most recently read log line from monitor_status.json file (initially will be blank on day 1)

o Get FTP transfer timestamp from that line (“last read”)

```
   Check for “xferlog” file in the configured folder and read the first line
```
o Compare [first line] timestamp to [last read] log line § If [first line] is before or equal to [last read] time, our resume point is somewhere in this file. § If [first line] is AFTER [last read] time, there’s a gap in our logs and we need to walk backward to an older log file and get the entries we might’ve missed.

o If our resume point is in this file, iterate through the lines until we find the line that matches our most recently read log line. Then starting on the next line, grab the rest of the entries that end in ‘c’ (completed) and queue them for transfer.

o If we need to walk backward, we expect the following naming structure: · xferlog (latest/current log file) · xferlog-1 · xferlog-2 · … · xferlog-4.gz (eventually these get g-zipped, not necessarily the 4th one back but somewhere) …so recursively check each file the same way (look at first line, is it here? If so, scan for the last line and start from there). However far back we go, we scan the rest of the backlog file(s) and then the current one. So if my last scanned line is found in the xferlog-2 archive file above, it’ll scan all the lines from xferlog-2 after that one, then all the lines from xferlog-1, then all the lines from xferlog until we either reach the end of the file or find an incomplete transfer entry. It’ll auto-unzip the .gz file if necessary.

EDGE CASE: If we have a last-read log file but we go allllllll the way back through the logs without finding it (this should never happen), it’ll just start from line 1 of xferlog.

The way I’ve tried to write it, we don’t really care where in the log files our record is as long as the files follow this naming convention – we’ll look as far back as we can to find the proper resume point, then get everything we can that’s ready.

Globus size management Gantry monitor now has a config entry max_transfer_file_count that defaults to 100 – if we have more than that many files queued for Globus transfers, we’ll batch them into that size. So 200 files with a 100 max size would be chunked into 2 transfers. Don’t know what this magic number will be, but intended to balance good utilization of Globus with not making transfers too big so Clowder has to wait for 5000 files to transfer via Globus before it can start ingesting.

Crash testing Trying to think of all the ways this could fail and handle gracefully. Examples include:

Gantry can’t talk to globus
Gantry can’t talk to NCSA
NCSA can’t talk to globus
NCSA can’t talk to Clowder
404/500/other responses from Clowder or Globus or other
System crashes during action

Part of this is being addressed via the .json log files, which periodically write data objects to disk (pending transfers, active transfers, created datasets, last read FTP log line, etc). The monitors have hooks on initialization that will look for these files and load the data it finds to resume from a previous point. These also keep a .backup copy when the file is updated. That said, I’m positive there are gotchas we’ll run into and I wanna make this as bulletproof as possible.

Clowder data structure I added some basic methods to create clowder datasets/collections as we go, akin to my script for uploading the local Roger files. You can define a single space in the config file where everything will live (but it can be blank). There will be one collection for each Sensor, one for each Date, and that’s all I’ve added so far. If we want to create collections by metadata etc. I have some things in place to support that but haven’t started down that path yet.

Purging gantry cache My config file right now just allows you to define a folder (default /home/gantry_delete) where gantry files will be moved once Clowder confirms the whole pipeline was completed for that file. This has not been tested at all yet, and I created the folder for now rather than outright deleting files for safety reasons. We can discuss what should happen here.

max-zilla commented 8 years ago

Todos that remain which may not be done before first rollout:

Collate the Environmental Logger JSON files. There are a ton of these and they're small - either LemnaTec or us should combine these into one longer file before transmission.
Refinement of Clowder organization - are there additional collections/spaces that should be created from metadata? Rob and I discussed a 'smart collection' notion that would be potentially defined by some XML structure to say "create collection with this metadata field as first level, then this metadata field as second level, etc" to automatically create a nice tree structure for navigating.

Will contact Globus to ask about:

Unexpected 409 NoCred ClientError (credentials not authorized) when initializing Globus transfer via python API. Encounter this error sometimes even after refreshing goauth credentials - error stops if I go to Globus site and go to Transfers page & select endpoints there. Don't even need to do a transfer from browser, just going to that interface does the trick. I wonder if there's an undocumented handshake of some kind I can do with API to avoid this.
Is it possible to sync timestamps between source and destination? In browser UI there is a checkbox to do this in a transfer, but I haven't found a way to do it via python API.

dlebauer commented 8 years ago

Proposed changes to the environmental logger file are listed in issue reference-data#26

dlebauer commented 8 years ago

Answers regarding Globus from @bd4 (via email):

The 409 NoCred error means that you need to re-activate the source and/or destination endpoints, likely because the credentials they were activated with have expired. It's a confusing name, but it's referring to the endpoint activation credentials, not the credentials used to authenticate to the REST API.

When you activate an endpoint, it saves the credential with the endpoint, for the entire user identity. This means that if you use an endpoint via www.globus.org, you will end up activating it, and that will affect any Transfer API scripts that are running under the same identity. Using the website concurrently with testing your script is one possible reason for the 409 NoCred going away.

For preserve mod time, the option is called "preserve_timestamp". It's listed in the docs here: https://docs.globus.org/api/transfer/task_submit/#transfer_specific_fields

max-zilla commented 8 years ago

I'm going to look into using api.endpoint_autoactivate(<endpoint_id>) when this error is encountered to hopefully avoid the endless error state I was running into before.

bd4 commented 8 years ago

Autoactivate will help, in that it will use credentials available for other endpoints if available. For example, if you have used one XSEDE endpoint on the web and activated it, any other XSEDE endpoint can be autoactivated, by copying the credential to the second endpoint's activation record. This is basically a hack to make activation look more like it's per identity provider, when the model we chose initially was per endpoint.

The thing is, that one credential that is used to activate the first XSEDE endpoint (when you get redirected to the XSEDE portal and have to sign in) will eventually expire. At that point, a new credential will need to be acquired somehow. You can't escape activation completely, unless your application is limited to using shared endpoints and Globus Connect Personal endpoints, which can always be autoactivated because they essentially rely on Globus as their identity provider (and don't require authenticating with a third party IdP).

max-zilla commented 8 years ago

@robkooper @dlebauer After a few last-minute things that needed to happen, the plan is for JD and I to switch over to my gantry monitor service on Monday morning.

@jdmaloney – you installed docker already, so on the cache server here's what we need to do.

Copy over a prepared version of config_custom.json with some key parameters filled in - I'll leave a 2nd comment to address this. The folder where you put this config file will also be where the /log and /completed files are written - this will be LOG_DIR below.

On the cache server: docker pull maxzilla2/terra-gantry-monitor

...to pull the image from dockerhub.

docker run –p 5455:5455 –v LOG_DIR:/home/gantry/data -v /gantry_data/LemnaTec/MovingSensor:/home/gantry/sensors -v /gantry_data/deleteme:/home/gantry/delete/sensors -v /var/log:/var/log maxzilla2/terra-gantry-monitor

...This will run the container. The -p flag will expose that port on the Gantry cache - this has no authentication barrier because it's intended for internal use only by the Gantry folks. In practice if the FTP log scraping is behaving we probably won't ever need to do so, but this allows one to POST file(s) to be transferred to globus at localhost:5455/files

The -v flags will mount 4 directories on the gantry cache inside the docker container:

directory where all the monitor log stuff and config_custom.json go.
directory that contains MovingSensors and the other raw data
directory where files will be moved (in same subdirectory structure) to be deleted
directory where xferlog, xferlog-1... xferlog-N.gzip are written from FTP.
The paths after the : are the default config values, so if we don't overwrite them in config_custom that's where the script will expect to find things inside the docker container. We just need to mount external directories inside the docker container at the correct paths.

We start up the NCSA piece on Roger with the proper credentials. I can test a good chunk of this beforehand. At that point we need to confirm: 1) The gantry script can ping Globus, Roger and our API successfully 2) The Roger script can ping Globus and receive messages from the gantry via the API 3) Globus endpoints are running

...then, we hold our breath and send a file via FTP to start it up.

max-zilla commented 8 years ago

GANTRY SIDE The script will load parameters from config_default.json, then update any overrides from config_custom.json. The default values I think we can avoid changing:

{
----- remember, these paths are INSIDE DOCKER, not on the gantry cache - we just mount to these
  "log_path": "/home/gantry/data/log/log.txt",
  "active_tasks_path": "/home/gantry/data/log/active_tasks.json",
  "pending_transfers_path": "/home/gantry/data/log/pending_transfers.json",
  "completed_tasks_path": "/home/gantry/data/completed",
  "status_log_path": "/home/gantry/data/log/monitor_status.json",

  "gantry": {
    "incoming_files_path": "/home/gantry/sensors",
    "deletion_queue": "/home/gantry/delete/sensors",
    "ftp_log_path": "/var/log",
    "file_check_frequency_secs": 120, ----- how often the FTP log is read for new files
    "globus_transfer_frequency_secs": 180, ----- how often to bundle Globus transfers
  },
  "globus": {
    "authentication_refresh_frequency_secs": 43200, ----- refresh Globus auth
    "max_transfer_file_count": 100 ----- max number of files for 1 Globus transfer
  },
  "ncsa_api": {
    "api_check_frequency_secs": 180 ----- how often to ask Roger API if a transfer is done
  },
  "api": { ----- this is the little API to manually submit jobs to gantry monitor
    "port": "5455", 
    "ip_address": "0.0.0.0"
  }
}

...and the ones we might want to override:

{
  "globus": {
    "destination_path": "/home/projects/arpae/terraref/.../", ----- where to put Globus transfers
    "source_endpoint_id": "...", ----- MAC endpoint
    "destination_endpoint_id": "...", ----- ROGER endpoint
    "username": "",
    "password": "" ----- credentials of Globus account that will do the transfers
  },
  "ncsa_api": { 
    "host": "http://141.142.0.0:5454" ----- Roger API (the other half of this puzzle)
  }
}

ROGER SIDE Same deal, the few default params that work:

{
  "globus": {
    "authentication_refresh_frequency_secs": 43200,
    "transfer_update_frequency_secs": 120 ----- how often to ask Globus if transfers done
  },
}

...and ones to change:

{ 
----- basically we replace ROGER with /path/to/folder/for/logs. No docker for this side.
  "log_path": "ROGER/data/log/log.txt",
  "active_tasks_path": "ROGER/data/log/active_tasks.json",
  "completed_tasks_path": "ROGER/data/completed",
  "dataset_map_path": "ROGER/data/log/datasets.json",
  "collection_map_path": "ROGER/data/log/collections.json",
  "status_log_path": "/ROGER/data/log/monitor_status.json",
  "globus": {
    "incoming_files_path": "/home/globus", ----- needs to match above - where to look for Globus files
    "valid_users": { ----- list of whitelisted Globus accounts that can POST to this monitor
        "GLOBUS_USERNAME": {
            "password": "",
            "endpoint_id": "..."
        }, 
        "GLOBUS_USERNAME2": {
            "password": "",
            "endpoint_id": "..."
        }, 
    }
  },
  "api": { ----- IP and port for this service on Roger
    "port": "5454",
    "ip_address": "0.0.0.0"
  },
  "clowder": {
    "host": "http://141.142.168.72:9000/clowder",
    "user_map": { ----- this maps Globus account -> Clowder account that will be owner of files
        "maxzilla": {
            "clowder_user": "mburnet2@illinois.edu",
            "clowder_pass": ""
          },
            "GLOBUS_USERNAME2": {
            "clowder_user": "",
            "clowder_pass": ""
          },
    },
    "primary_space": "..." ----- UUID of a space all things should be added to, if any
  }
}

max-zilla commented 8 years ago

Monitors are currently running, but there won't be much activity over the next couple days while they plant in the field it sounds like. We're going to try a couple FTP tests in the meantime.

Once the basics are confirmed functioning, there are still refinements to be made. One is to make sure the FTP logs are being archived/handled as expected. right now code expects "xferlog", "xferlog-1", "xferlog-2"... for progressively older FTP logs, but the original setting was writing them as timestamps like "xferlog", "xferlog-20160410", "xferlog-20160418", etc. either I would need to adjust my script to intelligently look back across the dates, or the FTP settings need to be adjusted to use the -1, -2 naming convention.

Will create issues for ToDos once the basic mechanism is tested some more.

max-zilla commented 8 years ago

I wrote up the deployment process in a google doc for now: https://docs.google.com/document/d/1TxBOgaJBYd5YA-c0J2B11XU4Vm_jYc1PPgpeSLmAQD4/edit?usp=sharing

max-zilla commented 8 years ago

And the pull request for the new monitor code (in the computing-pipeline repo) is here: https://github.com/terraref/computing-pipeline/pull/83

max-zilla commented 8 years ago

Pipeline has been up and running for about 16 hours now. It's still processing the substantial amount of backlog on Roger so it hasn't yet tried to scan a metadata file - not sure yet if @jdmaloney's permission adjustments will fix that issue. I bumped up max Globus transfer size to 500 files per, and put restrictions on max # of active transfers and pending files at a time to avoid running out of memory on cache like what happened on Wednesday.

I paused the bulk upload script that was processing the older files JD manually moved from cache in April - this is just a dumb spider that crawls all directories as we hadn't had any new data yet, but now that new data is coming in I'll need to refactor to only get the older files and not start re-uploading the files arriving via gantry monitor.

6.8 TB of data so far.

max-zilla commented 8 years ago

@jdmaloney one thing I noticed is that Globus autoactivation in code wasn't working - I was still needing to authenticate the gantry endpoint in the Globus UI periodically. I looked into their source code and realized they will not store the gantry creds for autoactivation - this would be the data_mover creds, not my maxzilla globus account.

To use manual activation I need 3 pieces: user/pass for data_mover (OK), and a 'proxy_chain' cert that is very poorly documented but as I understand it, they have the public key piece already:

{
        'name': 'public_key',
        'DATA_TYPE': 'activation_requirement',
        'required': False,
        'value': '-----BEGINPUBLICKEY-----\nMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA5VQX1Rr26MFo7cyYAInr\n+e5w3tMDz9H6R1T5DFGi0PKTKB7OVPdzHnzeP/Bc00BKKNcJrjM57h1PnjkbsY76\n3GEPlOJUuWCTbzrYhMjreD1pp1BWy7W+pdfjtvNumg2SVTxS0axDUVDaV8mWtqW4\nMVtmKoCjTKxrhTUP3VZ5q3iA+gzxxxpqX5yrjfSVk+QsPhkPQ8yBAJ32CWJI1riu\nzz1FvJw7UMqLa8QrSmA01jCJlsEjLxVfrcoYFB1uofBNg0tr7VNfWe7n7Ed0Iksb\n3faxgNIgr1ta7zQwYd9vhjJh87H0UQmv3Vn9pK3ZCLrqjt8mD8mWLY1HK3z1Mi8L\nLHTvI4xAwmClmi4qxbWH5iKPezP2zMFeSXKFWt7xz6ZIxYSEt8j0CCb6vJLCm3Wu\nFd9HiMeXA341ufaK/8m4FlA2aBPuiIN2ULlYC9BvAzmOhkieUgYwET+53IC89vi+\n60CFd1s1CeRtKSJ3hGso+ztErv6rILBZfZ0Aq2Da8ptqq0JjbmmOW3KYdC+cl3uR\n9VjFRMHhytp3SVQL9RrAv4Gv8pcu1JHeLp/Y7FG0sr//hISbs/o8Yj0mwoYRfE/r\nXc9ntErbytIbJGG3wAHQ2dt/+EqpF/v0yDRIUEub4mOrBZKgXVojH8i2uGt/Y8S6\nG8CZ11hEpf8zHxVKl1rCk3kCAwEAAQ==\n-----ENDPUBLICKEY-----\n',
        'private': False,
        'ui_name': 'ServerPublicKey',
        'type': 'delegate_proxy',
        'description': 'ThepublickeyoftheGOAPIservertouseintheproxycertificatefordelegationtoGO,
        inPEMformat.'
    },
    {
        'name': 'proxy_chain',
        'DATA_TYPE': 'activation_requirement',
        'required': True,
        'value': None,
        'private': False,
        'ui_name': 'ProxyChain',
        'type': 'delegate_proxy',
        'description': 'Aproxycertificateusingtheprovidedpublickey,
        inPEMformat.'
    },

See the last description - "a proxy certificate using the provided public key, in PEM format". I am assuming the public key above was created for data_mover on the gantry, as I did not create/provide it - it was returned by Globus.

So I'm trying to figure out which cert to use, none of the obvious ones I have locally are working but this isn't really my strong suit. I don't quite understand if this proxy cert already exists somewhere, or if it needs to be created using the private key (https://github.com/globusonline/transfer-api-client-python/blob/master/globusonline/transfer/api_client/x509_proxy/m2.py). I can reach out to Globus but thought I'd see if you know where this public key came from.

bd4 commented 8 years ago

@max-zilla The cert does need to be created using the private key (EDIT: the private key of the end entity cert. The private key of the proxy, corresponding to the public key from Globus, is kept on the server). Proxy certificates are certificates 'issued' by a normal end entity cert, instead of a by CA, but the process is similar to a CA issuing a certificate. In this case the client code creates a new certificate using the public key provide by the Globus Transfer API, issued by the credential you provide, and signed by the corresponding private key you provide. It will also only work if the end entity cert you create the proxy from is trusted by the endpoint - the resulting proxy certificate is a proxy for the original certificate, and has the same authentication power. Sorry if that's confusing, there is a lot of X.509 background here and I'm not sure what your familiarity is.

max-zilla commented 8 years ago

Thanks for the quick response, @bd4! I think I understand what's necessary. My familiarity is fairly nonexistent but it's improving, I'll follow up if I have more questions.

dlebauer commented 8 years ago

@max-zilla what is the status of this issue? what tasks remain before this can be closed?

max-zilla commented 8 years ago

Yes, we can close this. The pipeline is functional - I will create a separate issue for the remaining adjustment that might be needed (related to Globus authentication certificates) but not sure if this is necessary yet. Will close shortly.

max-zilla commented 8 years ago

https://github.com/terraref/computing-pipeline/issues/105 new issue here. Closing this one.

dlebauer commented 8 years ago

@max-zilla have the / will the relevant information from this thread be captured in the documenation?

max-zilla commented 8 years ago

@dlebauer yes, I have a very rough draft (mostly copied from here) of a pipeline document in my fork of the terra-ref documentation. Once I make it more presentable we can merge into main docs.

terraref / computing-pipeline

Complete Gantry -> Globus pipeline #79