
Cylc Questions #81

Closed: kinow closed this issue 4 years ago

kinow commented 6 years ago

Luigi

Unlike Cylc, Luigi does not contain a (time-based) scheduler of its own. One can still use an external scheduler, such as cron, to schedule tasks with Luigi.

It provides dependency management, and tasks can execute commands via SSH.

Unlike Cylc, there is no DSL, nor an external file in YAML or JSON: tasks are configured via Python code.

When you execute a task in Luigi, it looks at the task's outputs and all of its dependencies to decide whether the task is already complete.
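As a minimal sketch of what that looks like (the file layout and paths are made up, loosely mirroring the examples.Foo task used further below):

#File: examples/foo.py -- a made-up minimal example, not Luigi's own
import luigi

class Dependency(luigi.Task):
    def output(self):
        # Luigi considers this task complete once this target exists
        return luigi.LocalTarget('/tmp/luigi-example/dependency.txt')

    def run(self):
        with self.output().open('w') as out:
            out.write('done\n')

class Foo(luigi.Task):
    def requires(self):
        # the dependency graph is declared in plain Python
        return Dependency()

    def output(self):
        return luigi.LocalTarget('/tmp/luigi-example/foo.txt')

    def run(self):
        with self.output().open('w') as out:
            out.write('foo\n')

if __name__ == '__main__':
    # runs with an in-process scheduler; no luigid needed for this
    luigi.build([Foo()], local_scheduler=True)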

Running luigid --pidfile /tmp/luigid/pid --logdir /tmp/luigid/ --state-path /tmp/luigid/state starts the daemon with the Tornado web application, or "Central Scheduler". Quoting the docs, the central scheduler is responsible for making sure two instances of the same task are not running simultaneously, and for providing visualization of everything that is going on.

When you first start the server, its web UI shows just the log file.

[Screenshot: luigid full-page view]

The dependency graph is simple, but uses colours that quickly indicate the status of each task.

[Screenshot: Luigi dependency graph]

It is possible to run the Foo task from the examples with the following line, to see a task in luigid:

PYTHONPATH="." luigi --module examples.foo examples.Foo --workers 2 --scheduler-host localhost --scheduler-port 8082

Note that when one submits a task, the client must indicate the scheduler (by default it uses localhost:8082). From what I understood from the documentation and a quick read of the code, the task running locally communicates with the scheduler to report its status.

The Task class also provides event hooks; the documentation has an example using on_success.
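A minimal sketch of such a hook (notify_team is a made-up helper, not part of Luigi):

import luigi

def notify_team(message):
    # hypothetical helper; replace with e-mail, Slack, etc.
    print(message)

class NotifyingTask(luigi.Task):
    def run(self):
        pass  # no-op body, just to make the sketch runnable

    def on_success(self):
        # invoked by the worker after run() finishes without errors
        notify_team('task %s succeeded' % self.task_id)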

By default the /history/ endpoint returns a 500 error. But if you enable the history feature, a page similar to Hadoop's old job history is displayed.

[Screenshot: Luigi task history page]

The scheduler must be started from a folder containing a luigi.cfg file (other locations, like /etc/luigi/client.cfg, also work), with something similar to:

#File: luigi.cfg
[scheduler]
record_task_history = True
state_path = /tmp/luigid/luigi-state.pickle

[task_history]
db_connection = sqlite:////tmp/luigid/luigi-task-hist.db

The database above contains the following schema.

$ sqlite3 luigi-task-hist.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE tasks (
    id INTEGER NOT NULL, 
    task_id VARCHAR(200), 
    name VARCHAR(128), 
    host VARCHAR(128), 
    PRIMARY KEY (id)
);
CREATE INDEX ix_tasks_name ON tasks (name);
CREATE INDEX ix_tasks_task_id ON tasks (task_id);
CREATE TABLE task_parameters (
    task_id INTEGER NOT NULL, 
    name VARCHAR(128) NOT NULL, 
    value TEXT, 
    PRIMARY KEY (task_id, name), 
    FOREIGN KEY(task_id) REFERENCES tasks (id)
);
CREATE TABLE task_events (
    id INTEGER NOT NULL, 
    task_id INTEGER, 
    event_name VARCHAR(20), 
    ts TIMESTAMP NOT NULL, 
    PRIMARY KEY (id), 
    FOREIGN KEY(task_id) REFERENCES tasks (id)
);
CREATE INDEX ix_task_events_ts ON task_events (ts);
CREATE INDEX ix_task_events_task_id ON task_events (task_id);
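
With that schema, the task history can also be inspected outside the web UI. A small sketch, reusing the db_connection path from the luigi.cfg above:

#File: hist_query.py -- a sketch against the schema shown above
import sqlite3

conn = sqlite3.connect('/tmp/luigid/luigi-task-hist.db')
query = (
    'SELECT t.name, e.event_name, e.ts '
    'FROM task_events e JOIN tasks t ON t.id = e.task_id '
    'ORDER BY e.ts DESC LIMIT 10'
)
for name, event_name, ts in conn.execute(query):
    # most recent task events first
    print(name, event_name, ts)
conn.close()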

Note: the idea of a local scheduler, or of submitting the job to a central scheduler, could be a good choice for the implementation in Cylc.

kinow commented 6 years ago

Parsec

Parsec is the parser within Cylc. It resembles ConfigObj, and the format looks similar to INI files; quoting the docs, it is actually an INI-style config file.

Here's a rough ASCIIFlow diagram. The user has an interface/function, which can be a class in Python (e.g. RawSuiteConfig in suite.py).

That interface calls the parser, which reads and processes the file with Jinja2, does some inlining, and later passes it over to the validators. The validators both check the validity of the data and coerce it.

     +------------+                 +--------------+
     |  Parsing   +---------------> |  Validation  |
     +-----+------+                 +----+---------+
           ^                             |
           |                             |
           |                             |
           |                             |
+----------------+                       |         +->  +----------------+
| +------------+ |                       |         |    |  Deprecated    |
| |            | |          +------------v-----+   |    +----------------+
| |  Function  | |          |File Specification|   |    +----------------+
| | Interface  | |          |  (a dictionary)  +----->  |  Obsolete      |
| |            | |          +------------------+   |    +----------------+
| +------------+ |                                 |    +----------------+
+----------------+                                 |    |  Upgrade       |
          ^                                        +->  +----------------+
          |
          |
          |
          |
    +-----+------+
    |            |
    |            |
    |   USER     |
    |            |
    |            |
    +------------+

Another feature of the parser is its support for obsolete and deprecated settings. If the syntax changes between Cylc releases, the parser supports both old and new syntax, raising issues when necessary, telling users what changed, and using the values correctly.

Here's a quick list of things that happen when the suite.rc file is parsed (a rough sketch of steps 10 and 11 follows the list).

  1. read parsec file (i.e. suite.rc)
  2. if there's a child folder ./lib/python, it will be included when processing Jinja files, so users can have extra Jinja filters & functions
  3. split the text by \n, creating an array
  4. inline it (i.e. any included file)
  5. process with Jinja2
  6. concatenate continuation lines (e.g. some_cmd \)
  7. write processed file suite.rc.processed
  8. discard comments
  9. remove blank lines
  10. if the line is a section heading, check that the ['s are balanced, and parse it
  11. if key=value, then process it, adding it to the parent/current node
  12. check upgrade/deprecated/obsolete/etc
  13. finally validate it against the spec
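
A rough sketch of steps 10 and 11 (not the real parsec code, which also handles inlining, Jinja2, and continuation lines):

#Minimal sketch of steps 10-11: nested [..] headings and key=value pairs
import re

HEADING = re.compile(r'^(\[+)\s*(.+?)\s*(\]+)$')
KEY_VALUE = re.compile(r'^([^=\[]+?)\s*=\s*(.*)$')

def _node(root, stack):
    # walk/create the nested dicts down to the current section
    node = root
    for section in stack:
        node = node.setdefault(section, {})
    return node

def parse(lines):
    root = {}
    stack = []  # names of the currently open sections
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # steps 8-9: drop comments and blank lines
        match = HEADING.match(line)
        if match:
            opens, name, closes = match.groups()
            if len(opens) != len(closes):  # step 10: bracket balance check
                raise ValueError('unbalanced heading: ' + raw)
            # (no error checking for skipped nesting levels: it is only a sketch)
            stack = stack[:len(opens) - 1] + [name]
            _node(root, stack)
            continue
        match = KEY_VALUE.match(line)
        if match:  # step 11: add the item to the current node
            _node(root, stack)[match.group(1).strip()] = match.group(2).strip()
    return root

# parse(['[parent1]', 'name = x', '[[children]]', 'a = 1'])
# -> {'parent1': {'name': 'x', 'children': {'a': '1'}}}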

The spec is a multi-level dictionary in Python, which tells, for each level, what's allowed, e.g.

{
    'parent1': {
        'name': some_validator,  # a validation/coercion function
        'children': { ... },
    },
}

Which would be used for a file like

[parent1]
name = ...

[[children]]
...
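
A sketch of how such a spec could drive validation and coercion (the layout and validator functions are illustrative, not parsec's actual API):

#Illustrative spec-driven validation, not parsec's real code
def coerce_str(value):
    return str(value)

def coerce_int(value):
    return int(value)  # validators both check and convert the raw string

SPEC = {
    'parent1': {
        'name': coerce_str,
        'children': {
            'age': coerce_int,
        },
    },
}

def validate(config, spec, path=()):
    for key, value in config.items():
        if key not in spec:
            raise ValueError('illegal item: ' + '/'.join(path + (key,)))
        if isinstance(value, dict):
            validate(value, spec[key], path + (key,))  # recurse into sections
        else:
            config[key] = spec[key](value)  # leaf: coerce via the spec's validator

# validate({'parent1': {'name': 'x', 'children': {'age': '1'}}}, SPEC)
# coerces 'age' to the integer 1, and rejects unknown items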

As far as I could tell, indentation is merely aesthetic: users can use whatever tabs/spaces they like. The parser/validator rely heavily on regular expressions, applying \s, which matches spaces and tabs.

hjoliver commented 6 years ago

Good spotting - we did in fact use ConfigObj in the early days of Cylc (although not the really early days, when Cylc had no config file at all, just separate task definitions). But eventually we ran into limitations of ConfigObj, so I wrote my own stripped-down, faster parser (I am not very proud of the code, but it works, and is faster... but in future we want a Python API anyway).

kinow commented 6 years ago

@hjoliver thanks! I was looking for the grammar of Cylc's configuration files, to see if I could quickly create a custom editor with autocomplete for Eclipse. Syntax highlighting is OK, but I still lose a lot of time looking at other suites and/or at the CUG.

If I manage to find more time, besides being able to use the editor in Eclipse & PyDev (my favorite environments for Java/Python), it would be possible to use the same source to produce a web editor with autocomplete :-) all done through Xtext, of course.

When I manage to get a working example, I will share it on the mailing list to see if others would be interested in this too.

hjoliver commented 6 years ago

The "grammar" is just key = value with nested headings, and items defined by lib/cylc/cfgspec/suite.py (as you've noted). A clearer way to see all allowed items is cylc get-suite-config <suite>. Parsing the content of graph string values is more difficult ... maybe that could be done better with Xtext too...

kinow commented 5 years ago

JupyterHub

JupyterHub starts up a Node.js reverse proxy. The Hub manages authentication and routing. So when you request a Jupyter Notebook (e.g. something like https:///kinow/notebook-cylc), the Hub starts by checking that you have the right credentials in your request.

If authentication and authorization pass, the next step is to start the Jupyter server. This server is the same one you get when you run jupyter notebook, but it is controlled by the Hub, which keeps track of it.

Within the server, the Jupyter Notebook HTTP layer communicates with a kernel (normally Python, but it can be any language) via ZeroMQ. That's the only part where HTTP is not used (though I am not entirely sure).

Once JupyterHub has authenticated the user and started (spawned is probably the right term) the Jupyter Notebook server, subsequent requests are handled transparently by the Node.js reverse proxy.

JupyterHub keeps track of metrics and exposes them on a REST endpoint for Prometheus, which is useful for monitoring.

JupyterHub supports multiple authentication mechanisms (OAuth, Linux PAM, dummy, user+password from a database, etc.), and also provides multiple spawners: local process, Docker, and others.
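
For illustration, a minimal jupyterhub_config.py choosing one of each might look like the sketch below. The dummy authenticator and DockerSpawner are real components, but treat the exact settings as assumptions to check against the docs for your version:

#File: jupyterhub_config.py -- a sketch; `c` is provided by JupyterHub itself
# "dummy" authenticator: accepts any username (for development/testing only)
c.JupyterHub.authenticator_class = 'jupyterhub.auth.DummyAuthenticator'

# spawn each user's single-user notebook server in a Docker container
# (requires the separate dockerspawner package)
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'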