plantinformatics / pretzel

JavaScript full-stack framework for Big Data visualisation and analysis
GNU General Public License v3.0

a simple architecture for interfacing to backend pipelines #126

Open Don-Isdale opened 5 years ago

Don-Isdale commented 5 years ago

In a number of areas we have discussed, it would be useful to have datasets which are calculated from other (raw) data, on demand, and cached for a reasonable time. Adding data on which a calculation has a defined dependency would prompt a re-calculation.

A simple way to build this would be to use make as the target interpreter, fronted by something like http://websocketd.com/ for access from the loopback node server (sketched below).

The calculation can be made available to users by defining dataset+blocks which don't contain data, but have a `meta.target`, e.g. `{function : "clari-TE", dataPath : "wheat/CS/Chris/7A", ..}`. Probably the dataset would identify the data pipeline, and the block would identify the data files and ranges (see the dataset sketch below).

When requested by the frontend, the server would use make to interpret the target and translate it into actions which produce the data, which is then sent as the reply to the request. This could of course be done in JavaScript, but make already has all the dependency management, and is a simple interface to all the shell-based data-processing languages/tools: Python / Perl / Awk / Shell / C++ / R / etc. The pipelines can use e.g. jq (stedolan.github.io/jq/) for converting tabular results to JSON. The calculated results can be held as files, and retired on a time basis, e.g. time since last read (a Makefile sketch follows below).

The make rule can easily launch commands in containers (e.g. BioConda containers) or in cluster queues (e.g. slurm), can launch multiple jobs in parallel (`--jobs`), and manages the dependencies between them. This approach would make it very easy for users who are self-hosting Pretzel to configure their own pipelines.

Using make as the target interpreter provides an abstraction layer which insulates the backend from the requests sent through the API; accepting shell commands from the frontend would not be a viable approach, for security reasons. Targets are a natural, easily defined abstraction which associates each target with its dependencies and actions. This architecture would provide reactive lazy evaluation and lazy loading derived from raw data in response to the user's GUI actions, e.g. zooming in would request the next level of detail.
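For concreteness, a data-less dataset carrying such a target might look like the sketch below. Only the `meta.target` fields (`function`, `dataPath`) come from the example above; the surrounding field names are hypothetical.

```js
// Sketch only: a dataset that contains no features itself, just a
// pipeline target; field names outside meta.target are guesses.
({
  name : "wheat-CS-7A-clariTE",        // hypothetical dataset name
  meta : {
    target : {
      function : "clari-TE",           // which pipeline to run
      dataPath : "wheat/CS/Chris/7A"   // raw data the pipeline reads
    }
  }
  // blocks would identify the data files and ranges within dataPath
});
```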
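As a sketch of the make side (the cache layout, raw-data paths, and `pipeline.sh` are invented for illustration; recipe lines must be tab-indented), a pattern rule could map a requested target to a pipeline run plus jq conversion:

```makefile
# Sketch only: cache/<path>.json is (re)built whenever the raw data it
# depends on is newer, giving the defined-dependency re-calculation.
CACHE := cache

$(CACHE)/%.json : rawdata/%.tsv
	mkdir -p $(dir $@)
	# run the (hypothetical) pipeline, then convert its tab-separated
	# output to a JSON array-of-rows with jq for the reply to the frontend
	pipeline.sh $< | jq --raw-input --slurp \
	  'split("\n") | map(select(. != "") | split("\t"))' > $@
```

Because make tracks the dependency graph itself, `make --jobs` parallelises independent requests for free; retiring stale cache entries on time-since-last-read would be a separate periodic sweep (not shown).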
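On the server side, websocketd turns any stdin/stdout program into a WebSocket service, so the loopback hookup from the node server could be as small as the sketch below. The port, script name, and message framing are assumptions: it presumes the wrapped script reads one target name per line, runs `make <target>`, and prints the resulting JSON on a single line.

```js
// Sketch only. Assumes websocketd is wrapping a script on loopback, e.g.
//   websocketd --port=8765 ./make-target.sh
// websocketd maps WebSocket messages to the wrapped program's
// stdin/stdout, one message per line.
const WebSocket = require('ws');   // the `ws` npm package

function requestTarget(target) {
  return new Promise((resolve, reject) => {
    const socket = new WebSocket('ws://127.0.0.1:8765/');
    socket.on('open', () => socket.send(target + '\n'));
    socket.on('message', (msg) => {
      socket.close();
      try { resolve(JSON.parse(msg)); } catch (e) { reject(e); }
    });
    socket.on('error', reject);
  });
}

// e.g. requestTarget('cache/wheat/CS/Chris/7A.json')
//        .then((data) => { /* send data as the API reply */ });
```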

make is installed by default on virtually every system, which is an attraction. There is a little-known alternative to make (github.com/apenwarr/redo) which is interesting and may have benefits, but GNU make has some powerful abstractions and features.

Don-Isdale commented 5 years ago

From webex discussion (Rad, Kieran, Don):

Rad mentioned two alternatives to make which include integration with bioinformatics workflows:

- Snakemake (https://snakemake.readthedocs.io/en/stable/) interfaces to BioConda
- Nextflow (https://www.nextflow.io/) interfaces to containers, cluster queues, and cloud platforms

Nextflow would make it easy to extend the architecture to include cluster and cloud processing. Nextflow is sometimes wrapped with make, to provide target interpretation and time-based dependencies; that combination may suit this application (see the sketch below).

Kieran mentioned upload of files from the front-end, and Rad noted this would be useful when users have a DNA sequence which is not matched in the loaded data and they want to search against datasets outside their Pretzel instance.
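As a sketch of that wrapping, applied to the sequence-search case Rad described (the `search.nf` pipeline and its `--query`/`--outdir` parameters are invented; Nextflow pipeline parameters are whatever the pipeline defines):

```makefile
# Sketch only: make supplies target interpretation and timestamp-based
# dependencies; Nextflow handles containers / cluster queues / cloud.
results/%/matches.json : uploads/%.fasta
	nextflow run search.nf --query $< --outdir $(dir $@)
```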

rsuchecki commented 5 years ago

Just remembered I've got these slides available

https://rsuchecki.github.io/nextflow_intro/nf-intro.html

This only scratches the surface, but may be a good starting point.