intel / dffml

The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.
https://intel.github.io/dffml/main/
MIT License

docs: dataflows: Improve docs #1279

Closed pdxjohnny closed 2 years ago

pdxjohnny commented 2 years ago

These are notes and scratch work around the purpose and future of the project.

Mission: Provide a clear, meticulously validated, ubiquitously adopted reference architecture for an egalitarian Artificial General Intelligence (AGI) which respects the first law of robotics.

To do so we must enable the AGI with the ability to act in response to the current system context where it understands how to predict possible future system contexts and understands which future system contexts it wishes to pursue are acceptable according to guiding strategic plans (such as do no harm). We must also ensure that human and machine can interact via a shared language, the universal blueprint.

AI has the potential to do many great things. However, it also has the potential to do terrible things. Recently there was an example of scientists who used a model that was good at generating life-saving drugs, in reverse, to generate deadly poisons. GPU manufacturers recently implemented anti-crypto-mining features. Since the ubiquitous unit of parallel compute is a GPU, this stops people from buying up GPUs for what we as a community at large have deemed undesirable behavior (hogging all the GPUs). There is nothing stopping those people from building their own ASICs to mine crypto. However, the market for that is a subset of the larger GPU market: cost per unit goes up, multi-use capabilities go down. GPU manufacturers are effectively able to ensure that the greater good is looked after because GPUs are the ubiquitous facilitator of parallel compute. If we prove out an architecture for an AGI that is robust, easy to adopt, and integrates with the existing open source ecosystem, we can bake in this looking after the greater good.

As we democratize AI, we must be careful not to democratize AI that will do harm. We must think secure by default in terms of architecture which has facilities for guard rails, baking safety into AI.

Failure to achieve ubiquitous adoption of an open architecture with meticulously audited safety controls will result in further consolidation of wealth and widening inequality.

pdxjohnny commented 2 years ago

By convention, for operations which have a single output, we usually name that output result
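As an illustration of the convention (a plain-Python sketch, not the DFFML operation API itself; the helper names here are hypothetical):

```python
def run_operation(op, **inputs):
    """Run an operation and, by convention, key its single output as "result"."""
    return {"result": op(**inputs)}

def add(a: int, b: int) -> int:
    return a + b

outputs = run_operation(add, a=1, b=2)
# outputs == {"result": 3}
```

Downstream operations can then always wire a single-output operation by referencing its "result" key rather than a per-operation name.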

pdxjohnny commented 2 years ago

Generic flow (data, work, program) executor

pdxjohnny commented 2 years ago

Why unikernels: they can build the smallest possible attack surface. One could even build silicon / RTL to optimize for a specific data flow.

pdxjohnny commented 2 years ago

Manifest Schema

Manifests allow us to focus less on code and more on data. By focusing on the data going into and out of systems, we can achieve standard documentation of processes via a standard interface (manifests).

Our manifests can be thought of as ways to provide a config class with its parameters, or ways to provide an operation with its inputs.
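For instance, a manifest for the greeting example in the ADR template below might look like this (the field names and schema URL are illustrative, not a fixed format):

```yaml
# Hypothetical my-format-name manifest, version 0.0.1
$schema: https://example.com/my-format-name.0.0.1.schema.json
greeting: Hello
entity: alice
```

The schema file then pins down which keys are required and what types they must have, so any consumer can validate a manifest before acting on it.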

References:

Validating

Install the pyyaml and jsonschema Python modules:

pip install pyyaml jsonschema

Convert from YAML to JSON:

$ python -c "import sys, pathlib, json, yaml; pathlib.Path(sys.argv[-1]).write_text(json.dumps(yaml.safe_load(pathlib.Path(sys.argv[-2]).read_text()), indent=4) + '\n')" manifest.yaml manifest.json

The example below validates. Checking the status code, we see exit code 0, which means success: the document conforms to the schema.

$ jsonschema --instance manifest.json manifest-format-name.0.0.2.schema.json
$ echo $?
0

Writing

Suggested process (in flux)

ADR Template

my-format-name
##############

Version: 0.0.1
Date: 2022-01-22

Status
******

Proposed|Evolving|Final

Description
***********

ADR for a declaration of assets (manifest) involved in the process
of greeting an entity.

Context
*******

- We need a way to describe the data involved in a greeting

Intent
******

- Ensure valid communication path to ``entity``

- Send ``entity`` message containing ``greeting``
pdxjohnny commented 2 years ago

State transition, issue filing, and estimating time to close an issue all depend on having the complete mapping of inputs to the problem (the data flow). If we have an accurate mapping then we have a valid flow, and we can create an estimate that we understand how we created, because we have a complete description of the problem. See also: estimation of GSoC project time, estimation of time to complete best practices badging program activities, and time to complete any issue. This helps with prioritization of who in an org should work on what, and when, to unblock others in the org. Related to the builtree discussion.
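One way to picture this (a sketch with made-up operations and durations, not a DFFML API): once the flow for an issue is fully mapped, a time-to-close estimate falls out as the critical path through the operation dependency graph.

```python
# "deps" maps each operation to the operations it waits on;
# "cost" maps each operation to an estimated duration in hours.
deps = {
    "triage": [],
    "fix": ["triage"],
    "test": ["fix"],
    "review": ["fix"],
    "merge": ["test", "review"],
}
cost = {"triage": 1, "fix": 4, "test": 2, "review": 3, "merge": 1}

def finish_time(op):
    # Earliest completion: our own cost after all dependencies finish.
    return cost[op] + max((finish_time(d) for d in deps[op]), default=0)

estimate = max(finish_time(op) for op in deps)
# estimate == 9 (critical path: triage -> fix -> review -> merge)
```

Because the estimate is derived from an explicit graph, we can also explain it: the critical path itself is the justification.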

pdxjohnny commented 2 years ago

We use dataflows because they are a declarative approach which allows you to define different implementations based on different execution environments, or even swap out pieces of a flow, or apply overlays to add new pieces.

They help solve the fork-and-pull-from-upstream problem. When you fork code and change it, you need to pull in changes from the upstream (the place you forked it from). This is difficult to manage alongside the changes you have already made; using a dataflow makes it easy, as we focus on how the pieces of data should connect rather than on the implementations of their connections.

This declarative approach is important because the source of inputs changes depending on your environment. For example, in CI you might grab a value from an environment variable populated from secrets. In your local setup, you might grab it from the keyring.
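As a rough sketch of that idea (names are hypothetical, not DFFML's overlay API): the flow declares what it needs by name, and an environment-specific overlay decides which implementation supplies each input.

```python
import os

def from_env(name):
    # CI overlay: pull the value from an environment variable.
    return lambda: os.environ[name]

def from_static(value):
    # Local overlay stand-in: e.g. a value fetched from the keyring.
    return lambda: value

ci_overlay = {"api_token": from_env("API_TOKEN")}
local_overlay = {"api_token": from_static("token-from-keyring")}

def resolve(flow_inputs, overlay):
    # The flow only names its inputs; the overlay supplies them.
    return {name: overlay[name]() for name in flow_inputs}

inputs = resolve(["api_token"], local_overlay)
# inputs == {"api_token": "token-from-keyring"}
```

Swapping `local_overlay` for `ci_overlay` changes where the data comes from without touching the flow itself, which is the point of keeping the connections declarative.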

pdxjohnny commented 2 years ago

Notes from work in progress tutorial:

We need to come up with several metrics to track and plot throughout. We also need to plot each metric in relation to the others for tradeoff analysis.

We could also make this a choose-your-own-adventure-style tutorial: if you want to do it with threads, here are your output metrics. We can later show that we're getting these metrics by putting all the steps into a dataflow and getting the metrics out by running them. We could then show how we can ask the orchestrator to optimize for speed, memory, etc. Then add in how you can have the orchestrator take those optimization constraints from dynamic conditions, such as how much memory is on the machine you are running on, or whether you have access to a k8s cluster. Also talked about the power consumption vs. speed trade-off for server vs. desktop. Could add in edge constraints like network latency.

Will need to add in metrics API and use in various places in orchestrators and expose to operations to report out. This will be the same APIs we'll use for stub operations to estimate time to completion, etc.

This could be done as an IPython notebook.

pdxjohnny commented 2 years ago

InputNetwork: any UI is just a query off of the network for data linkages. Any action is just a re-trigger of a flow. On flow execution end, combine caching with a central database so that alternate output queries can be run later, enabling a data lake.

pdxjohnny commented 2 years ago

Classes become systems of events (dataflows) where the interface they fit into is defined by contracts (manifests)

pdxjohnny commented 2 years ago

To implement an interface one must satisfy system usage constraints, i.e. be ready to accept certain events (manifest) and fulfill the contract. The implementation might also need to emit certain events (inputs manifest).
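A toy sketch of the "manifest as contract" idea (all names here are illustrative): the manifest lists the events an implementation must accept, and conformance is just checking that a handler exists for each.

```python
# The contract: which events an implementation must be ready to accept.
manifest = {"accepts": ["greeting.create", "greeting.send"]}

class Greeter:
    # A "class as a system of events": one handler per event it accepts.
    def on_greeting_create(self, msg):
        return f"Hello, {msg}"

    def on_greeting_send(self, msg):
        return True

def satisfies(impl, manifest):
    # Map handler method names back to event names and compare to the contract.
    handlers = {
        name.replace("on_", "", 1).replace("_", ".")
        for name in dir(impl)
        if name.startswith("on_")
    }
    return set(manifest["accepts"]) <= handlers
```

Here `satisfies(Greeter(), manifest)` holds, so `Greeter` fits the interface the manifest defines; an implementation missing a handler would fail the check before ever being wired into a flow.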

pdxjohnny commented 2 years ago

Run whatever you want, wherever you want, however you want, with whatever you want, for whoever you want.

pdxjohnny commented 2 years ago

Hitting Critical Velocity. The fully connected dev model.

pdxjohnny commented 2 years ago

City planning as dataflows plus CI

Imagine you're playing a city simulator. Each building has an architecture and purpose within the architecture of your overall city. Imagine that there are certain guiding overall strategies which the entities within the city understand must be taken into account to perform any actions they're directed to do. For example, one strategic goal or piece of a strategic plan might be that the city should always collect garbage, and there should never be a day where garbage is not collected from more than 75% of the residents. The garbage crews as agents need to know that their course of action, in terms of actions they should take or next steps sent by the city, has been vetted by the strategic plan which involves the assurance of residents' garbage being picked up at the expected percentage. Entities also make decisions based on data used to train their models in an active learning situation. Data used to train agent actions / strategic plans should come only from flows validated by a set of strategic plans, or strategic plans with certain credentials (verified to ensure "kill no humans" is applicable for this subset of data). This will allow us to add controls on training models, to ensure that their data does not come from sources which would cause malicious behavior, or behavior unaligned with any other active strategic plans. We must also be able to add new strategic plans on the fly and modify the top level strategic decision maker. This example maps to the provenance information we will be collecting about the plans given to agents or inputs given to opimps. This provenance information must include an attestation or valid claim that certain sets of strategic plans were taken into consideration by the top level strategic decision maker when the orders come down to that agent or opimp.


Optimize for efficiency in a post-capitalism society. Map people and what makes them happy and feel good health-wise, things they jive with conceptually. This is like finding the optimal agent to run the job to execute any active strategic plans (model optimization targets), because certain agents or opimps have attributes like attestation abilities, which is why we might pick them if one of our active strategic plans is to optimize for hardening against threats within a threat model (effectively alternate mitigations, which show the relationship between the intent to secure assets and maintaining the security properties of the system). Usage of something is a metric that might be accounted for by the strategic optimization models. One hour, one cycle, one instance.

How do you know what the modifications to the strategy (system context) should be?

- Predict the outputs plus structured logged metrics based on models of historical data.
- Run optimization collector flows across all the predicted flow + system context permutations.
- Use AutoML feature engineering to generate new possibilities for the system context (alternative input values).
- Create alternate flows using a threat model alternative mitigation implementor, which understands optimizing for strategic security goals with regard to asset protection by understanding intent via an intuitive shared human/machine language: dataflows.

The models we build on top of the data from the optimization collector flows are the strategic plans. These models are effectively encoder/decoder language translation models with high accuracy, as assessed by an arbitrary accuracy scorer (we may need a way to assess the aggregate accuracy as the percentage of good thoughts). Their individual scores tell us something within their scorer's description of meaning (a human scorer in the case of allowlist forms).

For example, when optimizing for security, a strategy could output (or raise an exception for) a value that says this is an absolute veto moment for that strategy, signifying we should not act on the result of the prediction (similar to an allowlist set to conditional if crypto is detected). There is a top level strategic model which makes the final decisions as to which system contexts will be explored (which thoughts are acted on and which are thought through further). It's almost like we tie the agents in and we are thinking many thoughts (dataflows): some thoughts we want to act on, and some we want to continue to think about, to see how they play out along more theoretical paths and maybe even variations on the strategic plans used to create those system input contexts. We add real data back into the training sets as we play out the real paths. Strategic plans get weighted by the accuracy of their models by another encoder/decoder running in the top level strategic model (this one is different for each person, effort, deployment, engagement).

Predict the future with me. Open source AI for a post-capitalism society.

You can satisfy the "kill no humans" requirement by using DICE for device-to-device attestation, where devices only execute orders that can be validated via the provenance information metadata of the system context inputs applicable to this agent/opimp, tied back via a DID provenance chain to a plan which had to be thought of by the top level strategic plan, which ran with the attestation provenance model and accepted its veto with ultimate veto authority. Therefore we know that any plan that would, Bender-style, kill all humans would have been stopped.

An architecture for generic artificial intelligence: we can collectively figure out how to organize to achieve goals (business, climate change, etc.) using this architecture as a mechanism for communication.
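The veto mechanic described here can be sketched in a few lines (all names and scoring functions are hypothetical, not DFFML code): each strategic plan scores a proposed system context, any plan may veto outright, and the top level decision maker only ranks the contexts nobody vetoed.

```python
# Sentinel a strategic plan returns to exercise absolute veto power.
VETO = object()

def do_no_harm(ctx):
    # Safety plan: veto any context flagged as harmful.
    return VETO if ctx.get("harmful") else 1.0

def optimize_speed(ctx):
    # Optimization plan: prefer contexts with lower predicted runtime.
    return 1.0 / ctx.get("runtime", 1.0)

def decide(contexts, plans):
    # Top level strategic decision maker: drop vetoed contexts,
    # then pick the best total score among the rest.
    allowed = []
    for ctx in contexts:
        scores = [plan(ctx) for plan in plans]
        if VETO not in scores:
            allowed.append((sum(scores), ctx))
    return max(allowed)[1] if allowed else None

chosen = decide(
    [{"harmful": True}, {"runtime": 2.0}],
    [do_no_harm, optimize_speed],
)
# chosen == {"runtime": 2.0}; the harmful context was vetoed outright.
```

The key property is that a veto is not a low score: no amount of speed can outweigh it, which is the "ultimate veto authority" behavior described above.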

pdxjohnny commented 2 years ago

Alice's Adventures in Wonderland

Blog series

Together we'll build Alice, an Artificial General Intelligence. We'll be successful when Alice successfully maintains a DFFML plugin as the only maintainer for a year. Debugging issues, writing fixes, reviewing code, accepting pull requests, refactoring the code base post PR merge, dealing with vulnerabilities, cutting releases, maintaining release branches, and completing development work in alignment with the plugin's universal blueprint. She will modify, submit pull requests to, and track upstreaming of patches to her dependencies to achieve the cleanest architecture possible. We'll interact with her as we would any other remote developer.

We'll need to build the foundations of Alice's thought processes. Throughout this series, we'll rely heavily on a mental model based on how humans think and problem solve. By the end of this series we'll have ensured Alice has all the primitive operations she requires to carry out the scientific process.

Terminology

Expectations

Alice is going to be held to very high standards. We should expect this list to grow for a long time (years). This list of expectations may at times contain fragments which need to be worked out more, and are only fragments so the ideas don't get forgotten.

Alice's Understanding of Software Engineering

We'll teach Alice what she needs to know about software engineering through our InnerSource series. She'll follow the best practices outlined there. She'll understand a codebase's health in part using InnerSource metric collectors.

pdxjohnny commented 2 years ago

What we end up with is a general purpose reinforcement learning architecture. This architecture can be fed any data and make sense of how the data relates to its universal blueprint. The trained models and custom logic that form its understanding of how the data relates to its universal blueprint are its identity. As such, our entity named Alice will be trained on data making her an open source maintainer.

We'll show in a later series of blog posts how to create custom entities with custom universal blueprints (strategic goals, assets at their disposal, etc.). Entities have jobs; Alice's first job is to be a maintainer. Her job is reflected in her universal blueprint, which will contain all the dataflows, orchestration configs, dataflows used to collect data for and train models used in her strategic plans, as well as any static input data or other static system context.

We can save a "version" of Alice by leveraging caching. We specify dataflows used to train models which are then used in strategic plans. Perhaps there is something here: on dataflow instantiation, query the inodes from shared config; sometimes a config will be defined by the running of a dataflow which will itself consume inputs or configs from other inodes within shared config. So, on dataflow instantiation, find leaf nodes in terms of purely static plugins to instantiate within the shared configs region. This shared config linker needs to have access to the system context. For example, if the flow is the top level flow triggered from the CLI, then the system context should contain all the command line arguments somewhere within its input network (after kick off, or when looking at a cached copy from after kick off). Defining a plugin can be done by declaring it will be an instance where the config provided is the output of a dataflow. That dataflow can be run as a subflow with a copy-on-write version of the parent system context (for accessing things like the CLI flags given). There could be an operation which runs an output operation dataflow on the CoW parent system context. That operation's output can then be formed into its appropriate place in the config of the plugin it will be used to instantiate.

We will of course need to create a dependency graph between inodes. We should support requesting re-instantiation of instances within shared configs via event based communication to the strategic decision maker. Configuration and implementation of the strategic decision maker (SDM) determine which active strategic plans are taken into account. The SDM must provide attested claims for each decision it makes when any data is sent over potentially tamperable communication channels (this needs an understanding of the properties of all plugin instances in use; for example, everything in memory is different than an operation implementation network connected over the internet).

pdxjohnny commented 2 years ago

Song for talk: https://www.azlyrics.com/lyrics/jeffersonairplane/whiterabbit.html

pdxjohnny commented 2 years ago

A serializable graph data structure with linkage can be used for "shared config": just add another property like an inode to the plugin config (BaseConfigurable code in dffml.base). Then populate configs based off instantiated plugins with inodes in the shared_configs section.
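A minimal sketch of the linking step (the structure and field names here are hypothetical, not the dffml.base implementation): plugin configs reference shared instances by inode, and a linker resolves those references at instantiation time.

```python
# Shared instances, keyed by inode.
shared_configs = {
    "db:main": {"plugin": "sqlite", "config": {"path": "db.sqlite"}},
}

# A plugin whose config refers to a shared instance by inode
# instead of duplicating its configuration.
plugins = [
    {"plugin": "cache", "config": {"source": {"inode": "db:main"}}},
]

def link(node):
    # Recursively replace {"inode": ...} references with the shared entry.
    if isinstance(node, dict):
        if set(node) == {"inode"}:
            return link(shared_configs[node["inode"]])
        return {key: link(value) for key, value in node.items()}
    if isinstance(node, list):
        return [link(value) for value in node]
    return node

resolved = link(plugins)
# resolved[0]["config"]["source"] is now the full sqlite entry.
```

Because references are resolved rather than copied at authoring time, two plugins pointing at the same inode share one instance, and the whole structure stays serializable.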