hestiaAI / hestialabs-experiences

HestiaLabs Data Experiences & Digipower Academy
https://digipower.academy

Overview of desirable features (experience builder, search, sharing...) #571

Closed andreaskundig closed 2 years ago

andreaskundig commented 2 years ago

Overview

This issue describes a "vision" of the features we might want to develop. It tries to summarize and integrate discussions with Paul-Olivier, Charles, François, Andreas, Hugo and Thomas.

It is structured around the experience builder, which seemed the most important goal at first, but includes thoughts about how to incorporate Hugo and Thomas' work, and other ideas.

The experience builder allows a facilitator external to Hestia to set up experiences for a workshop with little help from Hestia and as little coding as possible.

Defining an experience involves:

  1. finding relevant data (and storing its location with accessors)
  2. defining visualizations
  3. defining the data export
  4. exporting the experience as JSON (for sharing or deploying online)

A cross-cutting concern is

  1. obfuscating the data

We conclude with implementation thoughts

  1. possible next steps

1. Finding relevant data

1.a The file explorer

Files are shown in a tree and can be opened in a file viewer that shows "nodes" of data, each of which has an accessor.
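To make the accessor idea concrete, here is a minimal sketch, assuming an accessor is essentially a file path plus a path into the parsed content of that file (all names are hypothetical, not the actual API):

```js
// Hypothetical illustration of an accessor: it points at one "node" of data
// inside one of the files of a data export.
const accessor = {
  filePath: 'data/account-creation-ip.js',   // which file in the export
  jsonPath: 'accountCreationIp.accountId'    // which node inside the parsed file
};

// Resolve a simple dot/bracket path like "a.b[3].c" against a parsed object.
function resolve(parsed, path) {
  const keys = path.match(/[^.[\]]+/g) || [];
  return keys.reduce((node, key) => (node == null ? undefined : node[key]), parsed);
}

// resolve({ accountCreationIp: { accountId: 12345 } }, accessor.jsonPath) -> 12345
```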

1.b Search

Look in files (all of them, or a subset) for occurrences of a string or a regex. Restrict results by date and location ranges. (See how Thomas restricts dates and locations.)

The subset of files being searched could be defined by

(See implementation of search by type in the generic date/location viewers).

Search results are displayed in a table; dates/locations could additionally be displayed as a timeline/map.

|     | value | JSON path | file name | tags |
| --- | ----- | --------- | --------- | ---- |
| [x] | foo | bar.baz[3] | a.json | François, Twitter |
| [ ] | 2022-03-08 | foo.b | cd.csv | Andréas, Facebook |
| [x] | {lat: 123, long: 234, val: "mim"} | bafoo[5] | bc.json | Saddam, Tinder |
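A very rough sketch of what such a search over already-parsed files could look like (function and field names are hypothetical; date/location range restriction is left out):

```js
// Hypothetical sketch: search parsed files for a string or regex and return
// rows like the table above (value, JSON path, file name, tags).
function searchFiles(files, query) {
  const matches = [];
  const test = query instanceof RegExp
    ? v => query.test(String(v))
    : v => String(v).includes(query);

  const walk = (node, path, file) => {
    if (node !== null && typeof node === 'object') {
      for (const [key, child] of Object.entries(node)) {
        const childPath = Array.isArray(node) ? `${path}[${key}]` : `${path}.${key}`;
        walk(child, childPath, file);
      }
    } else if (test(node)) {
      matches.push({ value: node, jsonPath: path, fileName: file.name, tags: file.tags });
    }
  };

  for (const file of files) walk(file.content, '$', file);
  return matches;
}

// searchFiles([{ name: 'a.json', tags: ['Twitter'], content: { bar: { baz: [0, 1, 2, 'foo'] } } }], 'foo')
// -> [{ value: 'foo', jsonPath: '$.bar.baz[3]', fileName: 'a.json', tags: ['Twitter'] }]
```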

Actions on each search result:

1.c "Data Models"

A data model pairs accessors with annotations and is defined in Hugo's JSON-LD format. It could be displayed as a tree or a table.

How would the model interact with search?

(Can we find a more specific name than "data model"?)

Data models will probably be stored in a Django server with which experiences can interact.

Actions on data models

Actions on rows/nodes

1.d Storing accessors

Any data you find has an accessor. Possible mechanisms to store an accessor for use elsewhere:

2. Visualizations

To add a visualization:

Sometimes some of these steps will be automatable (like the table button on json viewer nodes).
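For illustration, a declaratively defined visualization could be little more than a viewer type plus the accessors feeding it; the shape below is invented, not the actual manifest schema:

```js
// Hypothetical visualization block: a viewer type paired with accessors
// (see the accessor sketch in section 1.a). All field names are invented.
const visualizationBlock = {
  id: 'ad-interests-table',
  type: 'table',                 // could also be 'timeline', 'map', ...
  title: 'Advertising interests',
  columns: [
    { label: 'Interest', accessor: { filePath: 'ad-interests.json', jsonPath: 'interests.name' } },
    { label: 'Added on', accessor: { filePath: 'ad-interests.json', jsonPath: 'interests.addedAt' } }
  ]
};
```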

3. Data export

A wizard allows the user to build a consent form. A first version could simply let them edit the JSON we currently use to define the consent and what data is exported.
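As a purely hypothetical illustration of the JSON such a wizard would edit (field names invented; the actual configuration may differ):

```js
// Invented shape of a consent/export configuration: which visualizations the
// participant can choose to share, and where the export goes.
const consentConfig = {
  title: 'Share your results with the workshop',
  description: 'Tick the visualizations you agree to share.',
  items: [
    { id: 'ad-interests-table', label: 'Your advertising interests', selected: false },
    { id: 'locations-map', label: 'A map of the places you visited', selected: false }
  ],
  destination: 'workshop-collective'
};
```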

During the workshop we need mechanisms for sharing data

4. Export/import experiences as json

Any visualization (or tab, or block) of an experience is completely defined by a JSON file that can be downloaded. Uploading that JSON adds a tab to an experience.

An experience (or manifest), consisting of a configuration and several blocks, should be up/downloadable in the same way.

The same goes for a complete website.

These JSON configurations could also be used to create a website, permanently add an experience to a website, or add a visualization to an experience.
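Sketching the idea (the real manifest format may differ), a downloadable experience could be a single JSON document holding the configuration and the blocks:

```js
// Invented example: an experience fully described by one serializable object.
const experience = {
  slug: 'twitter-example',
  title: 'Twitter example experience',
  config: { dataPortal: 'https://twitter.com/settings/download_your_data' },
  blocks: [
    { id: 'file-explorer', type: 'file-explorer' },
    { id: 'ad-interests-table', type: 'table' }   // defined as in section 2
  ]
};

// Downloading = serializing this object; uploading it back adds the blocks
// (or the whole experience) to a website.
const downloaded = JSON.stringify(experience, null, 2);
```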

We might want to automate the creation of experiences/websites somehow. This could require a Django server, or could be based on the Git/GitHub API in a similar way to Netlify CMS.

This download/upload functionality will not be available for the custom experiences that use dedicated pipeline or visualization code.

5. Filter/Anonymize/obfuscate data

We see two methods for filtering data before it is shared (also see Hugo's comment in French below).

Method 1

The filter is hardcoded in the experience manifest and its results are displayed in a tab. In the consent form, it appears as one of the visualizations you can choose to share. It is configured as a custom pipeline whose options are a JSON-LD data model.

Method 2

The filter is used in an SQL pipeline (in what would correspond to a manifest's database.js).

This allows participants to reveal less about themselves when sharing their data, for example by making locations less precise (Thomas is working on things like this) or by omitting some data points.

Thomas' anonymization takes specific data from the user's files, stores it as tables in a database, and modifies the database. Anonymization of one table can depend on another table (if a user wants to hide anything they did in Greece, find the dates when they were in Greece in table A and delete everything at those dates in table B).

Note that the filtering would not apply to visualizations defined in terms of accessors as imagined in section 2. Instead of accessors, an SQL SELECT could be used to define a table, for which a visualization could then be created.
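As a minimal sketch of both points, the Greece example and the idea of defining a shareable table with an SQL SELECT, here is what a database.js-style step could look like, using sql.js as one possible engine (table and column names are invented):

```js
const initSqlJs = require('sql.js');

// Invented tables: 'visits' plays the role of table A, 'activities' of table B.
async function buildFilteredDatabase(visits, activities) {
  const SQL = await initSqlJs();
  const db = new SQL.Database();
  db.run('CREATE TABLE visits (date TEXT, country TEXT)');
  db.run('CREATE TABLE activities (date TEXT, detail TEXT)');
  for (const v of visits) db.run('INSERT INTO visits VALUES (?, ?)', [v.date, v.country]);
  for (const a of activities) db.run('INSERT INTO activities VALUES (?, ?)', [a.date, a.detail]);

  // Anonymization across tables: drop from B everything dated on a day spent in Greece.
  db.run(`DELETE FROM activities
          WHERE date IN (SELECT date FROM visits WHERE country = 'Greece')`);

  // An SQL SELECT then defines the table a visualization would be built on.
  return db.exec('SELECT date, detail FROM activities ORDER BY date');
}
```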

6. Possible next steps

For a first minimal way to configure a new experience on the website we need:

Advanced experience configuration

Search Improvements

Data sharing

Integration of Hugo's and Thomas' work

andreaskundig commented 2 years ago
TODO: take the following into consideration: PO-presentation-bubble-semantics-filter-2022-03-11

| symbol | meaning |
| ------ | ------- |
| C | collective |
| T1 | workshop |
| B1-4 | bubbles |
| R1 | bubble of a researcher |
| M | Map, includes semantics and data |
| S(o,r) | Semantic schema of objects and relations (JSON-LD that contains accessors, annotations, relations) |
| d | data |
| v | visualization |
| f | filter |
| E | experience |
| x | company (like Twitter) |

Among the things to add to the issue:

Amustache commented 2 years ago

I started to formally compile our models here. Feel free to complete it.

As a reminder, the aim of the lore repository is to be a knowledge base that helps keep all projects coherent.

Amustache commented 2 years ago

The possibility to create (relatively) coherent data from models could be interesting.

Amustache commented 2 years ago

Okay, here I will try to formalize a pipeline for the creation of the "data model".

Big picture: data semantics and data model

Definitions

In both cases (unknown and known data), the data contains specific values (e.g., a specific ID, a specific name, ...); the difference is whether a description of this data exists.

Model

This is linked to OP's "1.c Data Models". An example of what a model should contain could be the following:

| File name | Field name | Endpoint | Type | Description |
| --------- | ---------- | -------- | ---- | ----------- |
| account-creation-ip.js | accountId | account_creation_ip.accountCreationIp.accountId | Integer | Unique account ID for that user. |

It is very important to allow as many users as possible to benefit from the description of the data. Thus, the semantics should contain at least a high-level textual description of the endpoint (i.e., so that non-tech-savvy users can still understand), and a low-level formal description of the endpoint (i.e., so that a script can exploit the data).

Possibly, an example of the type of data might be useful to add directly into the model, but this is up for discussion.

An example of what a model should look like could be the following:

{
  "fileName": "account-creation-ip.js",
  "filePath": "account-creation-ip.js",
  "fileFormat": "application/javascript",
  "description": "What IP was used to create that account.",
  "@graph": [
    {
      "@type": "Integer",
      "fieldName": "accountId",
      "unique": "True",
      "fieldPath": "accountCreationIp/accountId",
      "description": "Unique account ID for that user."
    }
  ]
}

Of course, these are examples, and the exact premises shall be worked on and iterated upon, again and again.

Pipelines

There are two use cases for the pipeline: someone wanting to describe (new) models, and someone wanting to use a model.

  graph TD;
      Data_source-->Unknown_data;
      Data_source-->Known_data;
      Unknown_data-->Create_new_model;
      Known_data-->Enhance_existing_model;
      Create_new_model-->Model;
      Enhance_existing_model-->Model;
      Model-->Use_model;
      Known_data-->Use_model;
      Use_model-->Analysis;
      Use_model-->Visualisation;
      Use_model-->Research;
      Use_model-->...;

In both cases, the idea is to allow the model to be augmented as the data is described. Ideally, the end user (i.e. the person who wants to use the data) also has the possibility to complete or amend a model with new elements.

Basically, what is wanted is the following:

  graph TD;
      I-->II;
      II-->II;
      II-->III;
      III-->II;

I. Creating model from unknown data

  graph TD;
      Unknown_data-->JSON_format;
      JSON_format-->JSON_LD;
      JSON_LD-->Model;

  1. Parser: First, an unknown source is converted into a JSON format to make it easier to use. In most cases, this conversion will be trivial, but difficulties may exist.
  2. Description: The JSON file is supplemented with semantics (= descriptions), using the JSON-LD format.
  3. Anonymisation: We remove all the specific data, keeping only the description, and obtain in the end an agnostic JSON-LD file.
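An illustrative sketch of steps 2 and 3 (all names invented, assuming step 1 has already produced parsed JSON content):

```js
// Description and anonymisation combined: build a model entry per field,
// keeping only names, types and descriptions, never the actual values.
function buildModel(fileName, parsedContent, descriptions = {}) {
  return {
    fileName,
    '@graph': Object.entries(parsedContent).map(([fieldName, value]) => ({
      '@type': typeof value === 'number' ? 'Integer' : 'Text',
      fieldName,
      description: descriptions[fieldName] || 'TODO: describe this field'
    }))
  };
}

// buildModel('account-creation-ip.js', { accountId: 12345 },
//            { accountId: 'Unique account ID for that user.' })
// -> { fileName: '...', '@graph': [{ '@type': 'Integer', fieldName: 'accountId', ... }] }
```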

II. Enhancing model from known data

  graph TD;
      Known_data-->JSON_format;
      JSON_format-->JSON_LD;
      JSON_LD-->Model;

The operations are identical, except that the description will have to augment/correct the existing model.

III. Using a model

This may be linked to OP's "2. Visualizations".

Once a template is ready, it should be usable in many different cases. The advantage of having a standardised format is that it will be easy to transcribe the information from the template to other existing tools (e.g., pandas, ...). However, it is worth bearing in mind that appropriate conversion tools will need to be created, depending on the needs.
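As one example of such a conversion tool (a sketch under invented names), a model's @graph can drive the extraction of tabular rows from known data:

```js
// Use the fields described in a model to pull tabular rows out of parsed (known) data.
function modelToRows(model, parsedContent) {
  const fields = model['@graph'].map(f => f.fieldName);
  const records = Array.isArray(parsedContent) ? parsedContent : [parsedContent];
  return records.map(record =>
    Object.fromEntries(fields.map(name => [name, record[name]]))
  );
}

// modelToRows(model, [{ accountId: 12345, extra: 'ignored' }]) -> [{ accountId: 12345 }]
```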

cont'd

Amustache commented 2 years ago

@pdehaye : "The goal is not necessarily for the person who has the data to be able to do everything without programming; it is rather that the person who has the data should be able to find someone with a set of skills (programming) to reach the goal they want."

Amustache commented 2 years ago

Constraints and steps

Step 0: Getting the data

This step is optional but can be a good entry point.

The first part of obtaining data to exploit is to provide easy access to it. There are websites, such as JustGetMyData, which already reference methods for retrieving data, but a dedicated portal allowing the user to query the available data may be interesting. We can add more information, like for instance from JustWhatsTheData, so that the portal is complete.

An example of a prototype can be seen in the following picture:

[image: portal prototype mockup]

From this portal, once people have obtained their data, we can propose that they analyse it locally and potentially share their results with a collective or more widely.

Step 1: Loading tool for data tagging

flowchart LR
A(Receiving data)-->F{Need converting to JSON?}
F-->|Yes| G{Existing converter?}
F-->|No| H{Existing model?}
G-->|Yes| B[Load converter]
G-->|No| I(Redirect to Issues/PR/form)
B-->C[Convert to JSON]
C-->H
H-->|Yes| D[Load model]
H-->|No| E[Create model]
D-->J[Tagging data]
E-->J
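A sketch of that flow in code, with the converter and model registries invented for illustration:

```js
// Decide how to load incoming data for tagging, following the flowchart above.
function loadForTagging(data, { converters, models }) {
  let json = data.content;
  if (data.format !== 'application/json') {
    const converter = converters[data.format];
    if (!converter) {
      return { status: 'redirect', to: 'issues/PR/form' };   // no converter exists yet
    }
    json = converter(data.content);                          // convert to JSON
  }
  // Load the existing model if one exists, otherwise start a new (empty) one.
  const model = models[data.fileName] || { fileName: data.fileName, '@graph': [] };
  return { status: 'ready-for-tagging', json, model };
}
```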

Step 2: Loading model and data

For this step, it will be necessary to check which conversions of the model are needed, depending on the use case.

pdehaye commented 2 years ago

@pdehaye : "The goal is not necessarily for the person who has the data to be able to do everything without programming; it is rather that the person who has the data should be able to find someone with a set of skills (programming) to reach the goal they want."

Note that this someone can be Hestia.ai, but in a different mode: no longer as orchestrator, but as a user of the orchestrator's service, receiving raw data and structuring it from there ("schematizer"). In my opinion this shows that we should first structure ourselves internally around the APIs we would want to externalize.

This would then allow us, in a second phase, to position ourselves efficiently in embedded mode alongside the POCs we push first.

Amustache commented 2 years ago

Today's discussion:

Subtasks

The idea is to have something at a really high specification level that will generate Django models, which can then be used to generate whatever is needed afterwards and can also be used with JS.

Side note: maybe pin this issue, as it is kind of a "master thread" for the next steps.

Amustache commented 2 years ago

Discussion with @andreaskundig @wilderino @Amustache

Filtering methods

Method 1 (Filter defined by default upstream)

Here, the idea is to apply a filter before using the data and transforming it into a table.

Method 2 (Filters applied when the database is created)

Here, the idea is to apply a filter after the global table has been created in the experience.

Steps (mixed)

  1. Define the data model that specifies the data to use.
  2. Go from the data model to a table.
  3. On top of that, have a filter to define (a minimal sketch follows below).
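A minimal sketch combining the three steps, reusing the invented helpers from the comments above (modelToRows and an sql.js database):

```js
// 1. the data model selects the data, 2. it becomes a table, 3. a filter is applied on top.
function prepareShareableTable(model, parsedContent, filterSql, db) {
  const rows = modelToRows(model, parsedContent);              // steps 1 and 2
  db.run('CREATE TABLE shared (accountId INTEGER)');           // single column for brevity;
  for (const row of rows) {                                    // a real version would derive
    db.run('INSERT INTO shared VALUES (?)', [row.accountId]);  // columns from the model
  }
  db.run(filterSql);                                           // step 3: the filter defined on top
  return db.exec('SELECT * FROM shared');
}
```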
Amustache commented 2 years ago

I forgot to reference this here, but https://github.com/hestiaAI/tools/blob/main/ApplePrivacyReport/mockup.ipynb may be relevant to that discussion.