hestiaAI / hestialabs-experiences

HestiaLabs Data Experiences & Digipower Academy
https://digipower.academy

Overview of desirable features (experience builder, search, sharing...) #571

Closed andreaskundig closed 2 years ago

andreaskundig commented 2 years ago

Overview

This issue describes a "vision" of the features we might want to develop. It tries to summarize and integrate discussions with Paul-Olivier, Charles, François, Andreas, Hugo and Thomas.

It is structured around the experience builder, which seemed the most important goal at first, but includes thoughts about how to incorporate Hugo and Thomas' work, and other ideas.

The experience builder allows a facilitator external to Hestia to set up experiences for a workshop with little help from Hestia and as little coding as possible.

Defining an experience involves:

  1. finding relevant data (and storing its location with accessors)
  2. defining visualizations
  3. defining the data export
  4. exporting the experience as JSON (for sharing or deploying online)

A cross-cutting concern is

  1. obfuscating the data

We conclude with implementation thoughts

  1. possible next steps

1. Finding relevant data

1.a The file explorer

Files are shown in a tree and can be opened in a file viewer that shows "nodes" of data, each of which has an accessor.
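To make the accessor idea concrete, here is a minimal sketch, assuming an accessor is essentially a file path plus a path into the parsed content of that file (all names are hypothetical, not the actual API):

```js
// Hypothetical illustration of an accessor: it points at one "node" of data
// inside one of the files of a data export.
const accessor = {
  filePath: 'data/account-creation-ip.js',   // which file in the export
  jsonPath: 'accountCreationIp.accountId'    // which node inside the parsed file
};

// Resolve a simple dot/bracket path like "a.b[3].c" against a parsed object.
function resolve(parsed, path) {
  const keys = path.match(/[^.[\]]+/g) || [];
  return keys.reduce((node, key) => (node == null ? undefined : node[key]), parsed);
}

// resolve({ accountCreationIp: { accountId: 12345 } }, accessor.jsonPath) -> 12345
```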

1.b Search

Look in files (all of them, or a subset) for occurrences of a string or a regex. Restrict results by date and location ranges. (See how Thomas restricts dates and locations.)

The subset of files being searched could be defined by

(See implementation of search by type in the generic date/location viewers).

Search results are displayed in a table; dates/locations could additionally be displayed as a timeline/map.

|     | value | JSON path | file name | tags |
| --- | ----- | --------- | --------- | ---- |
| [x] | foo | bar.baz[3] | a.json | François, Twitter |
| [ ] | 2022-03-08 | foo.b | cd.csv | Andréas, Facebook |
| [x] | {lat: 123, long: 234, val: "mim"} | bafoo[5] | bc.json | Saddam, Tinder |
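A very rough sketch of what such a search over already-parsed files could look like (function and field names are hypothetical; date/location range restriction is left out):

```js
// Hypothetical sketch: search parsed files for a string or regex and return
// rows like the table above (value, JSON path, file name, tags).
function searchFiles(files, query) {
  const matches = [];
  const test = query instanceof RegExp
    ? v => query.test(String(v))
    : v => String(v).includes(query);

  const walk = (node, path, file) => {
    if (node !== null && typeof node === 'object') {
      for (const [key, child] of Object.entries(node)) {
        const childPath = Array.isArray(node) ? `${path}[${key}]` : `${path}.${key}`;
        walk(child, childPath, file);
      }
    } else if (test(node)) {
      matches.push({ value: node, jsonPath: path, fileName: file.name, tags: file.tags });
    }
  };

  for (const file of files) walk(file.content, '$', file);
  return matches;
}

// searchFiles([{ name: 'a.json', tags: ['Twitter'], content: { bar: { baz: [0, 1, 2, 'foo'] } } }], 'foo')
// -> [{ value: 'foo', jsonPath: '$.bar.baz[3]', fileName: 'a.json', tags: ['Twitter'] }]
```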

Actions on each search result:

1.c "Data Models"

A data model pairs accessors with annotations and is defined in Hugo's JSON-LD format. It could be displayed as a tree or a table.

How would the model interact with search?

(Can we find a more specific name than "data model"?)

Data models will probably be stored in a Django server with which experiences can interact.

Actions on data models

Actions on rows/nodes

1.d Storing accessors

Any data you find has an accessor. Possible mechanisms to store an accessor for use elsewhere:

2. Visualizations

To add a visualization:

Sometimes some of these steps will be automatable (like the table button on json viewer nodes).
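For illustration, a declaratively defined visualization could be little more than a viewer type plus the accessors feeding it; the shape below is invented, not the actual manifest schema:

```js
// Hypothetical visualization block: a viewer type paired with accessors
// (see the accessor sketch in section 1.a). All field names are invented.
const visualizationBlock = {
  id: 'ad-interests-table',
  type: 'table',                 // could also be 'timeline', 'map', ...
  title: 'Advertising interests',
  columns: [
    { label: 'Interest', accessor: { filePath: 'ad-interests.json', jsonPath: 'interests.name' } },
    { label: 'Added on', accessor: { filePath: 'ad-interests.json', jsonPath: 'interests.addedAt' } }
  ]
};
```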

3. Data export

A wizard allows the user to build a consent form. A first version could simply let them edit the JSON we currently use to define the consent and what data is exported.
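As a purely hypothetical illustration of the JSON such a wizard would edit (field names invented; the actual configuration may differ):

```js
// Invented shape of a consent/export configuration: which visualizations the
// participant can choose to share, and where the export goes.
const consentConfig = {
  title: 'Share your results with the workshop',
  description: 'Tick the visualizations you agree to share.',
  items: [
    { id: 'ad-interests-table', label: 'Your advertising interests', selected: false },
    { id: 'locations-map', label: 'A map of the places you visited', selected: false }
  ],
  destination: 'workshop-collective'
};
```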

During the workshop we need mechanisms for sharing data

4. Export/import experiences as json

Any visualization (or tab, or block) of an experience is completely defined by a JSON file that can be downloaded. Uploading that JSON adds a tab to an experience.

An experience (or manifest), consisting of a configuration and several blocks, should be up/downloadable in the same way.

The same goes for a complete website.

These JSON configurations could also be used to create a website, permanently add an experience to a website, or add a visualization to an experience.
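Sketching the idea (the real manifest format may differ), a downloadable experience could be a single JSON document holding the configuration and the blocks:

```js
// Invented example: an experience fully described by one serializable object.
const experience = {
  slug: 'twitter-example',
  title: 'Twitter example experience',
  config: { dataPortal: 'https://twitter.com/settings/download_your_data' },
  blocks: [
    { id: 'file-explorer', type: 'file-explorer' },
    { id: 'ad-interests-table', type: 'table' }   // defined as in section 2
  ]
};

// Downloading = serializing this object; uploading it back adds the blocks
// (or the whole experience) to a website.
const downloaded = JSON.stringify(experience, null, 2);
```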

We might want to automate the creation of experiences/websites somehow. This could require a Django server, or could be based on the Git/GitHub API in a similar way to Netlify CMS.

This download/upload functionality will not be available for the custom experiences that use dedicated pipeline or visualization code.

5. Filter/Anonymize/obfuscate data

We see two methods for filtering data before it is shared (also see Hugo's comment in French below).

Method 1

The filter is hardcoded in the experience manifest and its results are displayed in a tab. In the consent form, it appears as one of the visualizations you can choose to share. It is configured as a custom pipeline whose options are a JSON-LD data model.

Method 2

The filter is used in an SQL pipeline (in what would correspond to a manifest's database.js).

This allows participants to reveal less about themselves when sharing their data, for example by making locations less precise (Thomas is working on things like this) or by omitting some data points.

Thomas' anonymization takes specific data from the user's files, stores it as tables in a database, and modifies the database. Anonymization of one table can depend on another table (if a user wants to hide anything they did in Greece, find the dates when they were in Greece in table A and delete everything at those dates in table B).

Note that the filtering would not apply to visualizations defined in terms of accessors as imagined in section 2. Instead of accessors, an SQL SELECT could be used to define a table, for which a visualization could then be created.
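As a minimal sketch of both points, the Greece example and the idea of defining a shareable table with an SQL SELECT, here is what a database.js-style step could look like, using sql.js as one possible engine (table and column names are invented):

```js
const initSqlJs = require('sql.js');

// Invented tables: 'visits' plays the role of table A, 'activities' of table B.
async function buildFilteredDatabase(visits, activities) {
  const SQL = await initSqlJs();
  const db = new SQL.Database();
  db.run('CREATE TABLE visits (date TEXT, country TEXT)');
  db.run('CREATE TABLE activities (date TEXT, detail TEXT)');
  for (const v of visits) db.run('INSERT INTO visits VALUES (?, ?)', [v.date, v.country]);
  for (const a of activities) db.run('INSERT INTO activities VALUES (?, ?)', [a.date, a.detail]);

  // Anonymization across tables: drop from B everything dated on a day spent in Greece.
  db.run(`DELETE FROM activities
          WHERE date IN (SELECT date FROM visits WHERE country = 'Greece')`);

  // An SQL SELECT then defines the table a visualization would be built on.
  return db.exec('SELECT date, detail FROM activities ORDER BY date');
}
```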

6. Possible next steps

For a first minimal way to configure a new experience on the website we need:

Advanced experience configuration

Search Improvements

Data sharing

Integration of Hugo's and Thomas' work

andreaskundig commented 2 years ago
TODO: take the following into consideration: PO-presentation-bubble-semantics-filter-2022-03-11

| symbol | meaning |
| ------ | ------- |
| C | collective |
| T1 | workshop |
| B1-4 | bubbles |
| R1 | bubble of a researcher |
| M | Map, includes semantics and data |
| S(o,r) | Semantic schema of objects and relations (JSON-LD that contains accessors, annotations, relations) |
| d | data |
| v | visualization |
| f | filter |
| E | experience |
| x | company (like Twitter) |

Among the things to add to the issue:

Amustache commented 2 years ago

I started to formally compile our models here. Feel free to complete it.

As a reminder, the aim of the lore repository is to be a knowledge base that helps keep all projects coherent.

Amustache commented 2 years ago

The possibility to create (relatively) coherent data from models could be interesting.

Amustache commented 2 years ago

Okay, here I will try to formalize a pipeline for the creation of the "data model".

Big picture: data semantics and data model

Definitions

In both cases (unknown and known data), the data contains specific values (e.g., a specific ID, a specific name, ...); the difference is whether a description of this data exists.

Model

This is linked to OP's "1.c Data Models". An example of what a model should contain could be the following:

| File name | Field name | Endpoint | Type | Description |
| --------- | ---------- | -------- | ---- | ----------- |
| account-creation-ip.js | accountId | account_creation_ip.accountCreationIp.accountId | Integer | Unique account ID for that user. |

It is very important to allow as many users as possible to benefit from the description of the data. Thus, the semantics should contain at least a high-level textual description of the endpoint (i.e., so that non-tech-savvy users can still understand), and a low-level formal description of the endpoint (i.e., so that a script can exploit the data).

Possibly, an example of the type of data might be useful to add directly into the model, but this is up for discussion.

An example of what a model should look like could be the following:

{
  "fileName": "account-creation-ip.js",
  "filePath": "account-creation-ip.js",
  "fileFormat": "application/javascript",
  "description": "What IP was used to create that account.",
  "@graph": [
    {
      "@type": "Integer",
      "fieldName": "accountId",
      "unique": "True",
      "fieldPath": "accountCreationIp/accountId",
      "description": "Unique account ID for that user."
    }
  ]
}

Of course, these are examples, and the exact premises shall be worked on and iterated upon, again and again.

Pipelines

There are two use cases for the pipeline: someone wanting to describe (new) models, and someone wanting to use a model.

  graph TD;
      Data_source-->Unknown_data;
      Data_source-->Known_data;
      Unknown_data-->Create_new_model;
      Known_data-->Enhance_existing_model;
      Create_new_model-->Model;
      Enhance_existing_model-->Model;
      Model-->Use_model;
      Known_data-->Use_model;
      Use_model-->Analysis;
      Use_model-->Visualisation;
      Use_model-->Research;
      Use_model-->...;

In both cases, the idea is to allow the model to be augmented as the data is described. Ideally, the end user (i.e. the person who wants to use the data) also has the possibility to complete or amend a model with new elements.

Basically, what is wanted is the following:

  graph TD;
      I-->II;
      II-->II;
      II-->III;
      III-->II;

I. Creating model from unknown data

  graph TD;
      Unknown_data-->JSON_format;
      JSON_format-->JSON_LD;
      JSON_LD-->Model;

  1. Parser: First, an unknown source is converted into a JSON format to make it easier to use. In most cases, this conversion will be trivial, but difficulties may exist.
  2. Description: The JSON file is supplemented with semantics (= descriptions), using the JSON-LD format.
  3. Anonymisation: We remove all the specific data, keeping only the description, and obtain in the end an agnostic JSON-LD file.
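An illustrative sketch of steps 2 and 3 (all names invented, assuming step 1 has already produced parsed JSON content):

```js
// Description and anonymisation combined: build a model entry per field,
// keeping only names, types and descriptions, never the actual values.
function buildModel(fileName, parsedContent, descriptions = {}) {
  return {
    fileName,
    '@graph': Object.entries(parsedContent).map(([fieldName, value]) => ({
      '@type': typeof value === 'number' ? 'Integer' : 'Text',
      fieldName,
      description: descriptions[fieldName] || 'TODO: describe this field'
    }))
  };
}

// buildModel('account-creation-ip.js', { accountId: 12345 },
//            { accountId: 'Unique account ID for that user.' })
// -> { fileName: '...', '@graph': [{ '@type': 'Integer', fieldName: 'accountId', ... }] }
```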

II. Enhancing model from known data

  graph TD;
      Known_data-->JSON_format;
      JSON_format-->JSON_LD;
      JSON_LD-->Model;

The operations are identical, except that the description will have to augment/correct the existing model.

III. Using a model

This may be linked to OP's "2. Visualizations".

Once a template is ready, it should be usable in many different cases. The advantage of having a standardised format is that it will be easy to transcribe the information from the template to other existing tools (e.g., pandas, ...). However, it is worth bearing in mind that appropriate conversion tools will need to be created, depending on the needs.
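As one example of such a conversion tool (a sketch under invented names), a model's @graph can drive the extraction of tabular rows from known data:

```js
// Use the fields described in a model to pull tabular rows out of parsed (known) data.
function modelToRows(model, parsedContent) {
  const fields = model['@graph'].map(f => f.fieldName);
  const records = Array.isArray(parsedContent) ? parsedContent : [parsedContent];
  return records.map(record =>
    Object.fromEntries(fields.map(name => [name, record[name]]))
  );
}

// modelToRows(model, [{ accountId: 12345, extra: 'ignored' }]) -> [{ accountId: 12345 }]
```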

cont'd

Amustache commented 2 years ago

@pdehaye : "The goal is not necessarily for the person who has the data to be able to do everything without programming; it is rather that the person who has the data should be able to find someone with a set of skills (programming) to reach the goal they want."

Amustache commented 2 years ago

Constraints and steps

Step 0: Getting the data

This step is optional but can be a good entry point.

The first part of obtaining data to exploit is to provide easy access to it. There are websites, such as JustGetMyData, which already reference methods for retrieving data, but a dedicated portal allowing the user to query the available data may be interesting. We can add more information, like for instance from JustWhatsTheData, so that the portal is complete.

An example of a prototype can be seen in the following picture:

[image: portal prototype mockup]

From this portal, once people have obtained their data, we can propose that they analyse it locally and potentially share their results with a collective or more widely.

Step 1: Loading tool for data tagging

flowchart LR
A(Receiving data)-->F{Need converting to JSON?}
F-->|Yes| G{Existing converter?}
F-->|No| H{Existing model?}
G-->|Yes| B[Load converter]
G-->|No| I(Redirect to Issues/PR/form)
B-->C[Convert to JSON]
C-->H
H-->|Yes| D[Load model]
H-->|No| E[Create model]
D-->J[Tagging data]
E-->J
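A sketch of that flow in code, with the converter and model registries invented for illustration:

```js
// Decide how to load incoming data for tagging, following the flowchart above.
function loadForTagging(data, { converters, models }) {
  let json = data.content;
  if (data.format !== 'application/json') {
    const converter = converters[data.format];
    if (!converter) {
      return { status: 'redirect', to: 'issues/PR/form' };   // no converter exists yet
    }
    json = converter(data.content);                          // convert to JSON
  }
  // Load the existing model if one exists, otherwise start a new (empty) one.
  const model = models[data.fileName] || { fileName: data.fileName, '@graph': [] };
  return { status: 'ready-for-tagging', json, model };
}
```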

Step 2: Loading model and data

For this step, it will be necessary to check which conversions of the model are needed, depending on the use case.

pdehaye commented 2 years ago

@pdehaye : "The goal is not necessarily for the person who has the data to be able to do everything without programming; it is rather that the person who has the data should be able to find someone with a set of skills (programming) to reach the goal they want."

Note that this someone can be Hestia.ai, but in a different mode: no longer as orchestrator, but as a user of the orchestrator's service, receiving raw data and structuring it from there ("schematizer"). In my opinion this shows that we should first structure ourselves internally around the APIs we would want to externalize.

This would then allow us, in a second phase, to position ourselves efficiently in embedded mode alongside the POCs we push first.

Amustache commented 2 years ago

Today's discussion:

Subtasks

The idea is to have something at a really high specification level that will generate Django models, which can then be used to generate whatever is needed afterwards and can also be used with JS.

Side note: maybe pin this issue, as it is kind of a "master thread" for the next steps.

Amustache commented 2 years ago

Discussion with @andreaskundig @wilderino @Amustache

Filtering methods

Method 1 (Filter defined by default upstream)

Here, the idea is to apply a filter before using the data and transforming it into a table.

Method 2 (Filters applied when the database is created)

Here, the idea is to apply a filter after the global table has been created in the experience.

Steps (mixed)

  1. Define the data model that specifies the data to use.
  2. Go from the data model to a table.
  3. On top of that, have a filter to define (a minimal sketch follows below).
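A minimal sketch combining the three steps, reusing the invented helpers from the comments above (modelToRows and an sql.js database):

```js
// 1. the data model selects the data, 2. it becomes a table, 3. a filter is applied on top.
function prepareShareableTable(model, parsedContent, filterSql, db) {
  const rows = modelToRows(model, parsedContent);              // steps 1 and 2
  db.run('CREATE TABLE shared (accountId INTEGER)');           // single column for brevity;
  for (const row of rows) {                                    // a real version would derive
    db.run('INSERT INTO shared VALUES (?)', [row.accountId]);  // columns from the model
  }
  db.run(filterSql);                                           // step 3: the filter defined on top
  return db.exec('SELECT * FROM shared');
}
```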
Amustache commented 2 years ago

I forgot to reference this here, but https://github.com/hestiaAI/tools/blob/main/ApplePrivacyReport/mockup.ipynb may be relevant to that discussion.