Closed andreaskundig closed 2 years ago
TODO: take the following into consideration:

| Symbol | Meaning |
|---|---|
| C | collective |
| T1 | workshop |
| B1-4 | bubbles |
| R1 | bubble of a researcher |
| M | Map, includes semantics and data |
| S(o,r) | Semantic schema of objects and relations (JSON-LD that contains accessors, annotations, relations) |
| d | data |
| v | visualization |
| f | filter |
| E | experience |
| x | company (like Twitter) |
Among the things to add to the issue:
Possibility to create (relatively) coherent data from models could be interesting
Okay, in there I will try to formalize a pipeline for the "data model" creation
In both cases (unknown and known data), the data itself contains specific values (e.g., a specific ID, a specific name, ...); the difference lies in whether those values come with a description.
This is linked to OP's "1.c Data Models". An example of what a model should contain could be the following:

| File name | Field name | Endpoint | Type | Description |
|---|---|---|---|---|
| account-creation-ip.js | accountId | account_creation_ip.accountCreationIp.accountId | Integer | Unique account ID for that user. |
It is very important to allow as many users as possible to benefit from the description of the data. Thus, the semantics should contain at least a high-level textual description of the endpoint (i.e., so that non-tech-savvy users can still understand), and a low-level formal description of the endpoint (i.e., so that a script can exploit the data).
Eventually, an example of the type of data might be useful to add directly into the model, but this is up for discussion.
An example of what a model should look like could be the following:
```json
{
  "fileName": "account-creation-ip.js",
  "filePath": "account-creation-ip.js",
  "fileFormat": "application/javascript",
  "description": "What IP was used to create that account.",
  "@graph": [
    {
      "@type": "Integer",
      "fieldName": "accountId",
      "unique": "True",
      "fieldPath": "accountCreationIp/accountId",
      "description": "Unique account ID for that user."
    }
  ]
}
```
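To make the idea concrete, here is a small sketch (helper names are hypothetical, not an agreed API) of how a consumer could load such a model and index its fields by `fieldName` so that accessors and descriptions can be looked up quickly:

```python
import json

# The model example from above, with valid JSON syntax.
MODEL = json.loads("""
{
  "fileName": "account-creation-ip.js",
  "filePath": "account-creation-ip.js",
  "fileFormat": "application/javascript",
  "description": "What IP was used to create that account.",
  "@graph": [
    {
      "@type": "Integer",
      "fieldName": "accountId",
      "unique": "True",
      "fieldPath": "accountCreationIp/accountId",
      "description": "Unique account ID for that user."
    }
  ]
}
""")

def index_fields(model):
    """Map each fieldName in the model's @graph to its full node."""
    return {node["fieldName"]: node for node in model.get("@graph", [])}

fields = index_fields(MODEL)
print(fields["accountId"]["fieldPath"])  # accountCreationIp/accountId
```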
Of course, these are only examples; the exact premises will be refined and iterated upon.
There are two cases of use of the pipeline: the case of someone wanting to describe (new) models, and the case of someone wanting to use a model.
```mermaid
graph TD;
    Data_source-->Unknown_data;
    Data_source-->Known_data;
    Unknown_data-->Create_new_model;
    Known_data-->Enhance_existing_model;
    Create_new_model-->Model;
    Enhance_existing_model-->Model;
    Model-->Use_model;
    Known_data-->Use_model;
    Use_model-->Analysis;
    Use_model-->Visualisation;
    Use_model-->Research;
    Use_model-->...;
```
In both cases, the idea is to allow the model to be augmented as the data is described. Ideally, the end user (i.e. the person who wants to use the data) also has the possibility to complete or amend a model with new elements.
Basically, what is wanted is the following:
```mermaid
graph TD;
    I-->II;
    II-->II;
    II-->III;
    III-->II;
```
```mermaid
graph TD;
    Unknown_data-->JSON_format;
    JSON_format-->JSON_LD;
    JSON_LD-->Model;
```

```mermaid
graph TD;
    Known_data-->JSON_format;
    JSON_format-->JSON_LD;
    JSON_LD-->Model;
```
The operations are identical, except that the description will have to augment/correct the existing model.
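For the known-data branch, "augmenting/correcting the existing model" could look roughly like the following sketch, which merges new field descriptions into a model's `@graph` by matching on `fieldName` (merge strategy and names are assumptions; real merging would need conflict resolution and provenance tracking):

```python
def augment_model(model, new_nodes):
    """Merge new field descriptions into a model's @graph.

    Nodes matched by fieldName are updated in place (correction);
    unknown fields are appended (augmentation)."""
    by_name = {n["fieldName"]: n for n in model.get("@graph", [])}
    for node in new_nodes:
        if node["fieldName"] in by_name:
            by_name[node["fieldName"]].update(node)
        else:
            model.setdefault("@graph", []).append(node)
    return model

model = {"@graph": [{"fieldName": "accountId", "@type": "Integer"}]}
augment_model(model, [
    {"fieldName": "accountId", "description": "Unique account ID."},  # corrects
    {"fieldName": "userIp", "@type": "String"},                       # augments
])
print(len(model["@graph"]))  # 2
```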
This may be linked to OP's "2. Visualizations"
Once a template is ready, it should be usable in many different cases. The advantage of having a standardised format is that it will be easy to transcribe the information from the template to other existing tools (e.g., pandas, ...). However, it is worth bearing in mind that appropriate conversion tools will need to be created, depending on your needs.
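As an illustration of such a conversion tool, a model's `fieldPath` accessors could be used to project raw JSON records into flat rows; the sketch below stays dependency-free, but the resulting list of dicts is exactly what e.g. `pandas.DataFrame(rows)` would accept (function names are illustrative):

```python
def get_path(record, path):
    """Follow a slash-separated fieldPath into a nested JSON record."""
    for key in path.split("/"):
        record = record[key]
    return record

def to_rows(records, model_graph):
    """Project raw records into rows keyed by the model's fieldName."""
    return [
        {node["fieldName"]: get_path(r, node["fieldPath"]) for node in model_graph}
        for r in records
    ]

graph = [{"fieldName": "accountId", "fieldPath": "accountCreationIp/accountId"}]
records = [{"accountCreationIp": {"accountId": 42}}]
print(to_rows(records, graph))  # [{'accountId': 42}]
```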
cont'd
@pdehaye: "The goal is not necessarily for the person who has the data to be able to do everything without programming; it is rather for the person who has the data to be able to find someone with the right set of skills (programming) to reach the goal they want."
This step is optional but can be a good entry point.
The first part of obtaining data to exploit is to provide easy access to it. There are websites, such as JustGetMyData, which already reference methods for retrieving data, but a dedicated portal allowing the user to query the available data may be interesting. We can add more information, like for instance from JustWhatsTheData, so that the portal is complete.
An example of a prototype can be found in the following picture:
From this portal, we can offer people, once they have obtained their data, the option to analyse it locally and potentially share their results with a collective or beyond.
```mermaid
flowchart LR
    A(Receiving data)-->F{Need converting to JSON?}
    F-->|Yes| G{Existing converter?}
    F-->|No| H{Existing model?}
    G-->|Yes| B[Load converter]
    G-->|No| I(Redirect to Issues/PR/form)
    B-->C[Convert to JSON]
    C-->H
    H-->|Yes| D[Load model]
    H-->|No| E[Create model]
    D-->J[Tagging data]
    E-->J
```
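The decision flow in the chart above can be sketched as a small function (step names are illustrative, mirroring the diagram's nodes):

```python
def ingest(data_is_json, has_converter, has_model):
    """Return the sequence of steps taken for incoming data,
    following the ingestion flowchart above."""
    steps = []
    if not data_is_json:
        if not has_converter:
            return ["redirect to Issues/PR/form"]
        steps += ["load converter", "convert to JSON"]
    steps.append("load model" if has_model else "create model")
    steps.append("tag data")
    return steps

print(ingest(data_is_json=False, has_converter=True, has_model=True))
# ['load converter', 'convert to JSON', 'load model', 'tag data']
```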
For this step, it will be necessary to determine which conversions of the model are required, depending on the needs.
@pdehaye: "The goal is not necessarily for the person who has the data to be able to do everything without programming; it is rather for the person who has the data to be able to find someone with the right set of skills (programming) to reach the goal they want."
Note that this someone can be Hestia.ai, but in another mode: no longer as the orchestrator, but as a user of the orchestrator's service, receiving raw data and structuring it from there ("schematiser"). In my opinion, this shows that we should first structure ourselves internally around the APIs we would want to externalise.
That would then allow us, in a second phase, to position ourselves efficiently in embedded mode alongside the POCs we push first.
Today's discussion:
Subtasks
The idea is to have something at a really high specification level that will generate Django models, which can then be used to generate whatever is needed afterwards and can also be used with JS.
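One way to read "high specification level → Django models" is code generation from the JSON-LD description. A minimal sketch follows; the type mapping and naming conventions are assumptions, not a settled design:

```python
# Hypothetical mapping from model "@type" values to Django field classes.
FIELD_TYPES = {
    "Integer": "models.IntegerField()",
    "String": "models.CharField(max_length=255)",
}

def django_model_source(name, graph):
    """Emit Django model source code for a JSON-LD data model's @graph."""
    lines = [f"class {name}(models.Model):"]
    for node in graph:
        field = FIELD_TYPES.get(node["@type"], "models.TextField()")
        lines.append(f"    {node['fieldName']} = {field}")
    return "\n".join(lines)

src = django_model_source("AccountCreationIp", [
    {"fieldName": "accountId", "@type": "Integer"},
])
print(src)
# class AccountCreationIp(models.Model):
#     accountId = models.IntegerField()
```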
Sidenote, maybe pin that issue, as it is kind of a "master thread" for the next steps.
Discussion with @andreaskundig @wilderino @Amustache
Here, the idea is to apply a filter before using the data and transforming it into a table.
Here, the idea is to apply a filter after creating the global table in the experience.
I forgot to reference this here, but https://github.com/hestiaAI/tools/blob/main/ApplePrivacyReport/mockup.ipynb may be relevant to that discussion.
Overview
This issue describes a "vision" of the features we might want to develop. It tries to summarize and integrate discussions with Paul-Olivier, Charles, François, Andreas, Hugo and Thomas.
It is structured around the experience builder, which seemed the most important goal at first, but includes thoughts about how to incorporate Hugo and Thomas' work, and other ideas.
The experience builder allows an animator external to hestia to set up experiences for a workshop with little help from hestia and as little coding as possible.
Defining an experience involves:
A cross-cutting concern is
We conclude on implementation thoughts
1. Finding relevant data
1.a The file explorer
Files are in a tree and can be opened in a file viewer that shows you "nodes" of data that each have an accessor
1.b Search
Look in files (all, or a subset) for occurrences of a string, a regex. Restrict results by date and location ranges. (See how Thomas restricts dates and locations).
The subset of files being searched could be defined by
(See implementation of search by type in the generic date/location viewers).
Search results are displayed in a table; dates/locations could additionally be displayed as a timeline/map.
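The date restriction on search results could work roughly as follows (entry structure and field names are assumptions for illustration):

```python
import re
from datetime import date

def search(entries, pattern, start=None, end=None):
    """Return entries whose text matches the regex, optionally
    restricted to a [start, end] date range."""
    rx = re.compile(pattern)
    return [
        e for e in entries
        if rx.search(e["text"])
        and (start is None or e["date"] >= start)
        and (end is None or e["date"] <= end)
    ]

entries = [
    {"date": date(2021, 5, 1), "text": "login from 10.0.0.1"},
    {"date": date(2021, 8, 1), "text": "login from 10.0.0.2"},
]
print(len(search(entries, r"10\.0\.0\.\d+", end=date(2021, 6, 1))))  # 1
```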
Actions on each search result:
1.c "Data Models"
A data model pairs accessors with annotations and is defined in Hugo's json-ld format. It could be displayed as a tree, or a table
How would the model interact with search ?
(Can we find a more specific name than data model ?)
Data models will probably be stored in a Django server with which experiences can interact.
Actions on data models
Actions on rows/nodes
1.d Storing accessors
Any data you find has an accessor. Possible mechanisms to store an accessor for use elsewhere:
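As a sketch of one possible storage mechanism, an accessor could be kept as a plain string of the (assumed) form `file#path/to/node` and resolved later against in-memory file contents; the format and helper name are hypothetical:

```python
def resolve_accessor(files, accessor):
    """Resolve a stored accessor of the form 'file#path/to/node'
    against a dict of parsed file contents. Numeric path segments
    index into arrays."""
    file_name, _, path = accessor.partition("#")
    node = files[file_name]
    for key in path.split("/"):
        node = node[int(key)] if key.isdigit() else node[key]
    return node

files = {"ip.json": {"accountCreationIp": {"accountId": 42}}}
clipboard = ["ip.json#accountCreationIp/accountId"]  # stored for reuse elsewhere
print(resolve_accessor(files, clipboard[0]))  # 42
```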
2. Visualizations
To add a visualization:
Sometimes some of these steps will be automatable (like the table button on json viewer nodes).
3. Data export
A wizard allows building a consent form. A first version could simply be editing the JSON we currently use to define the consent and what data is exported.
During the workshop we need mechanisms for sharing data
4. Export/import experiences as json
Any visualization (or tab, or block) of an experience is completely defined by a json that can be downloaded. Uploading the json adds a tab to an experience.
An experience (or manifest), consisting of a configuration and several blocks, should be up/downloadable in the same way.
The same goes for a complete website.
These json configurations could also be used to create a website, permanently add an experience to a website, or add a visualization to an experience.
We might want to automate the creation of experiences/websites somehow. This could require a django server, or could be based on the git/github api in a similar way as netlifyCMS.
This download/upload functionality will not be available for the custom experiences that use dedicated pipeline or visualization code.
5. Filter/Anonymize/obfuscate data
We see two methods for filtering data before it is shared (also see Hugo's comment below).
Method 1
The filter is hardcoded in the experience manifest and its results are displayed in a tab. In the consent form, it appears as one of the visualizations you can choose to share. It is configured as a custom pipeline whose options are a JSON-LD data model.
Method 2
The filter is used in an sql pipeline (in what would correspond to a manifest's database.js).
This allows participants to reveal less about themselves when sharing their data, for example by making locations less precise (Thomas is working on things like these), or omitting some data points.
Thomas' anonymization takes specific data from the user's files, stores it as tables in a database and modifies the database. Anonymization of one table can depend on another table (if a user wants to hide anything they did in Greece, find the dates when they were in Greece in table A, then delete everything at these dates in table B).
Note that the filtering would not apply to visualizations defined in terms of accessors as imagined in 2. Instead of accessors an sql select could be used to define a table, for which a visualization could then be created.
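The Greece example above maps directly onto a cross-table SQL delete. A minimal sketch with `sqlite3` (the schema and table names are invented for illustration, not Thomas' actual code):

```python
import sqlite3

# Two invented tables: where the user was (A) and what they did (B).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE locations (day TEXT, country TEXT)")
con.execute("CREATE TABLE activity (day TEXT, action TEXT)")
con.executemany("INSERT INTO locations VALUES (?, ?)",
                [("2021-07-01", "Greece"), ("2021-07-02", "France")])
con.executemany("INSERT INTO activity VALUES (?, ?)",
                [("2021-07-01", "search"), ("2021-07-02", "search")])

# Hide everything done on dates when the user was in Greece.
con.execute("""
    DELETE FROM activity WHERE day IN
        (SELECT day FROM locations WHERE country = 'Greece')
""")
rows = con.execute("SELECT day FROM activity").fetchall()
print(rows)  # [('2021-07-02',)]
```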
6. possible next steps
For a first minimal way to configure a new experience on the website we need:
Advanced experience configuration
Search Improvements
Data sharing
Integration of Hugo's and Thomas' work