The project is now hosted on 2 different repositories.

The official main repository is the IRIT Gitlab repository.

The secondary repository is the initial Github repository.

The Github repository is a mirror : it is kept synchronized with the Gitlab repository and follows its progress.

Data lake architecture POC hosted on OSIRIM

This repository is the main working repository for the data lake architecture. The goals are :

The development of this architecture will integrate a semantic dimension in the development of services and features.

Table of contents

Datalake concept

Data-oriented projects are most of the time driven by a use case. Whether it is employee data storage, reporting, anomaly detection, application operation, etc., each use case requires a specific architecture based on a database. This kind of architecture needs to be secured, well dimensioned and administered. When another use case is deployed, the whole process has to be repeated. It costs time and resources (human and server resources).

The data lake makes it possible to share the whole architecture (security, authentication, server allocation services) across any kind of data-driven project. Moreover, the solution proposed here allows any kind of data management, data analysis or reporting tool to be integrated into a single solution.

It reduces costs for projects or companies with a high volume and a high variety of data. This solution has been designed to integrate solutions that have already been deployed, such as authentication systems, database solutions or data analysis solutions. The initial resource investment is higher than for a simple database solution, but the cost of each additional use case is reduced.

Scientific paper

See :

The project : context

This project is supported by neOCampus, OSIRIM, the CNRS, the IRIT and the SMAC team in IRIT.

This project started as an internship for the 2nd year of the Master in Statistics and Decisional Computing (SID, at Université Toulouse 3 - Paul-Sabatier in Toulouse) funded by neOCampus. The project was continued through a 1-year fixed-term contract with the CNRS and the OSIRIM platform.

From March 2020 to now : The project is designed, handled and maintained by DANG Vincent-Nam (for neOCampus / IRIT / OSIRIM / CNRS / Toulouse university).

Beginning of 2021 : Modis engineers joined the project through a patronage, building on the project designed by DANG Vincent-Nam.

The project : design

Zones follow this theoretical solution (see State of the art : "Datalake : trends and perspective" paper).

alt text

The architecture design is described in the next diagram :

alt text

Current architecture : Zone-based service-oriented datalake

The architecture is divided in 6 zones :

Raw data management zone (a.k.a. landing area)

The purpose of this area is to handle, store and make available raw data. Each piece of data is stored as is, waiting to be processed and transformed into information.

This area is the core of the architecture : all data are stored here and every service of the architecture works on it. This area is composed of 1 service :

With a network-oriented vision, this zone is a buffer zone.

Metadata management zone

-> TODO : redesign this area :

As Neo4J does not seem to be the solution because of a legal problem with the Neo4J license (only the commercial license allows deployment + authentication), other solutions may be evaluated :

Process zone

This area is headed by Apache Airflow, which handles and manages every data processing workflow and job, as well as the sub-services it deploys and manages :

The deployment of a Hadoop cluster has been considered, but the idea may not be implemented or kept. Indeed, Hadoop is designed to handle large files and does not handle small files well (cf. default block size : 64 MB or 128 MB).

Processed data zone (a.k.a. consumption zone, access zone or gold zone)

The purpose of this area is to create value from data. Its role is to provide information and allow external applications to work on data. The processed data area is supposed to host any database needed by users. As no real need has been expressed, no real use case is implemented here. But some use cases have been imagined :

Services zone

This functional area includes every service to make this platform user-friendly. At this point (23/11/2020), 3 services have been designed :

(12/04/2021) Data can be consumed through the RESTful API, through direct access to a real-time database (InfluxDB, for example) or through a dedicated direct access.

REST API, interaction with code

Web GUI, interaction with interface

(Documentation first draft)

The graphical interface consists of several distinct features. (For the moment, the interface is designed mostly in French)

Upload module :

Files can be uploaded (in the "Upload" tab) to the Datalake architecture in 2 ways:

The file type (based on MIME type) must be filled in to assign the appropriate processing pipeline (see Apache Airflow) and an associated metadata model.

The metadata template can be chosen from the list of templates associated with the file type. Once a template is selected, its list of metadata is displayed and the values of these metadata can be modified. It is possible to add or remove metadata, or to modify the model. The models can be made visible to all users of the architecture or they can be kept confidential.

Upload tracking module :

Data uploads can be tracked (in the "Tracabilité" tab). It is possible to see the data being uploaded as well as the upload status, the start date and the last update date of the upload status.

It is possible to view the upload status of the latest data.

Download module :

The stored data can be downloaded (in the Download tab).

The raw data and processed data are accessible and can be selected for local download.

Data visualisation module :

A basic visualization of the data inserted in the data lake (currently time series) can be done (in the "Data Visualization" tab).

The data can be viewed in graphical form, allowing for easier selection of the data to be downloaded. They can also be visualized in a table format.

Anomalies module (feature in proof of concept stage) :

An anomaly detection function for time series has been set up; it makes it possible to visualize abnormal sensor readings on the various measurements.

Login module :

For the moment, a user has to authenticate to access the interface's features. The first access page is the login page, which requires a login and password (from Openstack Keystone).

MQTT Flow creation

The retrieval of sensor data over MQTT can be managed via the graphical interface. The flows can be visualized (in the "MQTT flows" tab). It is possible to view only the active MQTT flows, as well as all flows, active and inactive. A flow can be modified by pressing the "Modify" button. A flow can be disabled to stop listening to data.

User administration

Administrators have additional functionality with user management. It is possible to view the list of users, to see the privileges of a user and to add access to a user.

Security, authentication and monitoring zone

The purpose of this area is to let administrators monitor the whole architecture and to provide 3 levels of monitoring. The area has to be adapted to the host platform, so services may change from one deployment to another.

This area needs more work to be better designed. Prometheus could be used to monitor the Network and System levels, and other services could be used for the other levels.

Data lifecycle

This project sits squarely in the Big data world (see the "4 V's of Big data"; sometimes 5 or more V's are described). We split the data into 2 distinct groups with distinct goals, perspectives and requirements :

The main goal of batch data is to be stored. Batch data are meant to stay on disk with a long lifespan. Those data will be processed, turned into value and consumed on a different timescale than their production.

The main goal of stream data is to be processed. Stream data are meant to be processed for consumption as quickly as possible after they are generated. This data is eventually stored to be consumed again later or in another way.

The architecture handles batch data as its main goal, but it will be enhanced to handle stream data and eventually near-real time data processing, depending on implementation and tools used.

Main data life cycle : Batch data

The global data life in this architecture is described in this diagram :

alt text

Batch data can also be split into 2 distinct groups : light process and heavy process. Light processes include reading, formatting and all the processing done to make the data available. Heavy processes are processes which aim to create data from master data, such as data mining, data analysis or machine learning.

alt text

TODO: Explanation of diagram

Light process


  1. Data are inserted through the API designed in the service zone. All batch data are inserted through the RESTful API. For security reasons, the RESTful API (Flask) is accessible behind a reverse proxy. The RESTful API handles the insertion of data and metadata (into Openstack Swift and MongoDB respectively).
  2. A web trigger is raised from the Openstack Swift proxy, triggering an Apache Airflow workflow with the minimum metadata needed to retrieve the data.
  3. The Apache Airflow workflow processes the metadata (type, project or any metadata needed to branch through the workflow easily). Data is eventually read from Openstack Swift and each process is logged in the metadata zone.
  4. The last step of each workflow has to be the storage of the processed data. Intermediate results can be stored if needed, and all the data can even be stored in Openstack Swift.
  5. Once the data is processed, it can be consumed through the RESTful API from tools in the consumption zone. A direct access could be designed if needed.

Heavy process


For heavy processes, minor differences can be observed in the process area; otherwise the pipeline is the same.

  1. The data is sent to the Apache Spark cluster. Data are processed in the same way as with Apache Airflow, except that resources are allocated by Apache Spark. It can be seen as one more tool in the Apache Airflow toolbox.

Extended data life cycle : Stream data


The stream data pipeline is different because the goals are different. The first step is to define and instantiate the stream. As it is not a data input step, it is step 0.

  1. A user or administrator has to create a stream. As we can assume that each piece of data in the stream (especially in IoT) has the same metadata, the metadata are defined once for the whole stream. Each sample will be linked to this metadata. This way, Apache Airflow can instantiate the stream through Apache Spark and Apache Spark Streaming.

    1. Apache Airflow instantiates a stream lifechecker workflow scheduled at regular intervals. It will be able to update metadata (example : number of samples).
  2. The sample is sent to Apache Spark Streaming. The sample is sent directly to Apache Spark to be processed as fast as possible.

  3. 2 tasks have to be done in parallel. We have to process the data and store it in a consumption tool, which is the main objective, and store it in the raw data storage.

  4. This step depends on the implementation. Indeed, the most secure way is to define consumption through the RESTful API (and reverse proxy), but this may not be possible with the tools used, in which case a direct access will be needed.
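The step above that runs 2 tasks in parallel can be illustrated with a small plain-Python sketch. This is only a stand-in: in the architecture this work is done by Apache Spark Streaming, and every name below (`process_sample`, `handle_sample`, the two lists standing in for the consumption tool and the raw data storage) is a hypothetical placeholder, not code from the repository.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Placeholder processing: convert the raw reading to a float."""
    return {"sensor": sample["sensor"], "value": float(sample["value"])}


def handle_sample(sample, consumption_store, raw_store):
    """Process-and-publish and raw storage run as 2 parallel tasks."""

    def process_and_publish():
        # Main objective: make the processed sample available for consumption
        consumption_store.append(process_sample(sample))

    def store_raw():
        # Secondary objective: keep an untouched copy of the sample
        raw_store.append(dict(sample))

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(process_and_publish), pool.submit(store_raw)]
        for future in futures:
            future.result()  # re-raises any exception from either task


consumption, raw = [], []
handle_sample({"sensor": "t1", "value": "21.5"}, consumption, raw)
```

Waiting on both futures means a failure in either branch is surfaced instead of being silently lost, which matters if the raw copy is the only trace of a sample.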

The same pipeline is defined for near-real time, but some customization may be needed. Indeed, the time constraint is determined by the tools used and their performance. We place ourselves in near real-time, not in real-time or hard real-time. The goal is to consume the data on the same time scale as its production.

Metadata management

alt text

TODO : Explanation - Target metadata management system - currently : only MongoDB

Metadata management system

The metadata is managed in the metadata management system. This system currently consists of a NoSQL MongoDB database. This system will be enriched later by a Neo4J database (see What's coming up: metadata management system).

The main interest of the MongoDB database is the possibility to have a semi-structured data model allowing the possibility to modify, add and enrich the metadata document without constraints. There are more than 3000 meta-data models for data description and catalog construction. With the objective of building a data catalog and a data recommendation tool, it is necessary not to restrict to one data format.

Metadata management system : MongoDB

MongoDB keeps the metadata document of each piece of data and its various histories. With a maximum document size of 16 MB, each document can hold a large amount of metadata. Moreover, a large volume of data can be stored while still allowing efficient search in these data, especially via the query tool that MongoDB offers and its implementation of MapReduce.

2 databases in MongoDB have been created for the functioning of the architecture:

Stats database

The "swift" collection contains 1 unique document instantiated at the initialization of the architecture which evolves throughout the operation of the architecture. This document is built as follows:

    "_id" : MongoDB default ID, 
    "type" : "object_id_file", 
    "object_id" : last_available_swift_id 

Its purpose is to hold the identifier under which the next data should be renamed.
Designed as a simple counter at the moment, it also makes it possible to count the total number of objects kept.

This collection can be used to keep other data calculated (or not) on the life statistics of the architecture. We can think of : statistics on exchanges, a list of counters or even a list of metadata, search results to accelerate the search of future recommendation tools.

Swift database

The "Swift" database is the metadata database of each data. Each collection corresponds to a user of the database or a container in Swift. Each document is built as follows:

 "_id" : MongoDB default ID, 
 "content_type" : type of data / MIME type ,
 "data_processing" : type of data processing in Airflow ("custom" or "default") for pipeline choosing,
 "swift_user" : authenticated user that inserted the data, 
 "swift_container" : Openstack Swift container referring to project / user group,
 "swift_object_id" : id from "object_id_file" in stats database,
 "application" : description of the purpose of the data,
 "original_object_name" : original name ,
 "creation_date" : ISODate("..."),
 "last_modified" : ISODate("..."),
 "successful_operations" : [ ] : list of successful operations done on the data,
 "failed_operations" : [ ] : list of failed operations done on the data, 
 "other_data" : {...} : anything that is needed to know on the data (custom metadata inserted by user) 

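As an illustration only, the document layout above could be assembled in Python as follows. The function name `build_metadata_document` and its parameters are hypothetical (they do not come from the repository); the keys are the ones listed above, and plain `datetime` objects stand in for the ISODate values. "_id" is left out on the assumption that MongoDB generates it on insertion.

```python
from datetime import datetime, timezone


def build_metadata_document(content_type, swift_user, swift_container,
                            swift_object_id, original_object_name,
                            data_processing="default", application="",
                            other_data=None):
    """Return a metadata document with the keys listed above (hypothetical helper)."""
    if data_processing not in ("custom", "default"):
        raise ValueError('data_processing must be "custom" or "default"')
    now = datetime.now(timezone.utc)
    return {
        "content_type": content_type,
        "data_processing": data_processing,
        "swift_user": swift_user,
        "swift_container": swift_container,
        "swift_object_id": swift_object_id,
        "application": application,
        "original_object_name": original_object_name,
        "creation_date": now,
        "last_modified": now,
        "successful_operations": [],
        "failed_operations": [],
        # Custom metadata supplied by the user, empty by default
        "other_data": dict(other_data or {}),
    }


doc = build_metadata_document("text/csv", "alice", "neocampus", 42, "sensors.csv")
```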
Needed metadata


The required metadata is the metadata that must be filled in to make the architecture work.

| Field | Utility | Filling (automatic / manual) | Editable |
| --- | --- | --- | --- |
| _id | MongoDB default identifier. Allows to uniquely identify the document. | Automatic | No |
| content_type | Data type | Automatic (manual overwrite if needed) | Yes |
| data_processing | Type of service to provide, "custom" or "default"; a set of default services is set up to provide a minimum service | Manual (Default: default) | Yes |
| swift_user | User who owns the data, defined as the user who inserts the data | Automatic | Yes |
| swift_container | Swift container in which the data is inserted | Manual | Yes |
| swift_object_id | Unique identifier for objects in Swift | Automatic | No |
| original_object_name | Original name of the data; necessary to rename the data in the upload | Automatic (manual overwriting possible) | Yes |
| creation_date | Date of creation of the object; corresponds to the date of insertion of the data | Automatic | No |
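The table above can be turned into simple checks. The sketch below is an assumption about how such checks might look, not code from the repository: `missing_required_fields` and `apply_user_update` are hypothetical names, and the field sets are copied from the "Field" and "Editable" columns.

```python
# Every field listed in the table above
REQUIRED_FIELDS = (
    "_id", "content_type", "data_processing", "swift_user",
    "swift_container", "swift_object_id", "original_object_name",
    "creation_date",
)

# Fields marked "No" in the Editable column
NON_EDITABLE_FIELDS = frozenset({"_id", "swift_object_id", "creation_date"})


def missing_required_fields(document):
    """List the required fields that are absent from a stored document."""
    return [field for field in REQUIRED_FIELDS if field not in document]


def apply_user_update(document, changes):
    """Return an updated copy of the document, refusing non-editable fields."""
    rejected = sorted(NON_EDITABLE_FIELDS.intersection(changes))
    if rejected:
        raise ValueError("non-editable fields: " + ", ".join(rejected))
    updated = dict(document)
    updated.update(changes)
    return updated


stored = {"_id": 1, "content_type": "text/csv", "swift_object_id": 7}
updated = apply_user_update(stored, {"content_type": "application/json"})
```

Returning a copy rather than mutating the stored document keeps the original available if the update later has to be rolled back.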

Optional metadata


Optional metadata is metadata that has been added and whose absence does not disrupt the proper functioning of the architecture in a basic usage scenario.

| Field | Utility | Filling (automatic / manual) | Modifiable |
| --- | --- | --- | --- |
| last_modified | Date of last modification; allows to follow the life of the data | Automatic | No |
| successful_operations | List of successful operations on the data; allows to follow the life of the data | Automatic | No |
| failed_operations | List of failed operations; allows to follow the life of the data | Automatic | No |
| application | Textual description of the data | Manual | Yes |
| other_data | JSON document / Dictionary to add additional metadata in key/value format | Manual | Yes |

The project : features



Sizing : Kubernetes



Automatic deployment with Ansible

The automatic deployment of the architecture is done via Ansible playbooks. For now, only the Docker version installed on Linux servers (CentOS 7) has been implemented. The version with containers deployed on a Kubernetes cluster remains to be implemented.

To automatically install the architecture, several steps are required:

Security management


Access to the datalake services goes through a single point : the REST API. It is the only entry point. Warning : data streams are handled directly by Apache Spark Streaming. However, data streams are set up via the REST API. (To work on)

Flow control and networks

Still not handled

Security : by tools

Section to reformat

Identification : Openstack Keystone

Authentication is handled by Openstack Keystone and an LDAP directory. An LDAP directory is deployed by default, but it is possible to configure Openstack Keystone to use a pre-existing authentication base. Calls to the various services require token authentication delivered by Openstack Keystone.
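For reference, a token request in the Keystone v3 Identity API is a POST to `/v3/auth/tokens` with a JSON body; the token comes back in the `X-Subject-Token` response header and is then sent to services in the `X-Auth-Token` header. The sketch below only builds the request body and headers. That this deployment uses the v3 API with password authentication is an assumption, and the helper names, user, project and domain values are placeholders.

```python
import json


def keystone_password_auth_body(username, password, user_domain="Default",
                                project=None, project_domain="Default"):
    """Build the JSON body for POST /v3/auth/tokens (password method)."""
    auth = {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": username,
                    "domain": {"name": user_domain},
                    "password": password,
                }
            },
        }
    }
    if project is not None:
        # Optional: scope the token to a project
        auth["scope"] = {
            "project": {"name": project, "domain": {"name": project_domain}}
        }
    return {"auth": auth}


def auth_headers(token):
    """Headers to attach to calls once a token has been issued."""
    return {"X-Auth-Token": token}


body = keystone_password_auth_body("alice", "secret", project="datalake")
payload = json.dumps(body)  # what would be sent as the request body
```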

Authentication : Server Kerberos

Still not implemented


Still not implemented

Monitoring : Kubernetes monitoring tools

Still not implemented

Monitoring : low level monitoring : SNMP

Still not implemented

Monitoring : Openstack Ceilometer

Still not implemented

User spaces compartmentalization

Still not implemented

User spaces compartmentalization : Virtualization, Docker and Kubernetes

Partially implemented

User spaces compartmentalization : Openstack Keystone

Automation, tests, builds

Automation, tests, builds : CI/CD Pipeline, builds servers

Still not implemented


Automation : Apache Airflow and service lifecheck

Still not implemented

Technical description

Tool by tool

Airflow DAG tools in the apache_airflow/dag/lib folder follow a special nomenclature :


What has been implemented

alt text

The project has been tested through a Proof of Concept hosted on OSIRIM, on several VMs of a VMware virtualization server. Each service has its own virtual machine, except for the access zone databases, which are all hosted on the same VM. Data storage is done on an NFS bay. Only 2 users are allowed to access this network with SSH. At this point (23/11/2020), the POC is not adapted for this platform and won't be deployed in a production state on OSIRIM.

Network description

alt text

Still not implemented TODO : Explanation - 6 different networks : 1 for each need

alt text

TCP Ports used

Raw data area :

Process zone :

Consumption zone :

TODO : Openstack, MongoDB, REST API for insertion, web GUI, etc.
Openstack Swift : REST API (see documentation)
MongoDB : API in several languages (Pymongo in Python as an example)
Airflow : Web server GUI and REST API (see documentation)

API paths : ...

Data exchanges between services

Data exchanges at this point are described in the following schema. alt text TODO : Explanation


alt text TODO : Explanation

Still undefined

alt text

TODO : Explanation

Data formats

Openstack Swift

Objects inserted in Openstack Swift are renamed with a numeric id. This id is incremented by 1 for every object inserted. This makes it easy to follow the number of objects stored in Openstack Swift.

Only the renamed data are stored in Openstack Swift. All metadata are stored in the metadata database (MongoDB). Each object is stored in a container that matches the project or the user group / team.
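The renaming scheme can be sketched with an in-memory toy model. Nothing below talks to Openstack Swift or MongoDB: the class `InMemoryDatalake` is hypothetical, dictionaries and a list stand in for both stores, and starting the counter at 0 is an assumption.

```python
class InMemoryDatalake:
    """Toy stand-in: plain Python containers replace Swift and MongoDB."""

    def __init__(self):
        self.next_object_id = 0  # plays the role of the counter document
        self.swift = {}          # container name -> {object id -> content}
        self.metadata = []       # plays the role of the metadata database

    def insert(self, container, original_name, content):
        """Store content under the next numeric id and record its metadata."""
        object_id = self.next_object_id
        self.next_object_id += 1  # incremented by 1 for every insertion
        # Only the renamed data goes to the object store
        self.swift.setdefault(container, {})[str(object_id)] = content
        # The original name survives only in the metadata
        self.metadata.append({
            "swift_container": container,
            "swift_object_id": object_id,
            "original_object_name": original_name,
        })
        return object_id


lake = InMemoryDatalake()
first = lake.insert("neocampus", "a.csv", b"1")
second = lake.insert("neocampus", "b.csv", b"2")
```

With a real MongoDB counter and several concurrent writers, an atomic increment (for example pymongo's `find_one_and_update` with `$inc`) would be one way to avoid two objects receiving the same id; whether the project does this is not stated here.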

MongoDB metadata database - TO REFACTOR

(23/11/2020) MongoDB contains :

Getting started

The deployment is done through Ansible playbook in deployment_scripts/ansible.

To deploy the architecture, several configuration steps are required :

Not everything has been tested, but the main parts of the playbooks are written. Some modifications may be needed.


Insert a new data

TODO : Batch and Stream data insertion

Batch data

To develop a tool to insert data in the datalake, you have to :

The "" is an example script made to add new data. It was written for testing, but it can be reused to build an insertion script or a REST API. If you want to insert data in the datalake (a file) : use the "insert_datalake()" function in ""

Stream data

Process a data already inserted

There is a document in the "swift" collection of the "stats" database in MongoDB that contains the list of data to process; it is checked every 5 minutes by the "Check_data_to_process" dag. You'll have to add a document to this list containing :

For each data in this list, it will trigger a "new_input" dag to process this data. DISCLAIMER : "new_input" is currently disabled for testing. The current pipeline is "test" until the new pipeline has been integrated.

Access to services - TODO

Create an Airflow job

Jobs (or tasks) are done through Operators in Airflow. A job or a task is defined in a python script. To create a task that will fit in one or more pipelines, you have to use an operator; operators are defined in the Airflow package. Several operators exist and each one has a specific use, including :

The python operator may be the most useful. To use it, there are 2 steps to follow :

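from airflow.operators.python import PythonOperator  # Airflow 2.x import path

def print_context(ds, **kwargs):
    return 'Whatever you return gets printed in the logs'

# The task_id is an example name; "dag" is an existing DAG object
run_this = PythonOperator(task_id='print_the_context', python_callable=print_context, dag=dag)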

A context can be provided to the PythonOperator (it is always provided in version 2.0), which makes it possible to pass arguments to the function through the **kwargs dict. You can use it as a dictionary and create new keys to provide the data you need. This dictionary already contains a lot of information about the dag run (date, id, etc...) and also contains a "ti" key (also available as "task_instance") holding the task instance, whose XCom (cross-communication) methods allow information or objects to be pushed and pulled.



kwargs["key"] = value
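A minimal sketch of XCom use through the context follows. It defines only the 2 callables; wiring them into PythonOperator tasks is left out so that the block does not depend on an Airflow installation. The function names, the XCom key "row_count" and the task id "extract_rows" are assumptions chosen for the example; `xcom_push(key=..., value=...)` and `xcom_pull(task_ids=..., key=...)` are the task instance methods.

```python
def extract_row_count(**kwargs):
    """Callable for a first task: push a value for later tasks."""
    ti = kwargs["ti"]  # task instance taken from the context
    rows = ["a", "b", "c"]  # placeholder for real extraction work
    ti.xcom_push(key="row_count", value=len(rows))


def report_row_count(**kwargs):
    """Callable for a downstream task: pull what the first task pushed."""
    ti = kwargs["ti"]
    # "extract_rows" is the assumed task_id of the task running extract_row_count
    count = ti.xcom_pull(task_ids="extract_rows", key="row_count")
    return f"{count} rows extracted"
```

The task id used in `xcom_pull` is deliberately different from the name of the python function, in line with the naming advice given elsewhere in this document.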




Look at the documentation for more information.

Create an Airflow pipeline

Airflow relies on DAGs (Directed Acyclic Graphs) to implement pipelines / workflows. A pipeline / workflow is defined in a python script. The definition of a pipeline is quite straightforward :

from datetime import datetime
from airflow import DAG

default_args = {
    'start_date': datetime(2016, 1, 1),
    'owner': 'airflow'
}

dag = DAG('my_dag', default_args=default_args)

The DAG can be customized with parameters.

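from airflow.operators.dummy import DummyOperator  # Airflow 2.0 import path

# Operators must be created with keyword arguments
task_1 = DummyOperator(task_id='task_1', dag=dag)
task_2 = DummyOperator(task_id='task_2', dag=dag)
task_1 >> task_2 # Define dependencies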

The "task_2" will be linked to "task_1" and will run after it. Each task can be defined with a run condition (such as "all_success", "all_failed", "at least 1 task is successful", etc..). It is possible to create several branches to offer several processing paths. The tools used for this are branching operators. Branching is declared the same way, but you link a list of tasks after the branching operator. The branching callable has to return the task name of the next task to run.

branch_operator >> [way_1 , way_2] 
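As a sketch of what the branching callable might look like, the function below returns the task name of the branch to follow. It is plain Python so it runs without Airflow; the function name, the assumption that the metadata document is passed as an argument (for example through the operator's `op_kwargs`), and the mapping of "default" to "way_1" and "custom" to "way_2" are all hypothetical.

```python
def choose_branch(metadata, **kwargs):
    """Return the task_id of the next task to run, based on the metadata."""
    # "data_processing" is either "custom" or "default" in the metadata model
    if metadata.get("data_processing", "default") == "custom":
        return "way_2"  # assumed task_id of the custom processing path
    return "way_1"      # assumed task_id of the default processing path


next_task = choose_branch({"data_processing": "custom"})
```

The returned string has to match the task_id of one of the tasks placed directly after the branching operator; the branches that are not returned are skipped.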

Integrate a new process pipeline in Airflow

04/01/2021 : Right now, it is not easy to add a pipeline or a task in Airflow. The way to do it is to change the current working pipeline. Indeed, only one pipeline is triggered by the Openstack Swift proxy when new data is added.

To add a task or a sub-pipeline / sub-workflow, you need to modify "./apache_airflow/dags/" (at the end of the script) and modify the "custom path" in the dag:

custom >> [the_first_task_of_the_sub_pipeline]
the_first_task_of_the_sub_pipeline >> ... >> join

TODO : Add a way to read and parse files in directory and create jobs and dags in function of the content.

For the implementation on OSIRIM, as access are restricted, the best way to add pipeline is to create a python script with :

The new pipeline will be added in the "custom" branch as a new path. Every task has to be named, and no 2 tasks may share the same name. The naming convention will be :


where PROJECT is the name of the project or the team you work in / with, USER is your username, and TASKNAME is a string that briefly describes the task (example : data_cleaning, feature_extraction, etc...). This will make it easy and fast to integrate the new pipeline.

What's coming up - Roadmap

Incoming : Interoperable datalake federation + distributed metadata management systems

TODO : Implementation

TODO list for the project development (i.e. the datalake architecture) TODO : Update TODO list

Another approach to metadata management will be adopted with distributed metadata management.

Other information

Problems already encountered

MsSQL 20xx

Microsoft SQL Server 20xx (i.e. 2017 or 2019) is deployed through a Docker container. A password is needed to set up the database, which comes from :

If this password is not set, the container will crash on boot.


Don't give your task the same name as the callable (python function) : it will lead to an error

Tools not used

In raw data zone

In process zone

In processed data zone

Nothing at this moment.

More documentation

Other markdown files, in the folder of each service, contain more information about that service. A pdf is available in the repository. This pdf contains the report that I wrote for the internship. It mainly covers design thinking.

Tools used in this architecture also have documentation :

Versioning Rules

Versions are defined like this : X.Y.Z-a
X : Major (integer)
Y : Minor (integer)
Z : Build (integer)
a : Patches (letter)
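The X.Y.Z-a scheme can be parsed with a short regular expression. The sketch below is illustrative only: `Version` and `parse_version` are hypothetical names, and treating the patch letter as a single lowercase letter that may be absent (as in "1.0.0") is an assumption.

```python
import re
from typing import NamedTuple, Optional


class Version(NamedTuple):
    major: int
    minor: int
    build: int
    patch: Optional[str]  # single letter, or None when no patch is given


# X.Y.Z with an optional "-a" patch letter
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-([a-z]))?")


def parse_version(text):
    """Parse an X.Y.Z-a version string into its 4 components."""
    match = _VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid version: {text!r}")
    major, minor, build, patch = match.groups()
    return Version(int(major), int(minor), int(build), patch)


release = parse_version("1.2.3-b")
```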

08/06/2021 : Major : 1.0.0 will be released with a production-ready solution with tests :

2.0.0 will be focused on metadata management and Kubernetes. (TODO)

Minor : a minor version is released when a project ends :

Build : each build corresponds to a sprint or to one or several developed features.

Patches : for a bug fix


API description - TODO

The whole project is (C) 2020 March, DANG Vincent-Nam

during an end of study internship for Université Toulouse 3 Paul-Sabatier (FRANCE), IRIT and SMAC Team, neOCampus as original designer of the project / solution / architecture and basis code owner

and for CNRS (as a 1-year fixed-term contract) as developer of the project / solution / architecture and

is licensed under the SSPL, see `'.

Institutions involved

Scientific interest group neOCampus (funder of the internship at the initiative of the project):

A group of laboratories, organizations and industrialists to build the campus of the future


Toulouse University : Université Toulouse 3 - Paul Sabatier


Toulouse computer science research laboratory :

IRIT - Institut de Recherche en Informatique de Toulouse

French National Center for Scientific Research - Centre national de la recherche scientifique


Engineers from Modis France worked on this project in a patronage project with Université Toulouse 3 Paul-Sabatier and neOCampus.


04/01/2021 :

*DANG Vincent-Nam (Repository basis code owner, intern and project main maintainer) /