October 18, 2024
There have been some changes in terminology, and I expect the project to consolidate around the new terms.
In general, data-gathering pipelines are divided into two components: a collector, which gathers the information, and a recorder, which translates the gathered information into a normalized form that can then be inserted into the database.
For example, the "indexer" for file system metadata is logically a "collector" of the information, while the ingester is logically a recorder. Sometimes these stages are combined, and sometimes they are further subdivided. For example, the local file system ingesters ("recorders") often emit data into a file for bulk uploading.
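To make the split concrete, here is a minimal sketch of the collector/recorder division of labor. The class names and fields are illustrative only and are not the project's actual implementation:

# Illustrative sketch of the collector/recorder split; class and field names
# are hypothetical, not the actual Indaleko implementation.
import json
import os
from datetime import datetime, timezone

class FileSystemCollector:
    """Gathers raw metadata about storage objects (no normalization)."""
    def __init__(self, root: str):
        self.root = root

    def collect(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    stat = os.stat(path)
                except OSError:
                    continue
                # Raw, platform-specific data, captured as-is.
                yield {"path": path, "st_size": stat.st_size, "st_mtime": stat.st_mtime}

class FileSystemRecorder:
    """Translates collected data into a normalized form for the database."""
    def record(self, raw: dict) -> dict:
        return {
            "URI": raw["path"],
            "Size": raw["st_size"],
            "Modified": datetime.fromtimestamp(raw["st_mtime"], tz=timezone.utc).isoformat(),
        }

if __name__ == "__main__":
    collector = FileSystemCollector(os.path.expanduser("~"))
    recorder = FileSystemRecorder()
    # Emit normalized records to a file for bulk uploading, as the local
    # file system recorders often do.
    with open("normalized.jsonl", "w", encoding="utf-8") as out:
        for raw in collector.collect():
            out.write(json.dumps(recorder.record(raw)) + "\n")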
Some of this is now reflected in the naming system (notably in the activity area of the project.)
I have also removed requirements.txt from the project. There is a pyproject.toml file instead, which captures dependencies. I added a setup_env.py script as well.
The setup_env.py script will set up a virtual environment for you. It will restrict you to using Python 3.12 or newer for the project, and it will download and install the "uv" utility for managing dependencies and configuring a virtual environment. Since this is new, it may not work properly in other environments; please let me know and I'll work with you to get it working. So far, I've tested it on Windows and Linux.
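As a rough illustration only (this is not the actual setup_env.py, and the exact steps it performs may differ), the script does the equivalent of the following:

# Hypothetical sketch of the kind of work setup_env.py does; consult the
# actual script for the authoritative steps.
import subprocess
import sys

if sys.version_info < (3, 12):
    sys.exit("Indaleko requires Python 3.12 or newer")

# Install uv, then use it to create a virtual environment and install the
# project's dependencies from pyproject.toml.
subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)
subprocess.run(["uv", "venv", ".venv"], check=True)
subprocess.run(["uv", "pip", "install", "-e", "."], check=True)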
Project Indaleko is about creating a Unified Personal Index. The key characteristics of the UPI model are:
Indexing storage in a uniform fashion, regardless of where or what is being stored. Primarily, this means that we collect information and normalize it for local and cloud storage devices.
Utilizing semantic transducers to obtain information about content. The term "semantic transducer" was introduced by Gifford in the Semantic File System (SFS) project in the early 1990s and remains an important concept in indexing systems today.
Collecting and associating extrinsic information about how storage objects are used. We call this extrinsic information "activity context" because it relates to other activities that are ongoing and correlate with storage. For example: the location of the computer (and hence the user) when a file is created, the weather conditions, websites being visited contemporaneously with file access and/or creation, the mood of a human user creating content, and interactions between human users (e.g., files you accessed while you were interacting with another user.)
The goal of this research artifact is to demonstrate that having a unified private index with rich semantic and activity data provides a robust base on which to build "personal archiving tools" that enable easier finding of relevant information.
Indaleko is designed around a modular architecture. The goals of this architecture are to break down processing into discrete components, both for ease of implementation as well as flexibility in supporting a variety of devices, storage platforms, and semantic transducers.
Logically, the project is broken down into various components:
An Indexer is simply a component that identifies storage objects of interest. For example, we have indexers that walk a collection of local storage devices and collect basic storage information about the objects stored on each device. There is no requirement that the captured data be in any particular format. The motivation for this is that different systems return different information: there are subtle distinctions in how the information is represented, and while there is commonality amongst the metadata, there are sufficient differences that building a universal indexer is a complex task. That "complex task" is, ultimately, one that Indaleko provides at some level. In our current implementation, indexers do not interact (or only minimally interact) with the index database.
An Ingester (note, this may not be the best name) is a component that processes indexer output. There is a many-to-one relationship between ingesters and indexers. In our model, "ingestion" is the act of taking data from an indexer and extracting useful metadata from that output. While it might seem logical to combine the indexer and ingester together - something we did in earlier versions - we chose to split them for similar reasons that we have distinct indexers. By separating them, we allow specialized ingesters that can process a given indexer's output in a specific way. For example, there is generally an indexer-specific ingester that understands how to normalize the metadata captured by the indexer and then store it in the index database. This allows us to use a common normalized model, with the responsibility for converting the data into that normalized form implemented in a storage-specific fashion. Ingesters, however, can also provide additional metadata. For example, an ingester agent could run one or more semantic transducers, elements that extract information about the contents of the file. Examples might include:
Note that in each of these cases, the benefits of using a semantic transducer stem primarily from the proximity of the file data on the specific device. Once the data is no longer resident on the local device, as in the case of cloud storage, it becomes resource intensive to fetch the files and extract additional metadata.
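As an illustration of what a semantic transducer might look like in this model (a hypothetical example, not one of the project's transducers), consider a transducer that computes a checksum and guesses the MIME type of a local file:

# Hypothetical semantic transducer: extracts content-derived metadata from a
# local file. Cheap while the file is on the local device; expensive if the
# file must first be fetched from cloud storage.
import hashlib
import mimetypes

def checksum_transducer(path: str) -> dict:
    sha256 = hashlib.sha256()
    with open(path, "rb") as data:
        for chunk in iter(lambda: data.read(1024 * 1024), b""):
            sha256.update(chunk)
    mime_type, _ = mimetypes.guess_type(path)
    return {"SHA256": sha256.hexdigest(), "MimeType": mime_type}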
The Indexer database. This is the Unified Personal Index service. While we have chosen to implement it using ArangoDB, it could be implemented on other database technologies, whether in tandem or as a replacement.
The activity context components. The concept of Activity Context is related to the observation that human activities are entwined with their use of storage. At the present time, storage engines do not associate related human activity with storage objects. Associating human activity with storage objects and storage activity is one of the key goals of Indaleko. The activity context aspects of Indaleko break down into multiple components:
Indaleko does not define what that activity data is, but rather provides a framework for capturing it and utilizing it to find human-related associations with storage objects. While we know that such data is useful in augmenting personal data search (see Searching Heterogeneous Personal Data, for example), we do not know the full range of such data that could be useful. Thus, this model encourages the development and evaluation of such activity data source providers.
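To illustrate the framework idea, an activity data provider might report observations shaped roughly like the sketch below. This is purely illustrative; the names and fields are hypothetical and not Indaleko's actual interfaces:

# Hypothetical shape of an activity context observation; the real framework
# defines its own provider registration and data models.
import datetime
import uuid
from dataclasses import dataclass, field

@dataclass
class ActivityObservation:
    provider: uuid.UUID                     # which activity data provider produced this
    timestamp: datetime.datetime            # when the activity was observed
    attributes: dict = field(default_factory=dict)  # provider-specific data, e.g. location or weather

location_provider = uuid.uuid4()            # a real provider would use a fixed UUID
observation = ActivityObservation(
    provider=location_provider,
    timestamp=datetime.datetime.now(datetime.timezone.utc),
    attributes={"latitude": 49.26, "longitude": -123.25},
)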
The current project design is focused on evaluating the practicality and efficacy of improving the "finding" of relevant digital data in a systematic fashion that works across user devices in a dynamic storage environment mixing local devices with cloud storage and application quasi-storage. The architecture reflects a design philosophy of modular components with easy extensibility.
The current implementation consists primarily of a collection of Python scripts that interact with an Arango database. While in prior work we used a mixture of languages, we chose Python for the current iteration because it provided a robust model for constructing our prototype.
The implementation is organized around a set of classes. The fundamental class associated with information stored in the database is the Record class, which defines a small amount of information that should be present in everything we store in the database: the original captured data (the "raw data"), attributes extracted directly or indirectly (the "attributes"), the source of the information (a UUID identifier and a version number), and a timestamp of when the relevant information was captured.
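A rough sketch of the idea follows; the field names here are illustrative, and the actual Record class is the authoritative definition:

# Illustrative sketch of the Record concept described above; not the actual
# Indaleko Record class.
import datetime
import uuid
from dataclasses import dataclass

@dataclass
class Record:
    raw_data: bytes                  # the original captured data
    attributes: dict                 # attributes extracted directly or indirectly
    source_identifier: uuid.UUID     # which component produced this record
    source_version: str              # version of that component
    timestamp: datetime.datetime     # when the information was captured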
The components map to various elements of the architecture:
This prototype system is still under active development. It would be surprising if it does not continue to change as the project moves forward.
Last Updated: January 16, 2024
In this section, we'll talk about how to set up your system to use Indaleko. The process is a combination of manual and automated steps.
Things you should have installed:
Docker: this is needed because we use ArangoDB and run it in a containerized environment. The data is stored on the local host system. While it is possible to configure this to use a remote database, that step is not presently automated.
Python: this is needed to run the Indaleko scripts. Note there are a number of libraries that need to be installed. There is a requirements.txt file that captures the current configuration we have been using, though it may work with other versions of the various libraries. (Per the October 18, 2024 note above, dependencies are now captured in pyproject.toml instead.) It is distinctly possible we've added a dependency and failed to capture it, in which case please open an issue and/or a pull request.
Powershell: this is Windows only. There is a PowerShell script that gathers configuration information about your Windows machine. It requires elevation ("administrative privileges") and you must enable running PowerShell scripts (which is disabled by default.) The script writes data into the config directory, where it is then parsed and extracted by the setup scripts.
ArangoDB Client Tools: in order to upload the files into Arango, you need to install the ArangoDB client tools on your system. There are versions for Windows, MacOS X, and Linux. Note: you should not run the ArangoDB database locally; keep it in the container to minimize compatibility issues. This may require manually disabling the locally installed database service (this was required on Windows, for example.)
The simplest way to set up the database is to use the dbsetup.py script. It currently supports three commands:
Note that if you run the script without arguments it will choose to either check your existing database (if it exists) or set one up (if it does not.)
As part of configuration, the script generates a config file that is stored in the config directory. Note that this is a sensitive file and will not be checked into git by default (it is in .gitignore). If you lose this file, you will need to change your container to use a new (correct) password. Your data is not encrypted at rest by the database.
This script will pull the most recent version of the ArangoDB docker image, provision a shared volume for storing the database, and create a random password for the root account, which is stored in the config file. It also creates an Indaleko account with a separate password that has access only to the Indaleko database. It will create the various collections used by Indaleko, including their schemas. Most scripts run using only the Indaleko account.
To look at the various options for this script, you can use the --help command. By default this script tries to "do the right thing" when you first invoke it (part of our philosophy of making the tool as easy as possible for new users.)
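For example, a first-time setup and a look at the available options look like this (assuming you invoke the script from the directory that contains it):
python dbsetup.py
python dbsetup.py --help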
You can confirm the database is set up and running by accessing your ArangoDB local database connection. You can extract the password from the indaleko-db-config.ini file, which is located in the config directory by default. Do not distribute this file; it contains passwords for your database.
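If you prefer to read the credentials programmatically, something like the following works. This is only an illustration; the section and key names vary, so the snippet simply prints whatever the file contains:

# Read the generated database configuration; key names vary, so this just
# prints whatever sections and values the file contains.
import configparser

config = configparser.ConfigParser()
config.read("config/indaleko-db-config.ini")
for section in config.sections():
    print(f"[{section}]")
    for key, value in config[section].items():
        print(f"  {key} = {value}")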
Note: the scripts IndalekoDocker.py and IndalekoDBConfig.py have additional functionality for managing docker images and databases.
Note that there are currently three platforms we support: Windows, Mac OS X, and Linux.
The following sections will describe how to configure the various systems.
To install your machine configuration, you should run the correct configuration script for your system. Currently there are three scripts defined:
IndalekoLinuxMachineConfig.py - this script captures and stores machine configuration for a Linux machine. As of this writing, this script is under development. This note should be updated when it works.
IndalekoMacMachineConfig.py - this script captures and stores machine configuration for a Mac OS X machine.
IndalekoWindowsMachineConfig.py - this script captures and stores machine configuration for a Windows machine.
Machine configuration information is likely to change. Currently we are capturing:
For the moment we aren't requiring any of this. When we have volume information, we associate it with the file via a UUID for the volume. Note: Windows calls them GUIDs ("Globally Unique Identifiers") but they are UUIDs ("Universally Unique Identifiers").
To add the machine configuration to the database you can run the correct script on your machine. Some machines may require a pre-requisite step. For example, Windows requires executing the script windows-hardware-info.ps1 to capture the machine information, some of which requires elevated privileges to obtain. Other systems may have similar requirements.
Assuming any pre-requisite script has been run, you can load the configuration data into the database with something like the following:
python3 IndalekoWindowsMachineConfig.py --add
Note: use the correct script for your platform. The incorrect script should report an error. Issue #32 is a suggestion to improve this so that one can just use the IndalekoMachineConfig.py script directly and have it "do the right thing".
There are multiple steps required to set up Indaleko on your Windows machine. Assuming you have installed the database, you should be able to index and ingest the data on your local machine.
Capturing the system configuration on Windows is done using the PowerShell script windows-hardware-info.ps1, which must be run with administrative privileges (the script is explicitly set to require this, since some of the commands fail otherwise). There are many resources available explaining how to do this. Here is a video, 3 easy ways to run Windows Powershell as admin on Windows 10 and 11, but it's certainly not the only resource.
Note: the output is written into the config directory, which is not saved to git (the entire directory is excluded in .gitignore). While you can override this, doing so is not recommended due to the sensitive information captured by this script.
Once you have captured your configuration information, you can run the Python script IndalekoWindowsMachineConfig.py. This script will locate and parse the file that was saved by the PowerShell script and insert it into the database. The script has various override options, but aims to "do the right thing" if you run it without arguments. To see the arguments, you can use the --help option.
Once your machine configuration has been saved, you can begin creating data index files. This is done by executing the Python script IndalekoWindowsLocalIndexer.py using your installed version of Python. By default, this will index your home directory, which is usually something like C:\Users\MyName. If you want to override this you can use the --path option. You can see all of the override options by using the --help command.
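For example, to index a specific directory instead of the default home directory (the path shown here is just a placeholder):
python IndalekoWindowsLocalIndexer.py --path C:\Users\MyName\Documents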
This script will write the output index file to the data directory. Note that this directory is excluded from checkin to git by default, as it is listed in the .gitignore file. Logs (if any) will be written to the logs directory by default.
Without any options given, it will write the file with a structured name that includes the platform, machine id, volume id, and the timestamp of when the data was captured.
The index data can be used in subsequent ingestion steps.
An ingester is an agent that takes the indexing data you have previously captured and then performs additional analysis on it. This is the step that loads data into the database. As of this writing, there is only a single ingester written for Windows, which is the script IndalekoWindowsLocalIngester.py. This script knows the format of the index file output, retrieves it, normalizes the data that was captured by the indexer, and then writes out the resulting data.
By default, it will take one of the data files (ideally the most recent) and ingest it. The output of this is a set of files that can be manually loaded into the database. The files generated have long names, but those names capture information about the ingested data. Note that the timestamp of the output file will match the timestamp of the index file unless you override it.
There are many override options. To see your options you can use the --help command. This command will also show you which file it will ingest unless you override it.
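In the common case you can simply run it without arguments, and it will pick the most recent index file:
python IndalekoWindowsLocalIngester.py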
While the ingestion script does write a small amount of data to the database, it writes most of its output to intermediate files in order to allow bulk uploading. The bulk uploader requires the arangoimport tool, which was installed with the ArangoDB client tools package.
There are two output files. One represents file and directory metadata; this is uploaded to the Objects collection, which must be specified on the command line.
arangoimport -c Objects <name of file with metadata>.jsonl
We use the JSON Lines (jsonl) format for these files. Depending upon the size of your file, this uploading process can take considerable time.
The second file represents the relationships between the objects; this is uploaded to the Relationships collection, which also must be specified on the command line. Note that these collections should already exist inside the Arango database.
arangoimport -c Relationships <name of file with metadata>.jsonl
The arangoimport tool will tell you how many objects were successfully inserted. This should show no errors or warnings. If it does, there is an issue that will need to be resolved before using the rest of the Indaleko facilities.
This section describes how to set up Indaleko on MacOS X.
Run MacHardwareInfoGenerator.py to capture the configuration of your Mac. The output is saved in the config directory. It records metadata about your Mac, including the names and sizes of the volumes, hardware information, etc.
python MacHardwareInfoGenerator.py -d ./config
The output will be saved inside the config directory with the name pattern macos-hardware-info-[GUID]-[TIMESTAMP].json. The following is a sample of what you should see:
{
"MachineGuid": "74457f40-621b-444b-950b-21d8b943b28e",
"OperatingSystem": {
"Caption": "macOS",
"OSArchitecture": "arm64",
"Version": "20.6.0"
},
"CPU": {
"Name": "arm",
"Cores": 8
},
"VolumeInfo": [
{
"UniqueId": "/dev/disk3s1s1",
"VolumeName": "disk3s1s1",
"Size": "228.27 GB",
"Filesystem": "apfs"
},
{
"UniqueId": "/dev/disk3s6",
"VolumeName": "disk3s6",
"Size": "228.27 GB",
"Filesystem": "apfs"
}
]
}
Once you have captured the configuration, the first step is to index your storage.
The next step is ingestion, which takes the raw indexing data, normalizes it, and captures it into files that can be bulk uploaded into the database. Future versions may automate more of this pipeline.
After ingesting the index data, it is necessary to ensure that what ended up in the database is what we want, especially in terms of the relationships we define. This matters most during development; it can generally be skipped when simply using the tool.
There is a validators package that contains the code and scripts for validation. The main validator code is IndalekoIngesterValidator.py. The scripts in the package are used to extract rules that should be checked against the ingested data. The current validator performs the following checks:
Validates the number of distinct file types, i.e., different st_mode values, to be exactly the same as what we have seen in the index file.
Validates the Contains and Contained By relationships for each folder. The current version only validates the number of children rather than an exact string match.
Here's how we can use it:
The extraction script accepts both json and jsonl index files. Run it from the validators directory:
validators$ extract_validation.sh /path/to/the/index_file
The script creates a validations.jsonl file inside the data folder, where each line is a rule to be checked. Here are three examples of these rules:
{"type":"count","field":"st_mode","value":16859,"count":1}
{"type":"contains","parent_uri":"/Users/sinaee/.azuredatastudio","children_uri":["/Users/sinaee/.azuredatastudio/extensions","/Users/sinaee/.azuredatastudio/argv.json"]}
{"type":"contains","parent_uri":"/Users/sinaee/.azuredatastudio","children_uri":["/Users/sinaee/.azuredatastudio/extensions","/Users/sinaee/.azuredatastudio/argv.json"]}
{"type":"contained_by","child_uri":"/Users/sinaee/.azuredatastudio/extensions/microsoft.azuredatastudio-postgresql-0.2.7/node_modules/dashdash/package.json","parent_uris":["/Users/sinaee/.azuredatastudio/extensions/microsoft.azuredatastudio-postgresql-0.2.7/node_modules/dashdash"]}
validators$ python IndalekoIngesterValidator.py -c /Users/sinaee/Projects/Indaleko/config/indaleko-db-config.ini -f ./data/validations.jsonl
You should not see any errors; the skipping messages are fine.
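For reference, a few lines of Python are enough to inspect the rules file. This is only an illustration of the rule format shown above, not the validator's implementation:

# Print a human-readable summary of the validation rules; illustrative only.
import json

with open("data/validations.jsonl", "r", encoding="utf-8") as rules:
    for line in rules:
        rule = json.loads(line)
        if rule["type"] == "count":
            print(f"count rule: field={rule['field']} value={rule['value']} count={rule['count']}")
        elif rule["type"] == "contains":
            print(f"{rule['parent_uri']} should contain {len(rule['children_uri'])} children")
        elif rule["type"] == "contained_by":
            print(f"{rule['child_uri']} should be contained by {rule['parent_uris']}")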
To execute the full pipeline, make sure you have installed the necessary prerequisites for this project: docker and python 3.12.
To execute the pipeline, run run.py with the directory you want to index, for instance:
$ python run.py --path /path/to/dir
Here are some important points to consider when executing the script:
Use --help to view the available options.
Indexed data persists in the database across runs of run.py. Consequently, re-indexing the same folder generates warnings; existing records are not updated at present.
--reset is an argument for run.py that removes all available collections before ingesting new data. Consequently, using it results in the loss of previously indexed data.
To view your data, navigate to http://localhost:8529/ and log in using your username and password. You can find these credentials in config/indaleko-db-config.ini under user_name and user_password.
Indaleko Project README file
Copyright (C) 2024 Tony Mason
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Note (as of October 18, 2024): I am adding this as I migrate towards modern tooling for the project.
uv is a pip replacement package manager that I've started to use. You can install it from the uv website. It also handles virtual environments.