
Project Indaleko

Recent Changes

October 18, 2024

There have been some changes in terminology, and I suspect the project will consolidate around this new terminology.

In general, data gathering pipelines are divided into one component that gathers the information, a collector, and a second component that translates the gathered information into a normalized form that can then be inserted into the database, a recorder.

For example, the "indexer" for the file system metadata is logically a "collector" of the information, while the ingester is logically a recorder. Sometimes these stages are combined, sometimes they are further subdivided. For example, in the case of the local file system ingesters ("recorders") they often emit data into a file for bulk uploading.

Some of this is now reflected in the naming system (notably in the activity area of the project).

I have also removed requirements.txt from the project; dependencies are now captured in the pyproject.toml file instead. I have added a setup_env.py script as well.

The setup_env.py script will set up a virtual environment for you. It requires Python 3.12 or newer for the project, and it will download and install the "uv" utility for managing dependencies and configuring the virtual environment. Since this is new, it may not work properly in all environments; if you run into problems, please let me know and I'll work with you to get it working. So far, I've tested it on Windows and Linux.
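For the curious, here is a minimal sketch of the kind of checks such a script performs. This is illustrative, not the actual setup_env.py, and it assumes uv can be bootstrapped via pip:

import subprocess
import sys

MIN_VERSION = (3, 12)

if sys.version_info < MIN_VERSION:
    sys.exit(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required; "
             f"found {sys.version.split()[0]}")

# Install uv if it is not already available, then let it create the venv.
try:
    subprocess.run(["uv", "--version"], check=True, capture_output=True)
except (FileNotFoundError, subprocess.CalledProcessError):
    subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)

subprocess.run(["uv", "venv", "--python", "3.12"], check=True)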

Introduction

Project Indaleko is about creating a Unified Personal Index (UPI). The key characteristics of the UPI model are:

The goal of this research artifact is to demonstrate that having a unified personal index with rich semantic and activity data provides a robust base on which to build "personal archiving tools" that enable easier finding of relevant information.

Architecture

Indaleko is designed around a modular architecture. The goals of this architecture are to break processing down into discrete components, both for ease of implementation and for flexibility in supporting a variety of devices, storage platforms, and semantic transducers.

[Indaleko Architecture diagram]

Logically, the project is broken down into various components:

Design

The current project design focuses on evaluating the practicality and efficacy of improving the "finding" of relevant digital data in a systematic fashion that works across user devices, in a dynamic storage environment that mixes local devices, cloud storage, and application quasi-storage. The architecture reflects a design philosophy of modular components with easy extensibility.

Implementation

The current implementation consists primarily of a collection of Python scripts that interact with an ArangoDB database. While prior work used a mixture of languages, we chose Python for the current iteration because it provides a robust model for constructing our prototype.

Class model

The implementation is organized around a set of classes. The fundamental class associated with information stored in the database is the Record class, which defines a small amount of information that should be present in everything we store: the original captured data (the "raw data"), attributes extracted directly or indirectly from it (the "attributes"), the source of the information (a UUID identifier and a version number), and a timestamp of when the relevant information was captured.
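As a rough illustration, the Record shape described above might look like the following dataclass. The field names here are assumptions for exposition, not the project's actual definition:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID

@dataclass
class Record:
    raw_data: bytes          # the original captured data
    attributes: dict         # attributes extracted directly or indirectly
    source_id: UUID          # which collector/recorder produced this
    source_version: str      # version of that source
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))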

The components map to various elements of the architecture:

This prototype system is still under active development. It would be surprising if it does not continue to change as the project moves forward.

Last Updated: January 16, 2024

How to use Indaleko?

In this section, we'll talk about how to set up your system to use Indaleko. The process is a combination of manual and automated steps.

Install Pre-requisites

Things you should have installed:

Set up the database

The simplest way to set up the database is to use the dbsetup.py script. It currently supports three commands:

Note that if you run the script without arguments, it will either check your existing database (if it exists) or set one up (if it does not).

As part of configuration, the script generates a config file that is stored in the config directory. **Note that this file is a sensitive file and will not be checked into git by default (it is in .gitignore).** If you lose this file, you will need to reconfigure your container with a new (correct) password. Your data is not encrypted at rest by the database.

This script will pull the most recent version of the ArangoDB docker image, provision a shared volume for storing the database, and create a random password for the root account, which is stored in the config file. It also creates an Indaleko account, with a separate password, that only has access to the Indaleko database, and it creates the various collections used by Indaleko, including their schemas. Most scripts run using the Indaleko account.

To look at the various options for this script, you can use the --help option. By default this script tries to "do the right thing" when you first invoke it (part of our philosophy of making the tool easy to use for new users).

You can confirm the database is set up and running by accessing your local ArangoDB connection. You can extract the password from the indaleko-db-config.ini file, which is located in the config directory by default. Do not distribute this file; it contains the passwords for your database.
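If you prefer to check programmatically, here is a minimal sketch using the python-arango package. The config section name and database name below are assumptions; adjust them to match your config file:

import configparser
from arango import ArangoClient  # pip install python-arango

config = configparser.ConfigParser()
config.read("config/indaleko-db-config.ini")
section = config["database"]  # assumed section name

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("Indaleko",                     # assumed database name
               username=section["user_name"],
               password=section["user_password"])
print([c["name"] for c in db.collections()])   # lists the Indaleko collections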

Note: the scripts IndalekoDocker.py and IndalekoDBConfig.py have additional functionality for managing docker images and databases.

Set up your machine configuration

Note that there are currently three platforms we are supporting:

The following sections will describe how to configure the various systems.

To install your machine configuration, you should run the correct configuration script for your system. Currently there are three scripts defined:

Machine configuration information is likely to change. Currently we are capturing:

For the moment we aren't requiring any of this. When we have volume information, we associate it with the file via a UUID for the volume. Note: Windows calls these GUIDs ("Globally Unique Identifiers"), but they are UUIDs ("Universally Unique Identifiers").
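For example, a Windows volume GUID path carries a standard 128-bit identifier that Python's uuid module parses directly. The GUID below is made up for illustration:

from uuid import UUID

volume_path = "\\\\?\\Volume{3f8c2a4e-9d2b-4f6a-8c1e-5b7d2a9e4c10}\\"
volume_uuid = UUID(volume_path.split("{")[1].split("}")[0])
print(volume_uuid)  # the same identifier, whether you call it a GUID or a UUID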

To add the machine configuration to the database, run the correct script for your machine. Some machines may require a prerequisite step. For example, Windows requires executing the script windows-hardware-info.ps1 to capture the machine information, some of which requires elevated privileges to obtain. Other systems may have similar requirements.

Assuming any prerequisite script has been run, you can load the configuration data into the database with something like the following:

python3 IndalekoWindowsMachineConfig.py --add

Note: use the correct script for your platform. The incorrect script should report an error. Issue #32 is a suggestion to improve this so that one can just use the IndalekoMachineConfig.py script directly and have it "do the right thing".

Windows

There are multiple steps required to set up Indaleko on your Windows machine. Assuming you have installed the database, you should be able to index and ingest the data on your local machine.

Capture System Configuration

Capturing the system configuration on Windows is done using the PowerShell script windows-hardware-info.ps1, which must be run with administrative privileges (the script is explicitly set to require this, since some of the commands fail otherwise). There are many resources explaining how to do this; the video "3 easy ways to run Windows PowerShell as admin on Windows 10 and 11" is one, but certainly not the only one.

Note: the output is written into the config directory, which is not saved to git (the entire directory is excluded in .gitignore). While you can override this, it is not recommended due to the sensitive information captured by this script.

Once you have captured your configuration information, you can run the Python script IndalekoWindowsMachineConfig.py. This script will locate and parse the file that was saved by the PowerShell script and insert it into the database.

The script has various override options, but aims to "do the right thing" if you run it without arguments. To see the arguments, you can use the --help option.

Index Your Storage

Once your machine configuration has been saved, you can begin creating data index files. This is done by executing the Python script IndalekoWindowsLocalIndexer.py using your installed version of Python.

By default, this will index your home directory, which is usually something like C:\Users\MyName. If you want to override this, use the --path option. You can see all of the override options by using the --help option.

This script will write the output index file to the data directory. Note that this directory is excluded from git by default, as it is listed in the .gitignore file. Logs (if any) will be written to the logs directory by default.

Without any options given, it will write the file with a structured name that includes the platform, machine id, volume id, and the timestamp of when the data was captured.
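As a hypothetical illustration of that naming scheme (the actual format string the indexer uses may differ):

import uuid
from datetime import datetime, timezone

platform_name = "windows"
machine_id = uuid.uuid4()   # in practice, the machine's stored UUID
volume_id = uuid.uuid4()    # in practice, the volume's GUID/UUID
timestamp = datetime.now(timezone.utc).strftime("%Y_%m_%dT%H_%M_%S")

print(f"{platform_name}-{machine_id}-{volume_id}-{timestamp}.jsonl")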

The index data can be used in subsequent ingestion steps.

Process Your Storage Indexing

An ingester is an agent that takes the indexing data you previously captured and performs additional analysis on it. This is the step that loads data into the database. As of this writing, there is only a single ingester for Windows, the script IndalekoWindowsLocalIngester.py. This script knows the format of the index file, retrieves it, normalizes the data captured by the indexer, and then writes out the resulting data.

By default, it will take one of the data files (ideally the most recent) and ingest it. The output is a set of files that can be manually loaded into the database. The files generated have long names, but those names capture information about the ingested data. Note that the timestamp of the output file will match the timestamp of the index file unless you override it.

There are many override options. To see them, use the --help option. This will also show you which file it will ingest unless you override it.

While the ingestion script does write a small amount of data to the database, it writes mostly to intermediate files in order to allow bulk uploading. The bulk uploader requires the arangoimport tool, which is installed with the ArangoDB client tools package.

There are two output files. The first represents file and directory metadata; it is uploaded to the Objects collection, which must be specified on the command line:

arangoimport --collection Objects --type jsonl --file <name of file with metadata>.jsonl

We use the JSON Lines (jsonl) format for these files. Depending upon the size of your file, this uploading process can take considerable time.
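JSON Lines is simply one JSON document per line, which makes the output easy to inspect before uploading. For example (the file name below is a placeholder):

import json

with open("objects.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))   # each line parses as an independent document
        if i == 2:
            break                 # peek at the first three records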

The second file represents the relationships between the objects; it is uploaded to the Relationships collection, which must also be specified on the command line. Note that both collections should already exist inside the ArangoDB database.

arangoimport --collection Relationships --type jsonl --file <name of file with relationships>.jsonl

The arangoimport tool will report how many objects were successfully inserted. The run should show no errors or warnings; if it does, there is an issue that needs to be resolved before using the rest of the Indaleko facilities.

MacOS

This section describes how to set up Indaleko on macOS.

Capture System Configuration

Run MacHardwareInfoGenerator.py to capture the configuration of your Mac. The output is saved in the config directory. It records the metadata about your Mac, including the name and size of the volumes, hardware info, etc.

python MacHardwareInfoGenerator.py -d ./config

The output will be saved inside the config directory with this name pattern macos-hardware-info-[GUID]-[TIMESTAMP].json. The following is a sample of what you should see:

{
    "MachineGuid": "74457f40-621b-444b-950b-21d8b943b28e",
    "OperatingSystem": {
        "Caption": "macOS",
        "OSArchitecture": "arm64",
        "Version": "20.6.0"
    },
    "CPU": {
        "Name": "arm",
        "Cores": 8
    },
    "VolumeInfo": [
        {
            "UniqueId": "/dev/disk3s1s1",
            "VolumeName": "disk3s1s1",
            "Size": "228.27 GB",
            "Filesystem": "apfs"
        },
        {
            "UniqueId": "/dev/disk3s6",
            "VolumeName": "disk3s6",
            "Size": "228.27 GB",
            "Filesystem": "apfs"
        }
    ]
}

Index Your Storage

Once you have captured the configuration, the next step is to index your storage.

Process Your Storage Indexing

This is the process we call ingestion, which takes the raw indexing data, normalizes it, and captures it into files that can be bulk uploaded into the database. Future versions may automate more of this pipeline.

Linux

Ingestion Validator

After ingesting the index data, it is necessary to ensure that what ended up in the database is what we expect, especially in terms of the relationships we define. This matters mostly during development; it can generally be skipped when simply using the tool.

There is a validators package that contains the code and scripts for validation. The main validator code is IndalekoIngesterValidator.py. The scripts in the package extract rules that are then checked against the ingested data. The current validator performs the following checks:

Here's how we can use it:

  1. Install jq. It is a powerful tool for working with JSON and JSONL files.
  2. Run extract_validation.sh, passing the path to the index file we ingested:
validators$ ./extract_validation.sh /path/to/the/index_file

The script creates a validations.jsonl file inside the data folder, where each line is a rule to be checked. Here are three examples of these rules:

{"type":"count","field":"st_mode","value":16859,"count":1}
{"type":"contains","parent_uri":"/Users/sinaee/.azuredatastudio","children_uri":["/Users/sinaee/.azuredatastudio/extensions","/Users/sinaee/.azuredatastudio/argv.json"]}
{"type":"contains","parent_uri":"/Users/sinaee/.azuredatastudio","children_uri":["/Users/sinaee/.azuredatastudio/extensions","/Users/sinaee/.azuredatastudio/argv.json"]}
{"type":"contained_by","child_uri":"/Users/sinaee/.azuredatastudio/extensions/microsoft.azuredatastudio-postgresql-0.2.7/node_modules/dashdash/package.json","parent_uris":["/Users/sinaee/.azuredatastudio/extensions/microsoft.azuredatastudio-postgresql-0.2.7/node_modules/dashdash"]}
  3. Run IndalekoIngesterValidator.py, passing the config file path and the validations file path:
validators$ python IndalekoIngesterValidator.py -c /Users/sinaee/Projects/Indaleko/config/indaleko-db-config.ini -f ./data/validations.jsonl

You should not see any errors; "skipping" messages are fine.
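To give a flavor of what the validator does, here is a sketch of checking a "count" rule like the first example above against the Objects collection. Whether st_mode is a top-level attribute of the stored documents is an assumption, as are the config section and database names:

import configparser
from arango import ArangoClient

config = configparser.ConfigParser()
config.read("config/indaleko-db-config.ini")
section = config["database"]  # assumed section name, as before

db = ArangoClient(hosts="http://localhost:8529").db(
    "Indaleko", username=section["user_name"],
    password=section["user_password"])

rule = {"type": "count", "field": "st_mode", "value": 16859, "count": 1}
cursor = db.aql.execute(
    "FOR doc IN Objects FILTER doc[@field] == @value "
    "COLLECT WITH COUNT INTO n RETURN n",
    bind_vars={"field": rule["field"], "value": rule["value"]})
actual = next(cursor)
print("ok" if actual == rule["count"]
      else f"expected {rule['count']}, got {actual}")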

Running the Full Pipeline

MacOS

To execute the full pipeline, make sure you have installed the necessary prerequisites for this project: Docker and Python 3.12.

To execute the pipeline, run run.py with the directory you want to index, for instance:

$ python run.py --path /path/to/dir

Here are some important points to consider when executing the script:

To view your data, navigate to http://localhost:8529/ and log in using your username and password. You can find these credentials in config/indaleko-db-config.ini under user_name and user_password.

License

    Indaleko Project README file
    Copyright (C) 2024 Tony Mason

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU Affero General Public License as published
    by the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU Affero General Public License for more details.

    You should have received a copy of the GNU Affero General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

Tooling

Note: added as of October 18, 2024, as I migrate the project towards modern tooling.

UV

uv is a pip-replacement package manager that I've started to use; you can install it from the uv website. It also handles virtual environments.