October 18, 2024
There have been some changes in terminology, and I expect the project to consolidate around the new terms.
In general, data-gathering pipelines are divided into two components: a collector, which gathers the information, and a recorder, which translates the gathered information into a normalized form that can then be inserted into the database.
For example, the "indexer" for file system metadata is logically a "collector" of the information, while the ingester is logically a recorder. Sometimes these stages are combined, and sometimes they are further subdivided. For example, the local file system ingesters ("recorders") often emit data into a file for bulk uploading.
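To make the split concrete, here is a minimal sketch of the collector/recorder division of labor. The class names and fields are illustrative only and are not the project's actual implementation:

# Illustrative sketch of the collector/recorder split; class and field names
# are hypothetical, not the actual Indaleko implementation.
import json
import os
from datetime import datetime, timezone

class FileSystemCollector:
    """Gathers raw metadata about storage objects (no normalization)."""
    def __init__(self, root: str):
        self.root = root

    def collect(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    stat = os.stat(path)
                except OSError:
                    continue
                # Raw, platform-specific data, captured as-is.
                yield {"path": path, "st_size": stat.st_size, "st_mtime": stat.st_mtime}

class FileSystemRecorder:
    """Translates collected data into a normalized form for the database."""
    def record(self, raw: dict) -> dict:
        return {
            "URI": raw["path"],
            "Size": raw["st_size"],
            "Modified": datetime.fromtimestamp(raw["st_mtime"], tz=timezone.utc).isoformat(),
        }

if __name__ == "__main__":
    collector = FileSystemCollector(os.path.expanduser("~"))
    recorder = FileSystemRecorder()
    # Emit normalized records to a file for bulk uploading, as the local
    # file system recorders often do.
    with open("normalized.jsonl", "w", encoding="utf-8") as out:
        for raw in collector.collect():
            out.write(json.dumps(recorder.record(raw)) + "\n")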
Some of this is now reflected in the naming system (notably in the activity area of the project.)
I have also removed requirements.txt from the project. There is a pyproject.toml file instead, which captures dependencies. I added a setup_env.py script as well.
The setup_env.py script will set up a virtual environment for you. It will restrict you to using Python 3.12 or newer for the project, and it will download and install the "uv" utility for managing dependencies and configuring a virtual environment. Since this is new, it may not work properly in other environments; please let me know and I'll work with you to get it working. So far, I've tested it on Windows and Linux.
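As a rough illustration only (this is not the actual setup_env.py, and the exact steps it performs may differ), the script does the equivalent of the following:

# Hypothetical sketch of the kind of work setup_env.py does; consult the
# actual script for the authoritative steps.
import subprocess
import sys

if sys.version_info < (3, 12):
    sys.exit("Indaleko requires Python 3.12 or newer")

# Install uv, then use it to create a virtual environment and install the
# project's dependencies from pyproject.toml.
subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)
subprocess.run(["uv", "venv", ".venv"], check=True)
subprocess.run(["uv", "pip", "install", "-e", "."], check=True)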
Project Indaleko is about creating a Unified Personal Index. The key characteristics of the UPI model are:
Indexing storage in a uniform fashion, regardless of where or what is being stored. Primarily, this means that we collect information and normalize it for local and cloud storage devices.
Utilizing semantic transducers to obtain information about content. The term "semantic transducer" was introduced by Gifford in the Semantic File System (SFS) project in the early 1990s and remains an important concept in indexing systems today.
Collecting and associating extrinsic information about how storage objects are used. We call this extrinsic information "activity context" because it relates to other activities that are ongoing and correlate with storage. For example: the location of the computer (and hence the user) when a file is created, the weather conditions, websites being visited contemporaneously with file access and/or creation, the mood of a human user creating content, and interactions between human users (e.g., files you accessed while you were interacting with another user.)
The goal of this research artifact is to demonstrate that having a unified private index with rich semantic and activity data provides a robust base on which to build "personal archiving tools" that enable easier finding of relevant information.
Indaleko is designed around a modular architecture. The goals of this architecture are to break down processing into discrete components, both for ease of implementation as well as flexibility in supporting a variety of devices, storage platforms, and semantic transducers.
Logically, the project is broken down into various components:
An Indexer is simply a component that identifies storage objects of interest. For example, we have indexers that walk a collection of local storage devices and collect basic storage information about the objects stored on each device. There is no requirement that the captured data be in any particular format. The motivation for this is that different systems return different information: there are subtle distinctions in how the information is represented, and while there is commonality amongst the metadata, there are sufficient differences that building a universal indexer is a complex task. That "complex task" is, ultimately, one that Indaleko provides at some level. In our current implementation, indexers do not interact (or only minimally interact) with the index database.
An Ingester (note, this may not be the best name) is a component that processes indexer output. There is a many-to-one relationship between ingesters and indexers. In our model, "ingestion" is the act of taking data from an indexer and extracting useful metadata from that output. While it might seem logical to combine the indexer and ingester together - something we did in earlier versions - we chose to split them for similar reasons that we have distinct indexers. By separating them, we allow specialized ingesters that can process a given indexer's output in a specific way. For example, there is generally an indexer-specific ingester that understands how to normalize the metadata captured by the indexer and then store it in the index database. This allows us to use a common normalized model, with the responsibility for converting the data into that normalized form implemented in a storage-specific fashion. Ingesters, however, can also provide additional metadata. For example, an ingester agent could run one or more semantic transducers, elements that extract information about the contents of the file. Examples might include:
Note that in each of these cases, the benefits of using a semantic transducer stem primarily from the proximity of the file data on the specific device. Once the data is no longer resident on the local device, as in the case of cloud storage, it becomes resource intensive to fetch the files and extract additional metadata.
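As an illustration of what a semantic transducer might look like in this model (a hypothetical example, not one of the project's transducers), consider a transducer that computes a checksum and guesses the MIME type of a local file:

# Hypothetical semantic transducer: extracts content-derived metadata from a
# local file. Cheap while the file is on the local device; expensive if the
# file must first be fetched from cloud storage.
import hashlib
import mimetypes

def checksum_transducer(path: str) -> dict:
    sha256 = hashlib.sha256()
    with open(path, "rb") as data:
        for chunk in iter(lambda: data.read(1024 * 1024), b""):
            sha256.update(chunk)
    mime_type, _ = mimetypes.guess_type(path)
    return {"SHA256": sha256.hexdigest(), "MimeType": mime_type}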
The Indexer database. This is the Unified Personal Index service. While we have chosen to implement it using ArangoDB, it could be implemented on other database technologies, whether in tandem or as a replacement.
The activity context components. The concept of Activity Context is related to the observation that human activities are entwined with their use of storage. At the present time, storage engines do not associate related human activity with storage objects. Associating human activity with storage objects and storage activity is one of the key goals of Indaleko. The activity context aspects of Indaleko break down into multiple components:
Indaleko does not define what that activity data is, but rather provides a framework for capturing it and utilizing it to find human-related associations with storage objects. While we know that such data is useful in augmenting personal data search (see Searching Heterogeneous Personal Data, for example), we do not know the full range of such data that could be useful. Thus, this model encourages the development and evaluation of such activity data source providers.
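To illustrate the framework idea, an activity data provider might report observations shaped roughly like the sketch below. This is purely illustrative; the names and fields are hypothetical and not Indaleko's actual interfaces:

# Hypothetical shape of an activity context observation; the real framework
# defines its own provider registration and data models.
import datetime
import uuid
from dataclasses import dataclass, field

@dataclass
class ActivityObservation:
    provider: uuid.UUID                     # which activity data provider produced this
    timestamp: datetime.datetime            # when the activity was observed
    attributes: dict = field(default_factory=dict)  # provider-specific data, e.g. location or weather

location_provider = uuid.uuid4()            # a real provider would use a fixed UUID
observation = ActivityObservation(
    provider=location_provider,
    timestamp=datetime.datetime.now(datetime.timezone.utc),
    attributes={"latitude": 49.26, "longitude": -123.25},
)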
The current project design is focused on evaluating the practicality and efficacy of improving the "finding" of relevant digital data in a systematic fashion that works across user devices in a dynamic storage environment mixing local devices with cloud storage and application quasi-storage. The architecture reflects a design philosophy of modular components with easy extensibility.
The current implementation consists primarily of a collection of Python scripts that interact with an Arango database. While in prior work we used a mixture of languages, we chose Python for the current iteration because it provided a robust model for constructing our prototype.
The implementation is organized around a set of classes. The fundamental class associated with information stored in the database is the Record class, which defines a small amount of information that should be present in everything we store in the database: the original captured data (the "raw data"), attributes extracted directly or indirectly (the "attributes"), the source of the information (a UUID identifier and a version number), and a timestamp of when the relevant information was captured.
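A rough sketch of the idea follows; the field names here are illustrative, and the actual Record class is the authoritative definition:

# Illustrative sketch of the Record concept described above; not the actual
# Indaleko Record class.
import datetime
import uuid
from dataclasses import dataclass

@dataclass
class Record:
    raw_data: bytes                  # the original captured data
    attributes: dict                 # attributes extracted directly or indirectly
    source_identifier: uuid.UUID     # which component produced this record
    source_version: str              # version of that component
    timestamp: datetime.datetime     # when the information was captured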
The components map to various elements of the architecture:
This prototype system is still under active development. It would be surprising if it does not continue to change as the project moves forward.
Last Updated: January 16, 2024
In this section, we'll talk about how to set up your system to use Indaleko. The process is a combination of manual and automated steps.
Things you should have installed:
Docker: this is needed because we use ArangoDB and run it in a containerized environment. The data is stored on the local host system. While it is possible to configure this to use a remote database, that step is not presently automated.
Python: this is needed to run the Indaleko scripts. Note there are a number of libraries that need to be installed. There is a requirements.txt file that captures the current configuration we have been using, though it may work with other versions of the various libraries. (Per the October 18, 2024 note above, dependencies are now captured in pyproject.toml instead.) It is distinctly possible we've added a dependency and failed to capture it, in which case please open an issue and/or a pull request.
Powershell: this is Windows only. There is a PowerShell script that gathers configuration information about your Windows machine. It requires elevation ("administrative privileges") and you must enable running PowerShell scripts (which is disabled by default.) The script writes data into the config directory, where it is then parsed and extracted by the setup scripts.
ArangoDB Client Tools: in order to upload the files into Arango, you need to install the ArangoDB client tools on your system. There are versions for Windows, MacOS X, and Linux. Note: you should not run the ArangoDB database locally; keep it in the container to minimize compatibility issues. This may require manually disabling the locally installed database service (this was required on Windows, for example.)
The simplest way to set up the database is to use the dbsetup.py script. It currently supports three commands:
Note that if you run the script without arguments it will choose to either check your existing database (if it exists) or set one up (if it does not.)
As part of configuration, the script generates a config file that is stored in the config directory. Note that this is a sensitive file and will not be checked into git by default (it is in .gitignore). If you lose this file, you will need to change your container to use a new (correct) password. Your data is not encrypted at rest by the database.
This script will pull the most recent version of the ArangoDB docker image, provision a shared volume for storing the database, and create a random password for the root account, which is stored in the config file. It also creates an Indaleko account with a separate password that has access only to the Indaleko database. It will create the various collections used by Indaleko, including their schemas. Most scripts run using only the Indaleko account.
To look at the various options for this script, you can use the --help command. By default this script tries to "do the right thing" when you first invoke it (part of our philosophy of making the tool as easy as possible for new users.)
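For example, a first-time setup and a look at the available options look like this (assuming you invoke the script from the directory that contains it):
python dbsetup.py
python dbsetup.py --help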
You can confirm the database is set up and running by accessing your ArangoDB local database connection. You can extract the password from the indaleko-db-config.ini file, which is located in the config directory by default. Do not distribute this file; it contains passwords for your database.
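If you prefer to read the credentials programmatically, something like the following works. This is only an illustration; the section and key names vary, so the snippet simply prints whatever the file contains:

# Read the generated database configuration; key names vary, so this just
# prints whatever sections and values the file contains.
import configparser

config = configparser.ConfigParser()
config.read("config/indaleko-db-config.ini")
for section in config.sections():
    print(f"[{section}]")
    for key, value in config[section].items():
        print(f"  {key} = {value}")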
Note: the scripts IndalekoDocker.py and IndalekoDBConfig.py have additional functionality for managing docker images and databases.
Note that there are currently three platforms we support: Windows, Mac OS X, and Linux.
The following sections will describe how to configure the various systems.
To install your machine configuration, you should run the correct configuration script for your system. Currently there are three scripts defined:
IndalekoLinuxMachineConfig.py - this script captures and stores machine configuration for a Linux machine. As of this writing, this script is under development. This note should be updated when it works.
IndalekoMacMachineConfig.py - this script captures and stores machine configuration for a Mac OS X machine.
IndalekoWindowsMachineConfig.py - this script captures and stores machine configuration for a Windows machine.
Machine configuration information is likely to change. Currently we are capturing:
For the moment we aren't requiring any of this. When we have volume information, we associate it with the file via a UUID for the volume. Note: Windows calls them GUIDs ("Globally Unique Identifiers") but they are UUIDs ("Universally Unique Identifiers").
To add the machine configuration to the database you can run the correct script on your machine. Some machines may require a pre-requisite step. For example, Windows requires executing the script windows-hardware-info.ps1 to capture the machine information, some of which requires elevated privileges to obtain. Other systems may have similar requirements.
Assuming any pre-requisite script has been run, you can load the configuration data into the database with something like the following:
python3 IndalekoWindowsMachineConfig.py --add
Note: use the correct script for your platform. The incorrect script should report an error. Issue #32 is a suggestion to improve this so that one can just use the IndalekoMachineConfig.py script directly and have it "do the right thing".
There are multiple steps required to set up Indaleko on your Windows machine. Assuming you have installed the database, you should be able to index and ingest the data on your local machine.
Capturing the system configuration on Windows is done using the PowerShell script windows-hardware-info.ps1, which must be run with administrative privileges (the script is explicitly set to require this, since some of the commands fail otherwise). There are many resources available explaining how to do this. Here is a video, 3 easy ways to run Windows Powershell as admin on Windows 10 and 11, but it's certainly not the only resource.
Note: the output is written into the config directory, which is not saved to git (the entire directory is excluded in .gitignore). While you can override this, doing so is not recommended due to the sensitive information captured by this script.
Once you have captured your configuration information, you can run the Python script IndalekoWindowsMachineConfig.py. This script will locate and parse the file that was saved by the PowerShell script and insert it into the database. The script has various override options, but aims to "do the right thing" if you run it without arguments. To see the arguments, you can use the --help option.
Once your machine configuration has been saved, you can begin creating data index files. This is done by executing the Python script IndalekoWindowsLocalIndexer.py using your installed version of Python. By default, this will index your home directory, which is usually something like C:\Users\MyName. If you want to override this you can use the --path option. You can see all of the override options by using the --help command.
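For example, to index a specific directory instead of the default home directory (the path shown here is just a placeholder):
python IndalekoWindowsLocalIndexer.py --path C:\Users\MyName\Documents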
This script will write the output index file to the data directory. Note that this directory is excluded from checkin to git by default, as it is listed in the .gitignore file. Logs (if any) will be written to the logs directory by default.
Without any options given, it will write the file with a structured name that includes the platform, machine id, volume id, and the timestamp of when the data was captured.
The index data can be used in subsequent ingestion steps.
An ingester is an agent that takes the indexing data you have previously captured and then performs additional analysis on it. This is the step that loads data into the database. As of this writing, there is only a single ingester written for Windows, which is the script IndalekoWindowsLocalIngester.py. This script knows the format of the index file output, retrieves it, normalizes the data that was captured by the indexer, and then writes out the resulting data.
By default, it will take one of the data files (ideally the most recent) and ingest it. The output of this is a set of files that can be manually loaded into the database. The files generated have long names, but those names capture information about the ingested data. Note that the timestamp of the output file will match the timestamp of the index file unless you override it.
There are many override options. To see your options you can use the --help command. This command will also show you which file it will ingest unless you override it.
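In the common case you can simply run it without arguments, and it will pick the most recent index file:
python IndalekoWindowsLocalIngester.py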
While the ingestion script does write a small amount of data to the database, it writes most of its output to intermediate files in order to allow bulk uploading. The bulk uploader requires the arangoimport tool, which was installed with the ArangoDB client tools package.
There are two output files. One represents file and directory metadata; this is uploaded to the Objects collection, which must be specified on the command line.
arangoimport -c Objects <name of file with metadata>.jsonl
We use the JSON Lines (jsonl) format for these files. Depending upon the size of your file, this uploading process can take considerable time.
The second file represents the relationships between the objects; this is uploaded to the Relationships collection, which also must be specified on the command line. Note that these collections should already exist inside the Arango database.
arangoimport -c Relationships <name of file with metadata>.jsonl
The arangoimport tool will tell you how many objects were successfully inserted. This should show no errors or warnings. If it does, there is an issue that will need to be resolved before using the rest of the Indaleko facilities.
This section describes how to set up Indaleko on MacOS X.
Run MacHardwareInfoGenerator.py to capture the configuration of your Mac. The output is saved in the config directory. It records metadata about your Mac, including the names and sizes of the volumes, hardware information, etc.
python MacHardwareInfoGenerator.py -d ./config
The output will be saved inside the config directory with the name pattern macos-hardware-info-[GUID]-[TIMESTAMP].json. The following is a sample of what you should see:
{
"MachineGuid": "74457f40-621b-444b-950b-21d8b943b28e",
"OperatingSystem": {
"Caption": "macOS",
"OSArchitecture": "arm64",
"Version": "20.6.0"
},
"CPU": {
"Name": "arm",
"Cores": 8
},
"VolumeInfo": [
{
"UniqueId": "/dev/disk3s1s1",
"VolumeName": "disk3s1s1",
"Size": "228.27 GB",
"Filesystem": "apfs"
},
{
"UniqueId": "/dev/disk3s6",
"VolumeName": "disk3s6",
"Size": "228.27 GB",
"Filesystem": "apfs"
}
]
}
Once you have captured the configuration, the first step is to index your storage.
The next step is ingestion, which takes the raw indexing data, normalizes it, and captures it into files that can be bulk uploaded into the database. Future versions may automate more of this pipeline.
After ingesting the index data, it is necessary to ensure that what ended up in the database is what we want, especially in terms of the relationships we define. This matters most during development; it can generally be skipped when simply using the tool.
There is a validators package that contains the code and scripts for validation. The main validator code is IndalekoIngesterValidator.py. The scripts in the package are used to extract rules that should be checked against the ingested data. The current validator performs the following checks:
Validates the number of distinct file types, i.e., different st_mode values, to be exactly the same as what we have seen in the index file.
Validates the Contains and Contained By relationships for each folder. The current version only validates the number of children rather than an exact string match.
Here's how we can use it:
The extraction script accepts both json and jsonl index files. Run it from the validators directory:
validators$ extract_validation.sh /path/to/the/index_file
The script creates a validations.jsonl file inside the data folder, where each line is a rule to be checked. Here are three examples of these rules:
{"type":"count","field":"st_mode","value":16859,"count":1}
{"type":"contains","parent_uri":"/Users/sinaee/.azuredatastudio","children_uri":["/Users/sinaee/.azuredatastudio/extensions","/Users/sinaee/.azuredatastudio/argv.json"]}
{"type":"contains","parent_uri":"/Users/sinaee/.azuredatastudio","children_uri":["/Users/sinaee/.azuredatastudio/extensions","/Users/sinaee/.azuredatastudio/argv.json"]}
{"type":"contained_by","child_uri":"/Users/sinaee/.azuredatastudio/extensions/microsoft.azuredatastudio-postgresql-0.2.7/node_modules/dashdash/package.json","parent_uris":["/Users/sinaee/.azuredatastudio/extensions/microsoft.azuredatastudio-postgresql-0.2.7/node_modules/dashdash"]}
validators$ python IndalekoIngesterValidator.py -c /Users/sinaee/Projects/Indaleko/config/indaleko-db-config.ini -f ./data/validations.jsonl
You should not see any errors; the skipping messages are fine.
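For reference, a few lines of Python are enough to inspect the rules file. This is only an illustration of the rule format shown above, not the validator's implementation:

# Print a human-readable summary of the validation rules; illustrative only.
import json

with open("data/validations.jsonl", "r", encoding="utf-8") as rules:
    for line in rules:
        rule = json.loads(line)
        if rule["type"] == "count":
            print(f"count rule: field={rule['field']} value={rule['value']} count={rule['count']}")
        elif rule["type"] == "contains":
            print(f"{rule['parent_uri']} should contain {len(rule['children_uri'])} children")
        elif rule["type"] == "contained_by":
            print(f"{rule['child_uri']} should be contained by {rule['parent_uris']}")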
To execute the full pipeline, make sure you have installed the necessary prerequisites for this project: docker and python 3.12.
To execute the pipeline, run run.py with the directory you want to index, for instance:
$ python run.py --path /path/to/dir
Here are some important points to consider when executing the script:
Use --help to view the available options.
Indexed data persists in the database across runs of run.py. Consequently, re-indexing the same folder generates warnings; existing records are not updated at present.
--reset is an argument for run.py that removes all available collections before ingesting new data. Consequently, using it results in the loss of previously indexed data.
To view your data, navigate to http://localhost:8529/ and log in using your username and password. You can find these credentials in config/indaleko-db-config.ini under user_name and user_password.
Indaleko Project README file
Copyright (C) 2024 Tony Mason
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Note (as of October 18, 2024): I am adding this as I migrate towards modern tooling for the project.
uv is a pip replacement package manager that I've started to use. You can install it from the uv website. It also handles virtual environments.