This project consists of a series of extract-transform-load (ETL) pipelines for adding common sense knowledge triples to the MOWGLI Common Sense Knowledge Graph (CSKG).
The CSKG is used by downstream applications such as question answering systems and knowledge graph browsers. The graph consists of nodes and edges serialized in KGTK edge format, which is a specialization of the general KGTK format.
ConceptNet serves as the core of the CSKG, and other sources such as Wikidata are linked to it. The majority of the predicates/relations in the CSKG are reused from ConceptNet.
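For orientation, a KGTK edge file is a tab-separated table whose core columns are `node1`, `label` (the relation), and `node2`. The rows below are illustrative only, using ConceptNet-style identifiers:

```
node1	label	node2
/c/en/cat	/r/IsA	/c/en/animal
/c/en/cat	/r/CapableOf	/c/en/purr
```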
From the current directory:

```
python3 -m venv venv
```

On Unix:

```
source venv/bin/activate
```

On Windows:

```
venv\Scripts\activate
```

Then install the dependencies:

```
pip install -r requirements.txt
```
The framework uses LevelDB for whole-graph operations such as duplicate checking.

On OS X:

```
brew install leveldb
CFLAGS=-I$(brew --prefix)/include LDFLAGS=-L$(brew --prefix)/lib pip install plyvel
```

On Linux:

```
pip install plyvel
```

The RDF loader can use the rdflib "Sleepycat" store if the `bsddb3` module is present. On Linux:

```
pip install bsddb3
```
Activate the virtual environment as above, then run:

```
pytest
```
Activate the virtual environment as above, then run:

```
python3 -m mowgli_etl.cli etl rpi_combined
```

to run all of the available pipelines and combine their output.
The extract, transform, and load stages of the pipelines write data to the `data` directory. (The path to this directory can be changed on the command line.) The structure of the `data` directory is `data/<pipeline id>/<stage>`. For example, `data/swow/loaded` holds the final products of the `swow` pipeline.

The `rpi_combined` pipeline "loads" the outputs of the other pipelines into its `data/rpi_combined/loaded` directory in the CSKG CSV format.
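For illustration, a populated `data` directory might look like this (`extracted` and `loaded` are stage names used elsewhere in this README; other stages may exist):

```
data/
├── swow/
│   ├── extracted/
│   └── loaded/
└── rpi_combined/
    └── loaded/
```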
The mowgli-etl code base consists of:

* the `swow` pipeline, for the Small World of Words word association lexicon

A pipeline consists of:

* an extractor (a subclass of the `_Extractor` abstract base class)
* a transformer (a subclass of the `_Transformer` abstract base class)
* a loader (a `_Loader` subclass), which is usually not explicitly specified by pipelines; a default is provided instead
* a `_Pipeline` subclass that ties everything together

Running a pipeline with a command such as

```
python3 -m mowgli_etl.cli etl swow
```

initiates the following process, where `swow` is the pipeline id:
1. The pipeline is instantiated from `mowgli_etl.pipeline.swow.swow_pipeline` (or adapted from another pipeline id).
2. The `extract` method of the `extractor` on the pipeline is called. See the docstring of `_Extractor.extract` for information on the contract of `extract`.
3. The `transform` method of the `transformer` on the pipeline is called, passing in a `**kwds` dictionary returned by `extract`. See the docstring of `_Transformer.transform` for more information.
4. The `transform` method is a generator for a sequence of models, typically `KgEdge`s and `KgNode`s, to add to the CSKG. This generator is passed to the loader, which iterates over it, loading data as it goes. For example, the default KGTK loader buffers nodes and appends edge rows to an output KGTK file. This loading process does not usually need to be handled by pipeline implementations, most of which rely on the default loader.
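As a rough sketch of how these pieces fit together (all `My*` names are hypothetical, the import paths and method signatures are assumptions, and the abstract base classes' docstrings define the real contracts):

```python
from pathlib import Path
from typing import Generator

# The import paths below are assumptions for illustration; see the swow
# pipeline for where these classes actually live.
from mowgli_etl._extractor import _Extractor
from mowgli_etl._pipeline import _Pipeline
from mowgli_etl._transformer import _Transformer


class MyExtractor(_Extractor):
    def extract(self, *, force: bool, storage) -> dict:
        # Locate (or download and cache) the source data, then hand it to the
        # transformer as a **kwds dictionary.
        return {"path_to_file": storage.extracted_data_dir_path / "my_data.csv"}


class MyTransformer(_Transformer):
    def transform(self, *, path_to_file: Path) -> Generator:
        # Parse the extracted data and yield KgNode/KgEdge models; the default
        # loader consumes this generator and writes KGTK output.
        yield from ()


class MyPipeline(_Pipeline):
    def __init__(self):
        # The exact _Pipeline constructor signature is an assumption.
        super().__init__(
            extractor=MyExtractor(),
            id="my_pipeline",
            transformer=MyTransformer(),
        )
```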
The mowgli-etl code makes extensive use of contemporary Python language and standard library features, including:

* the `typing` module, especially `NamedTuple`
* `dataclasses`
* keyword-only arguments (`def f(*, x, y)`) and `**kwds` keyword variadic arguments
* the `abc` module
* the `pathlib` module
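To make the list above concrete, a brief illustrative snippet (not taken from the code base):

```python
import dataclasses
from abc import ABC, abstractmethod
from pathlib import Path
from typing import NamedTuple


class NodeExample(NamedTuple):  # hypothetical model, for illustration only
    id: str
    label: str


@dataclasses.dataclass
class ConfigExample:  # hypothetical
    data_dir_path: Path


class BaseExample(ABC):
    @abstractmethod
    def run(self, *, config: ConfigExample, **kwds) -> None:
        # Keyword-only arguments plus **kwds, as in the pipeline methods.
        ...
```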
We follow PEP8 and the Google Python Style Guide, preferring the former where the two are inconsistent. We encourage using an IDE such as PyCharm. Please format your code with Black before committing it. The formatter can be integrated into most editors to format on save.
Most code should be part of a class. There should be one class per file, and the file should be named after the class (`SomeClass` as `some_class.py`).
The `swow` pipeline is the best model for new pipelines.
Extractors typically work in one of two ways:

* The data is checked into the repository in the per-pipeline `data` subdirectory. This is the best approach for smaller data sets that change infrequently.
* The data is downloaded when the `extract` method is called. The data can be cached in the per-pipeline `data` subdirectory and reused if `force` is not specified. Cached data should be `.gitignore`d. Use an implementation of `EtlHttpClient` rather than using `urllib`, `requests`, or another HTTP client directly. This makes it easier to mock the HTTP client in unit tests.

The `extract` method receives a `storage` parameter that points to a `PipelineStorage` instance, which has the path to the appropriate subdirectory of `data`. Extractors should use this path (`storage.extracted_data_dir_path`) rather than trying to locate `data` directly, since the path to `data` can be changed on the command line.
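A minimal sketch of the download-and-cache pattern described above (hypothetical names; the HTTP call is deliberately elided because `EtlHttpClient`'s exact API isn't shown in this README):

```python
from pathlib import Path

from mowgli_etl._extractor import _Extractor  # import path assumed, as above


class MyDownloadingExtractor(_Extractor):
    def extract(self, *, force: bool, storage) -> dict:
        cached_file_path: Path = storage.extracted_data_dir_path / "source.csv"
        # Reuse the cached (and .gitignored) copy unless force is specified.
        if force or not cached_file_path.exists():
            # Download with an EtlHttpClient implementation here, so that unit
            # tests can mock the HTTP client; the exact call is elided.
            ...
        return {"path_to_file": cached_file_path}
```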
Once the data is available, the extractor must pass it to the transformer by returning a `**kwds` dictionary. This is typically done in one of two ways:

* Return a path to a file, such as `{"path_to_file": Path("the/file/path")}`, from `extract`, so that `transform` is `def transform(self, *, path_to_file: Path)`. This is the preferred approach for large files.
* Return the data itself, such as `{"file_data": "..."}`, in which case `transform` is `def transform(self, *, file_data: str)` or similar. This is acceptable for small data.
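Side by side, the two conventions pair each `extract` return value with a matching `transform` signature (a sketch only; the methods are shown outside their enclosing classes and their bodies are elided):

```python
from pathlib import Path

# Convention 1: return a path to a file (preferred for large files).
def extract(self, *, force, storage):
    return {"path_to_file": Path("the/file/path")}

def transform(self, *, path_to_file: Path):
    ...

# Convention 2: return the data itself (acceptable for small data).
def extract(self, *, force, storage):
    return {"file_data": "..."}

def transform(self, *, file_data: str):
    ...
```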
Given extracted data in one of the forms listed above, the transformer's task is to produce `KgEdge` and `KgNode` models that capture the data.

Transformers can be implemented in a variety of ways, as long as they conform to the `_Transformer` abstract base class. For example, in many implementations the top-level `transform` method delegates to multiple private helper methods or helper classes. It is easier to test the code if the logic of the transformer is broken up into relatively small methods that can be tested individually, rather than one large `transform` method with many branches.
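For instance, a `transform` that delegates to a small private helper might look like this (hypothetical names and input format):

```python
from mowgli_etl._transformer import _Transformer  # import path assumed, as above


class MyTransformer(_Transformer):
    def transform(self, *, file_data: str):
        # Keep the top-level method thin...
        for line in file_data.splitlines():
            yield from self.__transform_line(line)

    def __transform_line(self, line: str):
        # ...and put the real logic in small, individually testable helpers
        # that parse one unit of input and yield the models it implies.
        yield from ()
```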
Note that `KgEdge` and `KgNode` have legacy factory classmethods (`.legacy` in both cases) corresponding to an older data model. These should not be used in new code. New code should instantiate the models directly or use one of the other factory classmethods as a convenience.
The `swow` pipeline tests in `tests/mowgli_etl_test/pipeline/swow` can be used as a model for how to test a pipeline. Familiarity with the `pytest` framework is necessary.
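A transformer unit test might look roughly like this (hypothetical names and sample data; the swow tests show the real patterns):

```python
def test_transform():
    transformer = MyTransformer()  # the transformer under test
    models = list(transformer.transform(file_data="cat\tanimal\n"))
    # Assert on the yielded KgNode/KgEdge models, e.g. expected counts and ids.
    assert len(models) > 0
```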
We use the GitHub flow with feature branches on this code base. Branches should be named after an issue in the issue tracker (e.g., `GH-###`) or otherwise linked to one. Please tag a staff person for code reviews, and re-tag when you have addressed the staff person's comments in the code or rebutted them in the PR. See the Google Code Review Developer Guide for more information on code reviews.
We use CircleCI for continuous integration. CircleCI runs the tests in `tests/` on every push to `origin`. Merging a feature branch is contingent on having adequate tests and all tests passing. We encourage test-driven development.