The OpenRefine Python Client from PaulMakepeace provides a library for communicating with an OpenRefine server. This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, macOS). It is also available via Docker Hub, PyPI and Binder.
Works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4, 3.4.1 and 3.5.0.
One-file-executables:
For Docker containers, a native Python installation, and the free Binder on-demand server, see the corresponding chapters below.
A short video loop that demonstrates the basic features (list, create, apply, export):
Ensure you have OpenRefine running (i.e. available at http://localhost:3333 or another URL).
To use the client:
Open a terminal pointing to the folder where you have downloaded the one-file-executable (e.g. Downloads in your home directory).
Windows: Open PowerShell and enter the following command:
cd ~\Downloads
macOS: Open Terminal (Finder > Applications > Utilities > Terminal) and enter the following command:
cd ~/Downloads
Linux: Open a terminal app (Terminal, Konsole, xterm, ...) and enter the following command:
cd ~/Downloads
Make the file executable.
Windows: not necessary
macOS:
chmod +x openrefine-client_0-3-10_macos
Linux:
chmod +x openrefine-client_0-3-10_linux
Execute the file.
Windows:
.\openrefine-client_0-3-10_windows.exe
macOS:
./openrefine-client_0-3-10_macos
Linux:
./openrefine-client_0-3-10_linux
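If you call the executable from a script, the platform-specific filename can be chosen automatically. The following is a minimal sketch, assuming the 0.3.10 filenames shown above; `client_filename` and `EXECUTABLES` are hypothetical names for illustration and are not part of openrefine-client.

```python
import platform

# Filenames as used in this README for release 0.3.10 (adjust for other releases).
EXECUTABLES = {
    "Windows": "openrefine-client_0-3-10_windows.exe",
    "Darwin": "openrefine-client_0-3-10_macos",
    "Linux": "openrefine-client_0-3-10_linux",
}


def client_filename(system=None):
    """Return the one-file-executable name for the given (or current) OS."""
    system = system or platform.system()
    try:
        return EXECUTABLES[system]
    except KeyError:
        raise ValueError("no one-file-executable for platform: %s" % system)
```

Calling `client_filename()` without arguments uses the value of `platform.system()` on the machine running the script.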
Using tab completion (↹) and command history (↑) is highly recommended.
Execute the client by entering its filename followed by the desired command.
The following example will download two small files (duplicates.csv and duplicates-deletion.json) into the current directory and will create a new OpenRefine project from file duplicates.csv.
Download example data (--download) and create project from file (--create):
Windows:
.\openrefine-client_0-3-10_windows.exe --download "https://git.io/fj5hF" --output=duplicates.csv
.\openrefine-client_0-3-10_windows.exe --download "https://git.io/fj5ju" --output=duplicates-deletion.json
.\openrefine-client_0-3-10_windows.exe --create duplicates.csv
macOS:
./openrefine-client_0-3-10_macos --download "https://git.io/fj5hF" --output=duplicates.csv
./openrefine-client_0-3-10_macos --download "https://git.io/fj5ju" --output=duplicates-deletion.json
./openrefine-client_0-3-10_macos --create duplicates.csv
Linux:
./openrefine-client_0-3-10_linux --download "https://git.io/fj5hF" --output=duplicates.csv
./openrefine-client_0-3-10_linux --download "https://git.io/fj5ju" --output=duplicates-deletion.json
./openrefine-client_0-3-10_linux --create duplicates.csv
Other commands:
--list
--info "duplicates"
--export "duplicates"
--apply duplicates-deletion.json "duplicates"
--export --output=deduped.xls "duplicates"
--delete "duplicates"
Check --help for further options.
Please file an issue if you miss a feature in the command line interface or if you have found a bug. You are also welcome to ask any questions!
By default the client connects to the usual URL of OpenRefine http://localhost:3333. If your OpenRefine server is running somewhere else then you may set hostname and port with additional command line options (e.g. http://example.com):
-H example.com
-P 80
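When the server address is only known as a URL, the -H and -P options can be derived from it. This is a minimal Python 3 sketch; `connection_options` is a hypothetical helper (not part of the client), and it assumes plain http with port 80 as the fallback when the URL names no port.

```python
from urllib.parse import urlsplit


def connection_options(url):
    """Translate an OpenRefine URL into -H/-P options for the client."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError("URL must include a scheme and host: %r" % url)
    # Assumption: plain http, so fall back to port 80 when none is given.
    port = parts.port or 80
    return ["-H", parts.hostname, "-P", str(port)]
```

For http://example.com this yields the two options shown above; for http://localhost:3333 it yields `-H localhost -P 3333`.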
The OpenRefine Templating supports exporting data in any text format (i.e. to construct JSON or XML). The graphical user interface offers four input fields:
{{jsonize(cells["name"].value)}}
This templating functionality is available via the openrefine-client command line interface. It even provides an additional feature for splitting results into multiple files.
To try out the functionality create another project from the example file above.
--create duplicates.csv --projectName=advanced
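The following example exports the columns "name" and "purchase" in JSON format from the project "advanced", restricted to rows matching the regex text filter ^F$ in column "gender".
macOS/Linux Terminal (multi-line input with \):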
"advanced" \
--prefix='{ "events" : [
' \
--template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender'
Windows PowerShell (multi-line input with `; quotes need to be doubled):
"advanced" `
--prefix='{ ""events"" : [
' `
--template=' { ""name"" : {{jsonize(cells[""name""].value)}}, ""purchase"" : {{jsonize(cells[""purchase""].value)}} }' `
--rowSeparator=',
' `
--suffix='
] }' `
--filterQuery='^F$' `
--filterColumn='gender'
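To see how the four template parts fit together, the export can be imitated locally. The sketch below is only an illustration under stated assumptions: it assumes the output is the prefix, then the rendered rows joined by the row separator, then the suffix; `json.dumps` stands in for `jsonize()`; and the sample rows are made up (they are not the contents of duplicates.csv). `render` is a hypothetical function, not part of the client.

```python
import json
import re


def render(rows, filter_column, filter_query):
    """Imitate the export: prefix + rows joined by rowSeparator + suffix."""
    prefix = '{ "events" : [\n'
    row_separator = ",\n"
    suffix = "\n] }"
    pattern = re.compile(filter_query)
    rendered = [
        # json.dumps stands in for the jsonize() call in the template.
        ' { "name" : %s, "purchase" : %s }'
        % (json.dumps(row["name"]), json.dumps(row["purchase"]))
        for row in rows
        if pattern.search(row[filter_column])
    ]
    return prefix + row_separator.join(rendered) + suffix


# Made-up sample rows for illustration only.
rows = [
    {"name": "Alice Example", "purchase": "TV", "gender": "F"},
    {"name": "Bob Example", "purchase": "Radio", "gender": "M"},
    {"name": "Carol Example", "purchase": "Phone", "gender": "F"},
]
output = render(rows, filter_column="gender", filter_query="^F$")
events = json.loads(output)["events"]  # the assembled text is valid JSON
```

Because the row separator is only placed between rows, the assembled text parses as JSON without a trailing comma.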
Add the following options to the last command (recall with ↑) to store the results in multiple files.
Each file will contain the prefix, one processed row, and the suffix.
--output=advanced.json --splitToFiles=true
Filenames are suffixed with the row number by default (e.g. advanced_1.json, advanced_2.json etc.).
There is another option to use the value in the first column instead:
--output=advanced.json --splitToFiles=true --suffixById=true
Because our project "advanced" contains duplicates in the first column "email", this command will overwrite files (e.g. advanced_melanie.white@example2.edu.json).
When using this option, the first column should contain unique identifiers.
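Before running with --suffixById=true it can help to check whether the identifiers would produce colliding filenames. The sketch below assumes the naming pattern implied by the examples above (base name, underscore, identifier, extension); `split_filenames` and `colliding` are hypothetical helpers, not part of the client.

```python
import os
from collections import Counter


def split_filenames(output, ids):
    """Derive one filename per row following the pattern <base>_<id><ext>."""
    base, ext = os.path.splitext(output)
    return ["%s_%s%s" % (base, row_id, ext) for row_id in ids]


def colliding(filenames):
    """Return filenames occurring more than once (later rows would overwrite earlier ones)."""
    return sorted(name for name, n in Counter(filenames).items() if n > 1)
```

Passing row numbers as `ids` reproduces the default naming; passing the values of the first column shows which files would be overwritten.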
OpenRefine does not support appending rows to an existing project. Until that feature request is implemented, you can use the openrefine-client to script a workaround:
Here is an example that replaces the existing project:
openrefine-client --export myproject --output old.csv
openrefine-client --delete myproject
zip combined.zip old.csv new.csv
openrefine-client --create combined.zip --format csv --projectName myproject
Note that the project id will change. If you want to distinguish between old and new data, you can use the additional flag --includeFileSources:
openrefine-client --create combined.zip --format csv --projectName myproject --includeFileSources true
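The same workaround can be driven from Python, for example on systems without a `zip` command. This is a sketch only: `append_commands` and `zip_files` are hypothetical helpers, the client is assumed to be on the PATH as `openrefine-client`, and the flags are copied from the commands above.

```python
import zipfile


def append_commands(project, old_csv="old.csv", archive="combined.zip"):
    """Return the client invocations before and after the zip step."""
    client = "openrefine-client"  # assumption: installed and on the PATH
    before_zip = [
        [client, "--export", project, "--output", old_csv],
        [client, "--delete", project],
    ]
    after_zip = [
        [client, "--create", archive, "--format", "csv", "--projectName", project],
    ]
    return before_zip, after_zip


def zip_files(archive, paths):
    """Python stand-in for `zip combined.zip old.csv new.csv`."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path)
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; run the first group, call `zip_files("combined.zip", ["old.csv", "new.csv"])`, then run the second group.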
felixlohmeier/openrefine-client
docker pull felixlohmeier/openrefine-client:v0.3.10
Run client and mount current directory as workspace:
docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10
The docker option --network=host allows you to connect to a local or remote OpenRefine via the host network:
list projects on default URL (http://localhost:3333)
docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --list
list projects on a remote server (http://example.com)
docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H example.com -P 80 --list
Usage: same commands as explained above (see Basic Commands and Advanced Templating)
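Since every Docker invocation repeats the same options, a small wrapper can assemble the command. The following is a minimal sketch, assuming the image tag and mount options used in this README; `docker_client_command` is a hypothetical helper and not part of the client.

```python
import os


def docker_client_command(client_args, network="host", tag="v0.3.10", workdir=None):
    """Build the `docker run` argument list used in this README for the client image."""
    workdir = workdir or os.getcwd()  # mounted as /data inside the container
    return [
        "docker", "run", "--rm",
        "--network=%s" % network,
        "-v", "%s:/data:z" % workdir,
        "felixlohmeier/openrefine-client:%s" % tag,
    ] + list(client_args)
```

The returned list can be handed to `subprocess.run`, e.g. `docker_client_command(["--list"])` for the default URL.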
Run openrefine-client linked to a dockerized OpenRefine (felixlohmeier/openrefine):
Create docker network
docker network create openrefine
Run server (will be available at http://localhost:3333)
docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.5.0
Run client with some basic commands: 1. download example files, 2. create project from file, 3. list projects, 4. show metadata, 5. export to terminal, 6. apply transformation rules (deduplication), 7. export again to terminal, 8. export to xls file and 9. delete project
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --download "https://git.io/fj5hF" --output=duplicates.csv
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --download "https://git.io/fj5ju" --output=duplicates-deletion.json
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --create duplicates.csv
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --list
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --info "duplicates"
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --export "duplicates"
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --apply duplicates-deletion.json "duplicates"
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --export "duplicates"
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --export --output=deduped.xls "duplicates"
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --delete "duplicates"
Stop and delete server:
docker stop openrefine-server
docker rm openrefine-server
Delete docker network:
docker network rm openrefine
Customize OpenRefine server:
If you want to add an OpenRefine startup option you need to repeat the default commands (cf. Dockerfile):
-i 0.0.0.0 sets OpenRefine to be accessible from outside the container, i.e. from the host
-d /data sets the OpenRefine workspace
Example for allocating more memory to OpenRefine with the additional option -m 4G:
docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.5.0 -i 0.0.0.0 -d /data -m 4G
The OpenRefine version is defined by the docker tag.
Check the DockerHub repository for available tags.
Example for OpenRefine 2.8 with the same options as above:
docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G
If you want OpenRefine to read and write persistent data in a host directory (i.e. store projects) you can mount the container path /data. Example for host directory /home/felix/refine:
docker run -d -p 3333:3333 -v /home/felix/refine:/data:z --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G
See also:
felixlohmeier/openrefine
openrefine-client (requires Python 2.x)
python2 -m pip install openrefine-client --user
This will install the package openrefine-client containing modules in google.refine.
A command line script openrefine-client will also be installed.
openrefine-client --help
Usage: same commands as explained above (see Basic Commands and Advanced Templating)
Import module cli:
from google.refine import cli
Change URL (if necessary):
cli.refine.REFINE_HOST = 'localhost'
cli.refine.REFINE_PORT = '3333'
Help screen:
help(cli)
Commands:
download (e.g. example data):
cli.download('https://git.io/fj5hF','duplicates.csv')
cli.download('https://git.io/fj5ju','duplicates-deletion.json')
list projects:
cli.ls()
create project:
p1 = cli.create('duplicates.csv')
show metadata:
cli.info(p1.project_id)
apply rules from file to project:
cli.apply(p1.project_id, 'duplicates-deletion.json')
export project to terminal:
cli.export(p1.project_id)
export project to file in xls format:
cli.export(p1.project_id, 'deduped.xls')
export templating (see Advanced Templating above):
cli.templating(
p1.project_id,
prefix='''{ "events" : [
''',template=''' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''',
rowSeparator=''',
''',suffix='''
] }''')
delete project:
cli.delete(p1.project_id)
This fork can be used in the same way as the upstream Python client library.
Some functions in the Python client library are not yet compatible with OpenRefine >=3.0 (cf. issue #19 in refine-client-py).
Import module refine:
from google.refine import refine
Server Commands:
set up connection:
server1 = refine.Refine('http://localhost:3333')
show version:
server1.server.get_version()
server1.server.version
list projects:
server1.list_projects()
import json
print(json.dumps(server1.list_projects(), indent=1))
create project:
server1.new_project(project_file='duplicates.csv')
project1 = server1.new_project(project_file='duplicates.csv')
Project commands:
open project:
project1 = server1.open_project('1234567890123')
print full URL to project:
project1.project_url()
list columns:
project1.columns
compute text facet on first column (fails with OpenRefine >=3.2; requires the facet module):
from google.refine import facet
project1.compute_facets(facet.TextFacet(project1.columns[0]))
facets = project1.compute_facets(facet.TextFacet(project1.columns[0])).facets[0]
for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True):
    print(facets.choices[k].count, k)
compute clusters on first column:
project1.compute_clusters(project1.columns[0])
apply rules from file to project:
project1.apply_operations('duplicates-deletion.json')
export project:
project1.export(export_format='tsv')
print(project1.export(export_format='tsv').read())
with open('export.tsv', 'wb') as f:
    f.write(project1.export(export_format='tsv').read())
templating export (function was added in this fork, see Advanced Templating above):
data = project1.export_templating(
prefix='''{ "events" : [
''',template=''' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''',
rowSeparator=''',
''',suffix='''
] }''')
print(data.read())
print help screen with available commands (many more!):
help(project1)
example for custom commands:
project1.do_json('get-rows')['total']
delete project:
project1.delete()
See also:
If you would like to contribute to the Python client library please consider a pull request to the upstream repository refine-client-py.
Ensure you have OpenRefine running (e.g. available at http://localhost:3333). If necessary, set the environment variables OPENREFINE_HOST and OPENREFINE_PORT to change the URL.
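The effect of the two environment variables can be pictured as follows. This sketch mirrors the idea only and is not the library's code: `refine_url` is a hypothetical helper, and the defaults (localhost and 3333) are assumed from the URL mentioned above.

```python
import os


def refine_url(env=None):
    """Build the server URL from OPENREFINE_HOST / OPENREFINE_PORT."""
    env = os.environ if env is None else env
    # Assumed defaults, taken from the URL used throughout this README.
    host = env.get("OPENREFINE_HOST", "localhost")
    port = env.get("OPENREFINE_PORT", "3333")
    return "http://%s:%s" % (host, port)
```

Calling `refine_url()` without arguments reads the variables of the current process.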
The Python client library includes several unit tests.
run all tests
python setup.py test
run subset test_facet
python setup.py test --test-suite tests.test_facet
There is also a script that uses docker images to run the unit tests with different versions of OpenRefine.
run tests on all OpenRefine versions (from 2.0 up to 3.5.0)
./tests.sh -a
run tests on tag 3.5.0
./tests.sh -t 3.5.0
run tests on tag 3.5.0 interactively (pause before and after tests)
./tests.sh -t 3.5.0 -i
run tests on tags 3.5.0 and 2.7
./tests.sh -t 3.5.0 -t 2.7
For Linux there are also functional tests for all command line options.
run all functional tests on OpenRefine 3.5.0
./tests-cli.sh 3.5.0
run all functional tests on OpenRefine 3.5.0 with one-file-executable
./tests-cli.sh 3.5.0 openrefine-client_0-3-7_linux
Note to myself: When releasing a new version...
Run functional tests
for v in 2.7 2.8 3.0 3.1 3.2 3.3 3.4 3.4.1 3.5.0; do
./tests-cli.sh $v
done
Make final changes in Git
Build executables with PyInstaller
Run PyInstaller in Python 2 environments on native Windows, macOS and Linux. Should be "the oldest version of the OS you need to support"! Current release is built with:
One-file-executables will be available in dist/.
git clone https://github.com/opencultureconsulting/openrefine-client.git
cd openrefine-client
python2 -m pip install pyinstaller --user
python2 -m pip install urllib2_file --user
python2 -m PyInstaller --onefile refine.py --hidden-import google.refine.__main__
Run functional tests with Linux executable
for v in 2.7 2.8 3.0 3.1 3.2 3.3 3.4 3.4.1 3.5.0; do
./tests-cli.sh $v openrefine-client_0-3-7_linux
done
Create release in GitHub
Build package and upload to PyPI
python3 setup.py sdist bdist_wheel
python3 -m twine upload dist/*
Update Docker container
Bump openrefine-client version in related projects
Paul Makepeace, author
David Huynh, initial cut
Artfinder, inspiration
Felix Lohmeier, extended the CLI features
Some data used in the test suite has been taken from publicly available sources:
louisiana-elected-officials.csv: from http://www.sos.louisiana.gov/tabid/136/Default.aspx
us_economic_assistance.csv: "The Green Book"
eli-lilly.csv: ProPublica's "Docs for Dollars", leading to a Lilly Faculty PDF processed by David Huynh's ScraperWiki script