This repository is used for curating the Pofatu dataset, which is browsable online at https://pofatu.clld.org and published on Zenodo.
Released versions of this repository
provide analysis-friendly formats of the data in the dist
directory, in particular:
See also
Hermann, A., Forkel, R., McAlister, A. et al. Pofatu, a curated and open-access database for geochemical sourcing of archaeological materials. Sci Data 7, 141 (2020). DOI: 10.1038/s41597-020-0485-8
In the following, we assume that:
git
,While somewhat tedious, the CSVW data can be queried almost "manually", i.e. with very little tooling. In the following example we will use command line tools available in the csvkit package, but scientific computing environments like R (with R-Studio) or Python (with Pandas) will provide similar (or better) functionality.
From dist/metadata.json
we learn that dist/samples.csv
contains a column location_region
,
which specifies a rough region in which a sample was collected.
$ csvcut -c location_region dist/samples.csv | sort | uniq
AUSTRAL
...
VANUATU
If we are interested in samples from Vanuatu, we can then list ID
s of these:
$ csvgrep -c location_region -m VANUATU dist/samples.csv | csvcut -c ID
ID
reepmeyer2008_ANU9001
reepmeyer2008_ANU9002
...
Once we have identified a sample we are interested in, we can list all measurements recorded in Pofatu about this sample:
$ csvcut -c Sample_ID,parameter,value_string dist/measurements.csv | csvgrep -c Sample_ID -m"reepmeyer2008_ANU9001" | csvcut -c parameter,value_string
parameter,value_string
SiO2 [%],70.98
TiO2 [%],0.36
Al2O3 [%],14.02
FeO [%],3.02
CaO [%],0.99
MgO [%],0.19
MnO [%],0.17
K2O [%],5.67
Na2O [%],4.27
P [ppm],285.3549225066772
Sc [ppm],9.78581002071119
Ti [ppm],1756.8461601308043
V [ppm],1.2513331781460335
...
Of course, querying interrelated data from multiple tables is a lot more convenient using a relational database. To retrieve all measurements for samples from Vanuatu at once, we only need to run the SQL query
select
s.id, m.parameter, m.value_string
from
"samples.csv" as s, "measurements.csv" as m
where
m.sample_id = s.id and s.location_region == 'VANUATU';
We can do so using the SQLite command line program:
$ sqlite3 dist/pofatu.sqlite "select s.id, m.parameter, m.value_string from \"samples.csv\" as s, \"measurements.csv\" as m where m.sample_id = s.id and s.location_region == 'VANUATU'"
reepmeyer2008_ANU9001|SiO2 [%]|70.98
reepmeyer2008_ANU9001|TiO2 [%]|0.36
reepmeyer2008_ANU9001|Al2O3 [%]|14.02
reepmeyer2008_ANU9001|FeO [%]|3.02
reepmeyer2008_ANU9001|CaO [%]|0.99
reepmeyer2008_ANU9001|MgO [%]|0.19
reepmeyer2008_ANU9001|MnO [%]|0.17
reepmeyer2008_ANU9001|K2O [%]|5.67
reepmeyer2008_ANU9001|Na2O [%]|4.27
reepmeyer2008_ANU9001|P [ppm]|285.3549225066772
...
If you installed the Python package pypofatu
, you can run
the query using the pofatu query
subcommand:
$ pofatu query "select s.id, m.parameter, m.value_string from \"samples.csv\" as s, \"measurements.csv\" as m where m.sample_id = s.id and s.location_region == 'VANUATU' limit 10"
INFO SQLite database at dist/pofatu.sqlite
ID parameter value_string
reepmeyer2008_ANU9001 SiO2 [%] 70.98
reepmeyer2008_ANU9001 TiO2 [%] 0.36
reepmeyer2008_ANU9001 Al2O3 [%] 14.02
reepmeyer2008_ANU9001 FeO [%] 3.02
reepmeyer2008_ANU9001 CaO [%] 0.99
reepmeyer2008_ANU9001 MgO [%] 0.19
reepmeyer2008_ANU9001 MnO [%] 0.17
reepmeyer2008_ANU9001 K2O [%] 5.67
reepmeyer2008_ANU9001 Na2O [%] 4.27
reepmeyer2008_ANU9001 P [ppm] 285.35
You can also explore the data using the datasette tool, which provides a user interface to browse the databases in your browser.
For a more real-life example of using Pofatu data, see the cookbook
Exporting the results of a query to CSV is simple. See the relevant section of the SQLite documentation for details.