pofatu / pofatu-data

Data curation for the Pofatu database
https://pofatu.clld.org
7 stars 0 forks source link

pofatu-data

This repository is used for curating the Pofatu dataset, which is browsable online at https://pofatu.clld.org and published on Zenodo.

Released versions of this repository provide analysis-friendly formats of the data in the dist directory, in particular:

See also

Hermann, A., Forkel, R., McAlister, A. et al. Pofatu, a curated and open-access database for geochemical sourcing of archaeological materials. Sci Data 7, 141 (2020). DOI: 10.1038/s41597-020-0485-8

In the following, we assume that:

Querying the CSVW data

While somewhat tedious, the CSVW data can be queried almost "manually", i.e. with very little tooling. In the following example we will use command line tools available in the csvkit package, but scientific computing environments like R (with R-Studio) or Python (with Pandas) will provide similar (or better) functionality.

From dist/metadata.json we learn that dist/samples.csv contains a column location_region, which specifies a rough region in which a sample was collected.

$ csvcut -c location_region dist/samples.csv | sort | uniq
AUSTRAL
...
VANUATU

If we are interested in samples from Vanuatu, we can then list IDs of these:

$ csvgrep -c location_region -m VANUATU dist/samples.csv | csvcut -c ID
ID
reepmeyer2008_ANU9001
reepmeyer2008_ANU9002
...

Once we have identified a sample we are interested in, we can list all measurements recorded in Pofatu about this sample:

$ csvcut -c Sample_ID,parameter,value_string dist/measurements.csv | csvgrep -c Sample_ID -m"reepmeyer2008_ANU9001" | csvcut -c parameter,value_string
parameter,value_string
SiO2 [%],70.98
TiO2 [%],0.36
Al2O3 [%],14.02
FeO [%],3.02
CaO [%],0.99
MgO [%],0.19
MnO [%],0.17
K2O [%],5.67
Na2O [%],4.27
P [ppm],285.3549225066772
Sc [ppm],9.78581002071119
Ti [ppm],1756.8461601308043
V [ppm],1.2513331781460335
...

Querying the SQLite data

Of course, querying interrelated data from multiple tables is a lot more convenient using a relational database. To retrieve all measurements for samples from Vanuatu at once, we only need to run the SQL query

select
    s.id, m.parameter, m.value_string 
from 
    "samples.csv" as s, "measurements.csv" as m 
where 
    m.sample_id = s.id and s.location_region == 'VANUATU';

We can do so using the SQLite command line program:

$ sqlite3 dist/pofatu.sqlite "select s.id, m.parameter, m.value_string from \"samples.csv\" as s, \"measurements.csv\" as m where m.sample_id = s.id and s.location_region == 'VANUATU'"
reepmeyer2008_ANU9001|SiO2 [%]|70.98
reepmeyer2008_ANU9001|TiO2 [%]|0.36
reepmeyer2008_ANU9001|Al2O3 [%]|14.02
reepmeyer2008_ANU9001|FeO [%]|3.02
reepmeyer2008_ANU9001|CaO [%]|0.99
reepmeyer2008_ANU9001|MgO [%]|0.19
reepmeyer2008_ANU9001|MnO [%]|0.17
reepmeyer2008_ANU9001|K2O [%]|5.67
reepmeyer2008_ANU9001|Na2O [%]|4.27
reepmeyer2008_ANU9001|P [ppm]|285.3549225066772
...

If you installed the Python package pypofatu, you can run the query using the pofatu query subcommand:

$ pofatu query "select s.id, m.parameter, m.value_string from \"samples.csv\" as s, \"measurements.csv\" as m where m.sample_id = s.id and s.location_region == 'VANUATU' limit 10"
INFO    SQLite database at dist/pofatu.sqlite
ID                         parameter      value_string
reepmeyer2008_ANU9001  SiO2 [%]              70.98
reepmeyer2008_ANU9001  TiO2 [%]               0.36
reepmeyer2008_ANU9001  Al2O3 [%]             14.02
reepmeyer2008_ANU9001  FeO [%]                3.02
reepmeyer2008_ANU9001  CaO [%]                0.99
reepmeyer2008_ANU9001  MgO [%]                0.19
reepmeyer2008_ANU9001  MnO [%]                0.17
reepmeyer2008_ANU9001  K2O [%]                5.67
reepmeyer2008_ANU9001  Na2O [%]               4.27
reepmeyer2008_ANU9001  P [ppm]              285.35

You can also explore the data using the datasette tool, which provides a user interface to browse the databases in your browser.

For a more real-life example of using Pofatu data, see the cookbook

Exporting the results of a query to CSV is simple. See the relevant section of the SQLite documentation for details.