pofatu / pofatu-data

Data curation for the Pofatu database
https://pofatu.clld.org
7 stars 0 forks source link

curation of dist data #13

Closed tupuni closed 4 years ago

tupuni commented 4 years ago

-- 1 Missing fields in dist/measurements.csv (location in raw xlsx file in parenthesis)

These and all other "method" metadata were originally put in a separate spreadsheet because there can be more than 1 reference sample, fractionation correction, normalization for 1 parameter and 1 method id. A solution would be to include create a separate table as dist/methods.csv

-- 2 Field name modifications in dist/samples.csv : "name" --->  "sample_name" "category" --->  "sample_category" "comment" --->  "sample_comment" "location_loc1" ---> "location_region" "location_loc2" ---> "location_subregion" "location_loc3" ---> "location_locality"

in dist/measurements.csv : "precision" ---> "value_sd" "sigma" ---> "sd_sigma"

xrotwang commented 4 years ago

@Tupuni as far as I can see, a lot of these columns are empty - should they still be included in dist?

tupuni commented 4 years ago

@xrotwang that is true, especially because we still missing the most advanced methodological metadata, but a lot of these will used in the near future (within the next year of so).

xrotwang commented 4 years ago

I also just found out that there are some rows with same method ID and parameter but different Instrument. Is this an error, or is the instrument also reference sample dependent?

tupuni commented 4 years ago

This must be an error, as the method ID is partly defined by Instrument.

xrotwang commented 4 years ago

The same goes for the analysis date - which must be an error, I'd say. It's only about 30 rows or so, and it's the same rows as the ones with the different insturments.

tupuni commented 4 years ago

Ok I need to fix that

xrotwang commented 4 years ago
--> Kneebone-2018-MA_A Mn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Mn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Nb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Nb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Rb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Rb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Sr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Sr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Th
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Th
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Y
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Y
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
0it [00:00, ?it/s]--> Kneebone-2018-MA_A Mn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Mn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Nb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Nb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Rb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Rb
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Sr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Sr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Th
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Th
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Y
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Y
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zn
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
--> Kneebone-2018-MA_A Zr
date | 2017.0 | 2016.0
instrument | Bruker Tracer III SD | Bruker AXS or Innov-X Alpha
tupuni commented 4 years ago

fixed

xrotwang commented 4 years ago

@Tupuni I propose to leave the fractionation data out for this release - since ot would only add empty columns. Will fix the others.

tupuni commented 4 years ago

@xrotwang ok let's add them when we have content

xrotwang commented 4 years ago

I added tables for reference samples and normalizations now. And to make the data model more consistent, I also factored methodological metadata out into a separate table. (It was inlined in the measurements table for convenience). While this adds a bit of complexity when you want to retrieve this data, it also makes for more concistency and a lot smaller file sizes.

The one big example query we have stays the same, and I'm going to add another example for querying methodological metadata for a particular sample:

select
  distinct
  s.sample_name, s.artefact_id,
  d.parameter, d.value_string,
  coalesce(m.analyst, 'NA'),
  coalesce(n.reference_sample_name, 'NA') as 'normalization_reference_sample',
  coalesce(n.citation, 'NA') as 'normalization_citation'
from
  "samples.csv" as s
  join "measurements.csv" as d on s.id = d.sample_id
  left outer join "methods.csv" as m on m.id = d.method_id
  left outer join "methods_normalizations.csv" as mn on mn.method_id = m.id
  left outer join "normalizations.csv" as n on mn.normalization_id = n.id
where
  s.id = 'Hermann-2017-JASR_At3-229-064'
;
xrotwang commented 4 years ago

The above query results in

sample_name,artefact_id,parameter,value_string,"coalesce(m.analyst, 'NA')",normalization_reference_sample,normalization_citation
...
At3-229-064,TUP-AT3-229,"Dy [ppm]",5.7,"Liorzou, Céline",NA,NA
At3-229-064,TUP-AT3-229,"Er [ppm]",2.8,"Liorzou, Céline",NA,NA
At3-229-064,TUP-AT3-229,"Yb [ppm]",2.0,"Liorzou, Céline",NA,NA
At3-229-064,TUP-AT3-229,"Th [ppm]",5.0,"Liorzou, Céline",NA,NA
At3-229-064,TUP-AT3-229,Sr87_Sr86,"0.702804±8e-06 2σ","Chauvel, Catherine; Sauzéat, Lucie",NA,NA
At3-229-064,TUP-AT3-229,Pb206_Pb204,"20.849898±0.000954 2σ","Chauvel, Catherine; Sauzéat, Lucie",NA,NA
At3-229-064,TUP-AT3-229,Pb207_Pb204,"15.730549±0.000679 2σ","Chauvel, Catherine; Sauzéat, Lucie",NA,NA
At3-229-064,TUP-AT3-229,Pb208_Pb204,"40.126541±0.00202 2σ","Chauvel, Catherine; Sauzéat, Lucie",NA,NA
At3-229-064,TUP-AT3-229,"Age [Ma]",20.03,"Guillou, Hervé",ACs,"Nomade et al., 2005"
tupuni commented 4 years ago

Ok, it would make more sense to showcase the query of the following fields (since reference sample is vastly used) :

Note: These measurements are linked to a method id, not a specific sample id or artefact id.

Note: There are 3 fields with ref sample names in the methods and they should probably be explicitly distinct to make the query transparent :

xrotwang commented 4 years ago

ok, so

select
  distinct
  s.sample_name,
  m.code,
  m.parameter,
  coalesce(r.sample_name, 'NA') as 'Reference sample name (international standard)',
  coalesce(r.sample_measured_value, 'NA') as 'Measured value',
  coalesce(r.uncertainty, 'NA') as 'SD',
  coalesce(r.uncertainty_unit, 'NA') as 'SD Unit'
from
  "samples.csv" as s
  join "measurements.csv" as d on s.id = d.sample_id
  join "methods.csv" as m on m.id = d.method_id
  left outer join "methods_reference_samples.csv" as mr on mr.method_id = m.id
  left outer join "reference_samples.csv" as r on mr.reference_sample_id = r.id
where
  s.id = 'Hermann-2017-JASR_At3-229-064'
;

and

sample_name,code,parameter,"Reference sample name (international standard)","Measured value",SD,"SD Unit"
...
At3-229-064,Hermann-2017-JASR_A,Dy,WSE,6.0,2.97959678408457,%
At3-229-064,Hermann-2017-JASR_A,Er,AC-E,15.93,2.5165074820158275,%
At3-229-064,Hermann-2017-JASR_A,Er,JB-2,2.6,21.838794290669927,%
At3-229-064,Hermann-2017-JASR_A,Er,WSE,3.02,21.17731217801253,%
At3-229-064,Hermann-2017-JASR_A,Yb,AC-E,15.66,1.4079126317055592,%
At3-229-064,Hermann-2017-JASR_A,Yb,JB-2,2.62,2.7650232098860696,%
At3-229-064,Hermann-2017-JASR_A,Yb,WSE,2.48,1.1237823318988296,%
At3-229-064,Hermann-2017-JASR_A,Th,AC-E,16.65,2.755220355842639,%
At3-229-064,Hermann-2017-JASR_A,Th,JB-2,<LD,0.0,%
At3-229-064,Hermann-2017-JASR_A,Th,WSE,3.0,15.028249662053367,%
At3-229-064,Hermann-2017-JASR_B,Sr87_Sr86,"NBS 987",0.710248,8e-06,"2σ"
At3-229-064,Hermann-2017-JASR_C,Pb206_Pb204,"NBS 981",16.9415,0.0052,"2σ"
At3-229-064,Hermann-2017-JASR_C,Pb207_Pb204,"NBS 981",15.4924,0.002,"2σ"
At3-229-064,Hermann-2017-JASR_C,Pb208_Pb204,"NBS 981",36.701,0.007,"2σ"
At3-229-064,Hermann-2017-JASR_D,Age,NA,NA,NA,NA
tupuni commented 4 years ago

Ok thank you that's perfect