micronutrientsupport / database-architecture

The Postgres database code for the MAPS tool
3 stars 0 forks source link

Biomarker data import format #142

Closed rbroth closed 2 years ago

rbroth commented 3 years ago

We want to agree on a format for importing biomarker data into the system. It should be csv

rbroth commented 3 years ago

@fannymawbey , questions about the dataset:

fannymawbey commented 3 years ago

Le jeu. 29 avr. 2021 à 10:19, Roman @.***> a écrit :

@fannymawbey https://github.com/fannymawbey , questions about the dataset:

  • what is the name? "Malawi DHS survey"?
  • Do you have information about the household that these people live in? We assume that every person surveyed lives in ahousehold and that sometimes multiple people live in the same household.
  • Has the data been adjusted for altitude and smoking?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/micronutrientsupport/database-architecture/issues/142#issuecomment-829075437, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7ZVATHRWOLGNMNGUF5QFDTLEQAHANCNFSM43VDYCGA .

fannymawbey commented 3 years ago

It's the result of malaria testing with a Rapid Test Kit. It measures antigens in the blood at the time of the test, but doesn't tell you when the person got infected.

There is another question that is self reported - 'Did you have malaria in the last 2 weeks?', but this is different and we try to avoid using this one if we actually have a test result.

Le jeu. 29 avr. 2021 à 10:35, Roman @.***> a écrit :

  • MALARIA_TEST_RESULT - does this mean that the person has had malaria in the last two weeks?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/micronutrientsupport/database-architecture/issues/142#issuecomment-829086037, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7ZVAXCLY442XMAJD4GJBLTLER6LANCNFSM43VDYCGA .

fannymawbey commented 3 years ago

No, we have to be specific. Sometimes they will collect serum, sometimes plasma, and we need to use the name of the one used by the survey. I think the idea that we had, as shown in the 'threshold' file, is that you will have a micronutrient 'folate' and then a matrix 'plasma, serum, red blood cell, whole blood, urine, breastmilk'.

Le jeu. 29 avr. 2021 à 10:46, Roman @.***> a écrit :

  • Can we treat "serum" and "plasma" as synonyms? For example, we store "serum folate" in our system, but the dataset lists "plasma folate"

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/micronutrientsupport/database-architecture/issues/142#issuecomment-829092931, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7ZVAWLHTFVMRSE7L5GIETTLETHJANCNFSM43VDYCGA .

rbroth commented 3 years ago
  • yes sometimes they live in the same household. We have variables that allow us to know that. Do you need that?

Yes, our data model assumes that we'll know what household a person belongs to. @spenny-liam is working on providing you with a template csv for importing data; that will have all the columns we want

andy-bevan commented 3 years ago

Hi @fannymawbey I'm looking at defining the import formats for the biomarker survey results data. We already have the survey mentioned above (Malawi MNS 2016). Do you have another example survey we can use?

fannymawbey commented 3 years ago

Ethiopia_National_MNS_report.pdf you can look at this one, as Ethiopia is also a priority country

rbroth commented 3 years ago

Do you have the data for the Ethiopia Survey?

andy-bevan commented 3 years ago

I was just going to ask the same thing. The PDF is obviously based on the raw data, but includes lots of charts and tables of aggregated data...

fannymawbey commented 3 years ago

Yes we do. I can send via email if that's ok?

Le mar. 8 juin 2021 à 11:59, andy-bevan @.***> a écrit :

I was just going to ask the same thing. The PDF is obviously based on the raw data, but includes lots of charts and tables of aggregated data...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/micronutrientsupport/database-architecture/issues/142#issuecomment-856664308, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7ZVARL5QV4UNU75ENI7GTTRXZXTANCNFSM43VDYCGA .

andy-bevan commented 3 years ago

Yep that's fine.

andy-bevan commented 3 years ago

This is the format we would need the data in as a CSV file - see comment below for latest |

rbroth commented 3 years ago

(Just editing the comment to indicate the allowed values for education)

andy-bevan commented 3 years ago

We would also need the metadata about the survey - see comment below

andy-bevan commented 3 years ago

And details about the household see comment below

andy-bevan commented 3 years ago

Thanks @fannymawbey I got the data csv. I may have a few questions...

andy-bevan commented 3 years ago

@fannymawbey for the Malawi data we have a dictionary to explain and translate some of the values. Do we have something similar for the Ethiopia data? For example RESIDENCE values are either 1 or 2 in the data, but what does that translate as?

fannymawbey commented 3 years ago

right, I'll send that over.

Le mer. 9 juin 2021 à 09:47, andy-bevan @.***> a écrit :

@fannymawbey https://github.com/fannymawbey for the Malawi data we have a dictionary to explain and translate some of the values. Do we have something similar for the Ethiopia data? For example RESIDENCE values are either 1 or 2 in the data, but what does that translate as?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/micronutrientsupport/database-architecture/issues/142#issuecomment-857510531, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7ZVAVN7PDWQOUTE3NKQBTTR4TCNANCNFSM43VDYCGA .

andy-bevan commented 3 years ago

@rbroth there are a bunch of boolean columns on the household_member table about the person (is_pregnant, is_smoking, is_ill etc). Then in the biomarker_measurement table there is 'was_fasting'. Just wondering if that should be moved to the hh_member table as it is an attribute of the person, rather than the measurement...?

rbroth commented 3 years ago

The original idea from those tables was that one person could have multiple measurements taken and they'd be reported seperately, so you could have your blood taken after fasting the day before, but you could then eat something and have a urine sample taken after lunch. But you're right, it might a good idea. @fannymawbey just to confirm, biomarker measurements data is always given in this "pre-aggregaed" state, where information about a person is collected at the same time that biomarker measurements are taken?

andy-bevan commented 3 years ago

The CLUSTER field in the data is arriving as a alphanumeric code but the 'survey_cluster' column in the household_member table is numeric. I guess we need to change the data_type in the column, unless I am missing something? @rroth - what do you think - does this deserve an issue?

andy-bevan commented 3 years ago

Following today's meeting, we have established that:

fannymawbey commented 3 years ago

Great, thanks Andy for noting these.

Attached Malawi and Ethiopia files. I added the metadata for Malawi in the excel, just let me know if that looks OK. I checked in Notepad and it looks like everything that should be double quoted is but it's not super easy to see in Notepad so please just let me know if something doesn't look right. thank you! I'll send by email as it looks like I can attach csv here.

andy-bevan commented 3 years ago

Latest format spec: (Note: all column headers in lowercase)

Column Header Data Type Values/Format Description
person_id text Unique identifier for the person
sex text Whether the household member is male or female
education_level text The highest level of education achieved by this person. Can be primary
group_id text PREG, WRA,PSC,SAC,MEN The group id of the person, usually depends on a combination of sex, age, pregnancy status
age_in_months integer The age of the person at the time of the survey
is_breastfed boolean 0,1,null Whether the child is being breastfed; "true" indicates that they are
is_lactating boolean 0,1,null Whether this woman is lactating; "true" indicates that she is
is_pregnant boolean 0,1,null Whether this woman is pregnant; "true" indicates that she is
is_smoker boolean 0,1,null Whether this person smokes tobacco
had_illness boolean 0,1,null Whether this person has been ill in the last two weeks
had_malaria boolean 0,1,null Whether this person has had malaria in the last two weeks
had_diarrhoea boolean 0,1,null Whether this person has had diarrhoea in the last two weeks
survey_cluster text The surveying cluster used for selecting this individual
survey_strata integer The surveying strata used for selecting this individual
survey_weight integer The surveying weight used for selecting this individual
literacy text Examples: 'fully literate', 'Able to read whole sentence', 'Cannot read at all'
weight_in_kg numeric Weight of person in kilograms
height_in_cm numeric Height of person in centimetres
bmi numeric body mass index - can be submitted or calculated from weight and height , if present
interview_date date dd/mm/yyyy Date when person was interviewed/sampled
date_sampled date dd/mm/yyyy The date when the sample was taken from the body
sample_type text Whether this is a urine, blood, or breastmilk sample
haemoglobin numeric The amount of haemoglbin (also known as hemoglobin) in the blood, in grams per litre.
ferritin numeric The amount of (serum or plasma) Ferritn found in the blood, in micrograms per litre
stfr numeric The amount of (serum or plasma) Soluble Transferrin Receptors (sTfR) in the blood, in milligrams per litre
rbp numeric The amount of (serum or plasma) Retinol Binding Protein (RBP) in the blood, in micromoles per litre
retinol numeric The amount of (serum or plasma) Retinol in the blood, in micromoles per litre
rbc_folate numeric The amount of red blood cell Folate in the blood, in nanomoles per litre
ps_folate numeric The amount of serum or plasma Folate in the blood, in nanomoles per litre
vitamin_b12 numeric The amount of (serum or plasma) Vitamin B12 (also known as cobalamin) in the blood, in picomoles per litre
zinc numeric The amount of (serum or plasma) zinc, in micrograms per decilitre
crp numeric The amount of (serum or plasma) C-reactive protein (CRP) in the blood, in milligrams per litre. A value >=5 defines inflammation
agp numeric The amount of (serum or plasma) Alpha 1 acid GlycoProtein (AGP) in the blood, in grams per litre. A value >=1 defines inflammation
iodine numeric The amount of Urinary Iodine, in micrograms per litre.
time_of_day_sampled text am or pm Whether the measurement was taken in the morning or afternoon (pre or post noon local time)
was_fasting boolean 0,1,null Whether the sample was taken while the patient was fasting
Household Details Note that household values must be identical per household id
household_id text Value to uniquely identify the household
urbanity text urban,rural,null Whether this household is located in an urban or rural area
household_expenditure numeric How much money this household spends per month in local currency
wealth_quintile integer 1,2,3,4,5 In what quintile of househol wealth this household is (5=richest, 1=poorest). If values are not identical for a given id then use the lowest value in all rows for that household id
altitude_in_metres numeric The altitude of the household above sealevel in metres. Used to adjust hemoglobin levels when assessing deficiency. If values are not identical for a given id then use the average value per household id
latitude numeric The latitude of the location of the household, or the centroid of the cluster of households this household belongs to. If values are not identical for a given id then use the average value per household id
longitude numeric The longitude of the location of the household, or the centroid of the cluster of households this household belongs to. If values are not identical for a given id then use the average value per household id
region text Region in which the household is situated
Survey Details (only top row needs to be populated)
survey_name text The name of the survey
publication_date text month year The month and year or date when the survey was published
surveying_date_start text month year The month and year or date when surveying and interviewing started
surveying_date_end text month year The month and year or date when no further interviews took place
geometry text The area covered by the survey; all households interviewed by the survey should fall into this area - geometry will be used to represent this area.
geonetwork_uuid text a UUID for Geonetwork. See https://geonetwork-opensource.org/ for further info
survey_weight_factor integer The factor survey weightings have been scaled by
notes text A generic text field for notes about a survey. Can include such as: Whether biomarker measurements were taken from blood plasma or blood serum
survey_type text biomarker' or 'food consumption' Type of survey
Source File Details (only top row needs to be populated)
file_notes text Notes about file preparation, considerations, assumptions etc. might include notes about which data were excluded from the file, and why
created_by text Name of the person or organisation responsible for creating the file
created_date text Date when file created
rbroth commented 3 years ago

Make sure to also update the files in the repo

andy-bevan commented 3 years ago

Hi @fannymawbey One of the new files has a column called 'had_fever'. Is that the same as 'had_illness' or is that a new parameter?

andy-bevan commented 2 years ago

The format was updated in a recent MR on the gitlab project so closing this issue.

https://kwvmxgit.ad.nerc.ac.uk/bmgf-maps/data/db-test-data/-/merge_requests/45