globaldothealth / outbreak-schema

Global.health Day Zero Outbreak schema

Location_information (Priority = High) #6

Open sadiekelly opened 11 months ago

sadiekelly commented 11 months ago

Minimum EpiCore variables: Neighbourhood of residence, Address of residence, Town or city of residence, Administrative level 4/3/2/1/0 of residence

The day0 schema collects location data based upon where the case was reported from.

The T0 toolkit indicates that location is based upon the subject's usual residence, which may not be available from G.h sources. If the subject's residence data is available, additional attributes may be required in the day0 schema.

Ensure that the report location and the residence location are not conflated when coding, as their meanings differ.

lauramerson commented 10 months ago

Agree. The day0 variable should be clearly defined as reporting location to differentiate the two. I expect G.h's DPIA will have significant limitations on subject residence data. Town/city and corresponding administrative levels could be added if thought relevant to a particular outbreak.

sadiekelly commented 8 months ago

The day0 schema should capture both location of report and location of residence as separate variables. Amend existing variable Location_information to Location_information_report and create new variable Location_information_residence.
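A minimal sketch of what that split could look like, assuming each location is a small record of administrative levels. Only the names `Location_information_report` and `Location_information_residence` come from this thread; the `Location` and `Case` classes and their fields are hypothetical and are not the actual day0 schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    # Hypothetical fields, one per administrative level.
    admin0: str                    # country
    admin1: Optional[str] = None   # province / state
    admin2: Optional[str] = None   # county / district
    admin3: Optional[str] = None   # city


@dataclass
class Case:
    # Where the case was reported from (assumed required in this sketch).
    Location_information_report: Location
    # Usual residence of the subject; optional because sources may not provide it.
    Location_information_residence: Optional[Location] = None


case = Case(
    Location_information_report=Location(admin0="France", admin1="Pays de la Loire")
)
```

Keeping the two as separate fields means a missing residence is recorded as missing rather than silently filled from the report location.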

sadiekelly commented 8 months ago

@aimeehan1 @JacqSauer @kelseytoups

aimeehan1 commented 8 months ago

@julianasopko

aimeehan1 commented 8 months ago

Here are some general thoughts to consider when capturing location.

We do capture specific location details, when provided, to Admin Boundary 3.

Admin 0 Country / 1 Province_State / 2 County_District / 3 City. City, or even more specific location information (like a hospital name), could also be captured in the Location_comment.

The DPIA risk assessment for location in the COVID data suggested: use as large a geographic location as possible (while keeping the value of the data point for the purposes of addressing the COVID-19 public health epidemic). Geographic locations are rolled up to the level of admin 3, in an effort to reduce the risk of re-identification.
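As an illustration of the roll-up idea only, the sketch below drops anything more specific than admin 3 from a location record. The dictionary keys (`admin0` to `admin3`, `neighbourhood`) and the `roll_up` helper are assumptions made for this example, not the curator portal's actual field names or logic.

```python
def roll_up(location: dict, max_level: int = 3) -> dict:
    """Keep only admin0..admin<max_level>; drop anything more specific."""
    allowed = {f"admin{i}" for i in range(max_level + 1)}
    return {key: value for key, value in location.items() if key in allowed}


# Hypothetical record containing a field finer than admin 3.
raw = {
    "admin0": "France",
    "admin1": "Pays de la Loire",
    "admin2": "Loire-Atlantique",
    "admin3": "Nantes",
    "neighbourhood": "example neighbourhood",
}
rolled = roll_up(raw)
```

Passing a smaller `max_level` would coarsen the record further where admin 3 is still judged too identifying.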

Our previous data entry method (GoogleSheets) for Location information was a free text entry, so we saw variation in the naming, spelling, capitalization, and punctuation of locations. We also saw variation in location spelling when we had to translate from another language; our default was to use the native language spelling as far as our file permitted. Despite best efforts for consistency, different curators entered data differently. For example, Curator A: Pays de la Loire vs Curator B: Pays-de-la-Loire. Data intended to refer to the same location would then be counted as different locations.
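To make the counting problem concrete, here is a small sketch of one possible way to compare free-text entries: build a comparison key that ignores case, hyphenation, punctuation, and extra spaces. The `location_key` helper is hypothetical and is not how the G.h pipeline or the curator portal works; it would also not catch genuine misspellings or translations.

```python
import re
import unicodedata
from collections import Counter


def location_key(name: str) -> str:
    """Comparison key: case-folded, with punctuation and whitespace runs collapsed."""
    text = unicodedata.normalize("NFKC", name).casefold()
    # Any run of non-word characters (hyphens, spaces, punctuation) becomes one space.
    text = re.sub(r"[^\w]+", " ", text)
    return text.strip()


entries = ["Pays de la Loire", "Pays-de-la-Loire", "pays de la loire "]

# Without a key, each spelling is tallied separately.
raw_counts = Counter(entries)
# With the key, the three entries collapse into one location.
keyed_counts = Counter(location_key(entry) for entry in entries)
```

The original spelling can still be stored for display; the key is only used for grouping and de-duplication.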

These variations created discrepancies and errors in our dataset. We tried to QC the data with spot-checks, as time permitted, and reconcile errors. The new curator portal is intended to prompt the user to type a location and then auto-populate the administrative boundaries to hopefully reduce these discrepancies. See ticket 155.

https://github.com/globaldothealth/monkeypox/issues/155

We also observed reporting challenges with Location Data for territories. Some countries began to include case data for their respective territories. News media may report and count territory cases separately. Ultimately, it is the responsibility of the curator to determine whether a country has included territory data in its cumulative counts, or whether these data are separate. For example, the U.K. did not include the territory of Gibraltar in its country totals, but France included its territories (Corsica, Guadeloupe, Martinique, and Saint Martin) in its cumulative counts. The G.h dataset separates territory data. Keeping track of these differences was very confusing and took a lot of curator effort.

aimeehan1 commented 8 months ago

The way sources would display location data differed by country, by report, and changed over time. There was no global standard.

Some weeks the data was displayed in a paragraph narrative, some weeks it was in a table, other weeks it was an (unlabeled) country map! A curator would have to compare the previous report(s) with the current report and calculate the delta in cumulative cases to get the number of incident cases for a specific location.
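The delta calculation itself is simple once the counts have been extracted; a rough sketch, assuming each report has already been reduced by hand to a mapping of location name to cumulative count. The `incident_cases` helper and the numbers are purely illustrative and are not taken from any real report.

```python
def incident_cases(previous: dict, current: dict) -> dict:
    """Per-location incident cases = current cumulative minus previous cumulative."""
    deltas = {}
    for location, total in current.items():
        # A location missing from the previous report is treated as starting from zero.
        deltas[location] = total - previous.get(location, 0)
    return deltas


# Illustrative cumulative counts from two consecutive reports.
previous_report = {"Region A": 10, "Region B": 4}
current_report = {"Region A": 13, "Region B": 4, "Region C": 2}

new_cases = incident_cases(previous_report, current_report)

# A negative delta would suggest the source revised its counts and needs manual review.
needs_review = [loc for loc, delta in new_cases.items() if delta < 0]
```

The hard part described above is upstream of this step: getting consistent location names and comparable counts out of narratives, tables, and maps.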

If case data was displayed on a map, a curator would have to Google a location map with administrative boundaries corresponding to the one being used by the source and identify the location names. We saw this in France, Brazil, and other countries. See Slack conversation: https://ghdsi.slack.com/archives/C0115SR6V6E/p1658939357668659

You'll also notice in France's report that they displayed cases on maps by Region of Residence and by Reporting Region, and were inconsistent in this format from week to week. It was a challenge to disentangle cases and make sure we weren't underreporting or double counting. Some weeks we'd have to skip an update and wait for the next report to be released to reconcile case or location differences.

sadiekelly commented 7 months ago

Thank you @aimeehan1! I definitely appreciate the difficulties in curation here! Region of residence was added to our schema as a possibility from the core set of epi variables, and I think it is important for us to capture this information so we can see the distance travelled to healthcare. It could also be important where a person is a resident of a country different from the one the case was reported from, to make it clear when reconciling case numbers from a country that a person from another country is included. Reporting region (to whatever granularity is possible from the specific report) would always be provided. Residence location (country/state/district/city) could be captured separately if available. What do you think?

sadiekelly commented 7 months ago

Residence location can be added to the schema but may not be frequently available, particularly when disentangling aggregated case reports. Residence location can be geolocated in the same way as report location.

aimeehan1 commented 7 months ago

@sadiekelly Sure, let's add a column to collect Residence location, with an understanding of its limitations and assumptions.

sadiekelly commented 2 weeks ago

Recommendation: Residence location can be added to the schema but may not be frequently available, particularly when disentangling aggregated case reports. Residence location can be geolocated in the same way as report location.