官方wiki—ETL最佳实践

wanghaisheng commented 9 years ago

ETL最佳实践

这篇文档记录了多年以来利用ETL工具将数据导成 OMOP Common Data Model (CDM)时的最佳实践. ETL的整个流程可以划分为四个步骤:

数据专家和CDM专家共同设计ETL
术语专家构建字典间的映射关系
技术专家实现ETL
以上步骤中的质控
1. 数据专家和CDM专家共同设计ETL

ETL的设计需要对原始数据有深入的了解，又要求懂CDM，过去从事过ETL工作的人员能够加快设计的速度。最理想的情况是数据专家和CDM专家能够同居一室进行一个为期1-2天的讨论。

目前我们提供2个辅助ETL设计的工具: White Rabbit 和 [Rabbit-in-a-Hat. White Rabbit](http://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:whiterabbit#rabbit-in-a-hat）

通常情况下，在设计ETL的时候，原始数据的文档是不够详细的，而且多数情况下实际的数据和文档也是存在出入的。在开始前，可以使用White Rabbit对原始数据库进行一个扫描：

1、原始数据库中的所有表
2、每个表中的所有字段
3、每个字段中的所有相异值 (唯一值和频率低的不包括在内)
4、每个值出现的频率

所有表和字段能够了解数据的结构, 值与其频率可以了解信息是如何记录的。 (e.g. 性别编码是 m 、f还是1、2 ), 包括任何可能的特殊值

Rabbit-in-a-Hat

数据专家和CDM专家一起讨论的时候，手头上要有原始数据库和CDM的信息，Rabbit-in-a-Hat 是一个交互式的界面，同时展示原始数据库和CDM。可以把不同的表、字段进行连接，给转换逻辑添加一些批注.

一开始最好先找到原始数据库中的表与CDM表之间的关系，然后定义不同字段与字段间的关系。另外不仅要记录最终的决定，也要记录这样设计的原因，以免后续重复进行讨论

一旦完成了 Rabbit-in-a-Hat 中的设计工作，就可以自动导出一个OMOP 模板的ETL文档。该文档中包含了实现ETL所需的所有信息。

2. 术语专家构建字典间的映射关系

通常原始数据库中的字典并不是OMOP的术语字典，我们需要将原始数据库中的字典与OMOP标化字典进行一一对应。映射工作耗时耗力，可以通过如下方式减轻工作量

Mapping can be a daunting task, especially when there are many codes in the source coding system. There are several things that can be done to make the task easier:

只关注最常用的那些编码。那些从来不用或者用了一两次的编码对于研究工作并无多大意义
充分利用已有信息，比如很多药品编码已经和ATC进行了对应。可以利用ATC与RxNorm之间的概念关系来推断RxNorm编码
使用术语映射工具Usagi.

Usagi是一款旨在辅助人工进行术语映射的工具，可以根据文本描述对可能的映射进行提示. 可以使用google翻译将其他语种翻译成英文。其中也可以自由检索合适的概念，可以对能够在ETL中使用映射关系进行审核。

3. 技术专家实现ETL

一旦完成了ETL设计和术语字典的映射关系，就可以借助软件工具来实现ETL过程。技术的选择依赖于开发人员。可以选择自己熟悉的，诸如SQL, SAS, C#, Java, and Kettle等，各有千秋。

(注意: 在进行了一番尝试之后，我们最终放弃了设计一种用户友好的ELT工具，不管怎么样，总是有20%作用的ETL底层编码需要根据不同的具体的数据结构来编写

4. 以上步骤中的质控

只有确保ETL的质量才能够使用这些转换后的数据，要评估CDM中数据的质量有多种方式，但很多情况下，问题不是出在ETL上，而是本身原始数据库中的数据就有问题。

在质控上的建议如下:

由第三方来对ETL 设计文档、代码、映射关系进行评审
手动比较某个人在原始数据库和CDM中的所有信息
对比原始数据库和CDM中数据的总数
在CDM版本上重现原始数据上完成的某个研究
使用 Achilles 工具.

Achilles是一款可供对患者的人口学信息、病情症状、药物、手术等信息进行描述统计分析的工具。可以很容易的定位数据质量问题，比如由于出生日期解析错误导致的诡异的年龄分布. 其中一个功能是Achilles Heel, 一个罗列了数据质量错误的报表工具，待解决的数据质量问题，需要关注的数据质量的警告提醒等。

wanghaisheng commented 9 years ago

关于如何从原始数据库中抽取数据到Observation_Period表中

http://forums.ohdsi.org/t/observation-period-table-how-do-you-generate-this-table-at-your-site/153/6

第一种提案 For claims, yes would just go with coverage period.

For a single EHR, it's about what is reasonable. If it is an inpatient based CDM, then pretty much would limit to hospitalizations plus 30. In outpatient, then your rules may vary based on demographics (e.g., older patients are seen more frequently, young females are seen more than young males, etc) but would give at least a year after last event.

If HIE, I would do the type of analysis we did at Regenstrief in terms of silent periods, then decide the right level of conservatism in terms of reasonable persistence of observable period.

wanghaisheng commented 9 years ago

I'm adding to this post to give the GL some food for thought, and get some insights if anyone else is mulling over this:

Given:
a. SAFTINet project stores both clinical and claims data, linked at the patient level, in the same OMOP instantiation.
b. It is possible for us to have only clinical data, but it is never possible for us to have only claims data.
c. We want to remain consistent with OMOP conventions to the greatest extent possible.

Question under consideration: What is the best way to create the observation_period dates to support use of Achilles and are other research analytic needs?

Proposal 1: Create both “clinically-derived” and “Insurer-enrollment-plan-derived” observation periods
a. Clinically-derived observation period –
i. Alternative 1a:
1. Obs_period_start_date = Date that clinical source data is first available from the “clinical data source” and
2. Obs_period_end_date = Date that clinical source data is last available from the “clinical data source”
a. Consequence-this would not allow us to consider evidence of care about which we are aware via claims data, such as a medication refill.
b. Requires that we know the source of the data. (We will have this information and we’ve added this field to all tables due to our use case of combined clinical and claims sources in a single instantiation)
c. When necessary, we could combine the use of the clinically-derived and claims-derived observation periods in an analysis, which offsets consequence above.
b. Insurer-enrollment-derived observation period=
i. Alternative 1b: (using OMOP convention)
1. start date = start date of enrollment (payment plan period start date) and
2. end date = end date of enrollment (payment plan period end date)
c. This proposal would violate this CDMV5 convention : “each person can have more than one valid observation-period record, no tow observation periods can overlap in time for a given person”

Proposal 2: Use a “combined clinical and claims data”’ observation period
a. Obs_period_start_date (~ end_date) – first (or last) date of any “clinical or claims” data being recorded, but not including payment plan period dates.
b. This essentially mirrors the concept of the clinically-derived observation period but allows us to use all the information available -- ‘I only hear it if it happens in within range’ but includes claims signals in addition to clinical signals as ‘in range’. This is in contrast to the Insurer-enrollment-derived obs_period – that is used for the assumption of “I’m listening from X time to Y time, and if I don’t hear it (a claim) nothing happened”
c. This proposal does not violate the CDM V5 convention – 3c, above

Proposal 3: use enrollment/payment plan period (3b above)-
a. I don’t have this information for all persons in a dataset, some patients have only clinical, some have clinical and claims, so I would only have observation periods on some persons. I also only have Medicaid claims, so if a person changes from Medicaid to private insurance- I have payer data for some period but all periods.

Proposal 4: use clinical when only clinical data is available and claims when (clinical &claims) is available – would not work- see 5a for reasoning.

For our use case of combining clinical and claims data –I’m leaning towards Proposal 2, as it best represents why we combined clinical and claims data from a research perspective. But I’m wondering if others in the community have thought this through and if/how this decision impacts the use of Achilles.

Thanks, Lisa Schilling Reply as linked Topic [Vojtech_Huser] Vojtech_Huser Dec '14

wanghaisheng commented 9 years ago

, here is how the table is generated for GE dataset in IMEDS lab. A 12M windows is used. So we have 1M, 12M and 36M used in the community.

PASTE FROM THE ETL DOCUMENTATION

3.7 Table Name: OBSERVATION_PERIOD Person status during an Observation period in the GE EHR data is captured from the ACTIVITY_F table. Enrollment entries are consolidated by combining records that indicate continuous enrollment over a period of time start and end dates to cover the period. If enrollment data indicates a person’s coverage did not extend through the entire duration covered by claims data, then multiple Observation Periods are recorded to capture all periods of coverage with corresponding start and end dates. The consolidation is done through the following steps:

Records for each person are sorted in an ascending order of dtStart (Start Date).
Periods of continuous enrollment consolidated by combining records and recording the ACTIVITY_DATE for the first period as the Observation start date and end date for the last period as the Observation end date. If the time between the end of one enrollment period and the start of the next is 12 months or less, treat this as continuous enrollment.
No records are added to cover gaps in coverage.

wanghaisheng commented 9 years ago

对于我们的提示是 1、涉及医保的以医保的承保范围为准 2、不涉及医保的，以记录与记录之间的时间窗长度作为评判标准

wanghaisheng / OHDSI-Research