wanghaisheng commented 9 years ago

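1. Official website

Part 1: Who We Are

1. Mission

2. Vision

3. Objectives

4. Guiding principles

5. Areas of focus

6. FAQ

Part 2: Who We Serve

1. Data owners/custodians

2. Researchers

3. Healthcare providers

4. Patients and consumers

Part 3: Data Standardization

1. OMOP Common Data Model

2. CDM tutorials

3. Vocabulary resources

4. Building your own CDM

Part 4: Software Tools

1. ACHILLES: data characterization

2. HOMER: population-level estimation

3. PLATO: patient-level prediction

4. HERMES: vocabulary tool

5. HERCULES: quality reporting

6. WhiteRabbit: ETL design tool

Part 5: Resources

1. Wiki

2. Community forums

3. Mailing lists

4. Libraries

5. Publications

6. Presentations

7. Collaborator Opportunities

Part 6: Join the Journey

1. Workgroups

2. Research Network

3. Developer Network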

wanghaisheng commented 9 years ago

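2. Background

The Observational Health Data Sciences and Informatics program (OHDSI, pronounced "Odyssey") is a multi-stakeholder, interdisciplinary collaborative that aims to bring out the value of health data through large-scale analytics. All of its solutions are open source.

OHDSI has established an international network of researchers and observational health databases, with a central coordinating center housed at Columbia University.

Mission

Generate reliable evidence about the natural history of disease, healthcare delivery, and the effects of medical interventions through large-scale analysis of observational health databases, in support of population-level estimation and patient-level prediction.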

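Vision

OHDSI collaborators can access data on billions of patients to generate evidence about every aspect of healthcare. Patients, clinicians, and decision-makers around the world use OHDSI tools and evidence every day.

Objectives

Build a community for observational research that actively engages across disciplines (statistics, computer science, epidemiology, physics, informatics, life sciences) and across organizations (academia, healthcare providers, payers, the medical product industry, government agencies);
Develop and evaluate analytical methods that use observational health data to study the effects of medical interventions and to predict treatment outcomes for individuals, generating the empirical evidence needed to establish best practices for observational analysis;
Apply those best practices in the design and implementation of observational analysis systems for medical product risk identification, comparative effectiveness research, patient-level prediction, and healthcare quality improvement;
Generate evidence about disease natural history, healthcare delivery, and the effects of medical interventions to support clinical decision-making;
Educate students, practitioners, and consumers on the foundations of observational health data research.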

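Guiding principles
Evidence-based: research and development are driven by objective, empirical evidence.
Practical: go beyond methodological research to develop applied solutions and generate clinical evidence.
Comprehensive: generate reliable evidence for all medical interventions and all health outcomes.
Transparent: all solutions are open source and open, including source code, analysis results, and other evidence generated along the way.
Inclusive: encourage everyone (patients, healthcare organizations, clinicians, payers, government, industry, academia) to take part in all phases of research and development.
Secure: protect patient privacy and respect the interests of data holders.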

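Areas of focus

Data standardization: high-quality, efficient, and transparent observational research requires standardizing data structure, content, and analytical methods. OHDSI uses the OMOP Common Data Model to standardize data, and provides additional tools that help institutions implement an OMOP clinical data warehouse together with analysis tools suited to each institution.
Drug safety surveillance: the aim is to understand and assess the association between each drug and adverse reactions, mainly by building an international risk identification and analysis system that is open to the public, so that potential drug reactions can be detected proactively. We are currently developing large-scale analytics that can explore all drugs and all health outcomes in real time. We have also set up an open-source, open evidence repository: any person or institution that holds observational data can use the open-source tools, share their evidence, and learn from one another.
Comparative effectiveness research: people also want to know about alternative treatments and to compare their effects. We are developing open-source tools that generate this evidence from observational health data. Whereas the safety surveillance tools ask whether a drug causes a given outcome, these tools ask, for the same indication, how much better or worse a drug performs than its alternatives.
Personalized risk prediction: patient-level predictive models complement population-level estimation. Beyond the average treatment effect, they give an individualized prediction of whether an outcome will occur based on a person's own history, such as demographics, medical history, and prior health status, and they provide tools for communicating risk information to patients.
Data characterization: to learn from observational health data and obtain reliable evidence, you must understand the source data. Most data used for analysis were not collected for research: insurance claims are financial data, and health records/EHRs are produced in the course of care. Understanding where the data come from is therefore essential to interpreting them. We are developing tools for data quality assessment and database profiling, so that we know when a database can be used, when it cannot, and which data issues an analysis must take into account.
Quality improvement: healthcare systems continually try to improve the quality of care. We are developing open-source tools that make this easier by systematically applying quality measures to observational health data stored in the OMOP data model, and we are designing, developing, and evaluating new evidence-based quality measures.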

FAQ

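How are OMOP and OHDSI related? OHDSI carries forward OMOP's methodological research, has further evolved the OMOP data model, and brings together the OMOP researchers. OMOP was limited to methodological research, whereas OHDSI aims to develop tools and apply methods to solve real-world problems.
How does OHDSI differ from Mini-Sentinel? OHDSI is an open collaborative that anyone can join; it works on systems for risk identification and comparative effectiveness research, and on methods and tools for large-scale analysis of data. Mini-Sentinel is an FDA-sponsored project that uses databases from commercial insurers across the US to address safety questions about specific drugs of interest to the FDA.
Who runs OHDSI? OHDSI is a multi-institution collaborative, run under a per-project responsibility model.
Where does the funding come from? OHDSI projects are funded mainly by government agencies and industry sponsors.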

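Who we serve

Data owners/custodians

Insurers, healthcare providers, pharmaceutical companies, and academic institutions all want to mine gold from the data they already hold. OHDSI provides tools and methods for standardizing and analyzing that data.

Researchers

About 70 people at present: epidemiologists, statisticians, medical informaticists, computer scientists, and clinicians. There is a discussion forum.

Healthcare providers

Patients and consumers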

At some point in life we all become patients, and high-quality medical care is one of the most generic needs of mankind. Such care must be based on all the quantitative and hard evidence available about treatment options.

Today, not a lot of evidence is available to patients directly, and it is usually limited to describing the effects of treatments on a population. For example, we may be able to answer broad questions at a population level like “Does Drug A cause bleeding?” However, what if we could address that question at the patient level: “What is the likelihood that Drug A will cause me to bleed?” based on available known factors like medical history and health behaviors?

There isn’t an established way to generate this kind of evidence reliably. So, we at OHDSI are looking to help solve this problem and produce the individualized real-world evidence that any patient can use to get informed about his or her own medical situation.

wanghaisheng commented 9 years ago

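Data standardization

For collaborative, large-scale research and analysis, data standardization is critical: all data need to share the same format.

Data vary widely across institutions. They are collected for different purposes, stored in different formats, and coded with different vocabularies, so the same concept may be expressed in different ways.

OHDSI uses the OMOP CDM. It also provides tools and resources for converting your own database to the CDM, along with analysis tools built on top of the CDM.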
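
A minimal sketch of what "the same format" means in practice, assuming two hypothetical source systems that record the same kind of diagnosis differently. The field names (`pid`, `dx`, `MemberNo`, `DiagCode`, and so on) and the `ConditionRecord` shape are made up for illustration; they are not the actual OMOP CDM table definitions.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ConditionRecord:
    """Hypothetical common shape; NOT the real OMOP CDM schema."""
    person_id: str
    source_code: str
    source_vocabulary: str
    start_date: date


def from_hospital_a(row: dict) -> ConditionRecord:
    # Hypothetical system A: ISO dates, ICD-9-CM codes in field "dx".
    return ConditionRecord(
        person_id=str(row["pid"]),
        source_code=row["dx"],
        source_vocabulary="ICD-9-CM",
        start_date=date.fromisoformat(row["dx_date"]),
    )


def from_insurer_b(row: dict) -> ConditionRecord:
    # Hypothetical system B: US-style dates, ICD-10-CM codes in "DiagCode".
    return ConditionRecord(
        person_id=str(row["MemberNo"]),
        source_code=row["DiagCode"],
        source_vocabulary="ICD-10-CM",
        start_date=datetime.strptime(row["ServiceDate"], "%m/%d/%Y").date(),
    )


records = [
    from_hospital_a({"pid": 17, "dx": "250.00", "dx_date": "2014-03-02"}),
    from_insurer_b({"MemberNo": "B-9", "DiagCode": "E11.9", "ServiceDate": "03/05/2014"}),
]

# Once both sources share one shape, a single analysis routine covers them.
earliest = min(r.start_date for r in records)
```

The point of the sketch is only that each source needs its own converter, while everything downstream is written once against the shared shape.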

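1. OMOP Common Data Model

(Figure: diagram of the common data model.)

A common data model is the foundation for systematic analysis across heterogeneous databases. The Observational Medical Outcomes Partnership (OMOP) CDM is now in its fifth version.

2. CDM tutorials

3. Vocabulary resources

Standardized vocabularies are a foundational tool. OMOP vocabularies: http://omop.org/vocabularies (see the table below)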


| Domain | Type | Vocabulary | Restricted |
|---|---|---|---|
| Demographic | Standard terminology | HL7 Administrative Sex | |
| | | OMB Ethnicity | |
| | | CDC Race | |
| Drug | Standard terminology | RxNorm | |
| | | WHO ATC | |
| | | VA Class | |
| | | NDF-RT | |
| | | FDB ETC | Yes |
| | Mapped coding scheme | Cerner Multum | |
| | | NDC | |
| | | FDA SPL | |
| | | FDB Drug Product | Yes |
| | | FDB Indication | Yes |
| | | Medi-Span GPI | Yes |
| | | Multilex | Yes |
| | | NLM MeSH | |
| | | VA Product | |
| Condition | Standard terminology, classification | SNOMED-CT | |
| | | MedDRA | Yes |
| | Mapped coding scheme | ICD-10-CM | |
| | | ICD-9-CM | |
| | | OXMIS | |
| | | Read | |
| Procedure | Standard classification | SNOMED-CT | |
| | Standard terminology | ICD-9-Procedure | |
| | | HCPCS | |
| | | CPT-4 | Yes |
| | Mapped coding scheme | ICD-10-PCS | |
| Cohort | Analysis | SMQ | Yes |
| | | OMOP DOI | |
| | | OMOP HOI | |
| Observation | Standard terminology, classification | SNOMED-CT | |
| | | LOINC | |
| | | UCUM | |
| | Standard classification | LOINC Multidimensional Classification | |
| Provider | Standard terminology | NUCC | |
| | | CMS Specialty | |
| Visit | Standard terminology | OMOP Visit | |
| | | CMS Place of Service | |
| Cost | Standard classification | MDC | |
| | Standard terminology | Revenue Code | |
| | | DRG | |
| | | APC | |
| Concept Type | Standard terminology | OMOP Condition Occurrence Type | |
| | | OMOP Procedure Occurrence Type | |
| | | OMOP Observation Type | |
| | | OMOP Drug Exposure Type | |
| | | OMOP Death Type | |

Query vocabulary: http://vocabqueries.omop.org/
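
The distinction between "standard terminology" and "mapped coding scheme" in the table can be illustrated with a small lookup. This is a sketch under stated assumptions: the mapping table is hand-written, and the target label `CONCEPT_TYPE_2_DIABETES` is a placeholder, not a real identifier from the OMOP vocabulary.

```python
from typing import List, Optional, Tuple

# Hypothetical mapping: (source vocabulary, source code) -> standard concept.
# The target label is a placeholder, NOT a real OMOP/SNOMED-CT identifier.
SOURCE_TO_STANDARD = {
    ("ICD-9-CM", "250.00"): ("SNOMED-CT", "CONCEPT_TYPE_2_DIABETES"),
    ("ICD-10-CM", "E11.9"): ("SNOMED-CT", "CONCEPT_TYPE_2_DIABETES"),
}


def to_standard(vocabulary: str, code: str) -> Optional[Tuple[str, str]]:
    """Return the standard (vocabulary, concept) pair, or None if unmapped."""
    return SOURCE_TO_STANDARD.get((vocabulary, code))


def unmapped(codes: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collect source codes that have no mapping so they can be reviewed."""
    return [c for c in codes if to_standard(*c) is None]


seen = [("ICD-9-CM", "250.00"), ("ICD-10-CM", "E11.9"), ("ICD-9-CM", "999.99")]
needs_review = unmapped(seen)
```

Two codes from different mapped coding schemes resolve to one standard concept, which is what lets an analysis written against the standard terminology run on either source.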

4. Building your own CDM

Prerequisites:

Understanding of the source data, how it was originally captured, and its role in healthcare operations
Understanding of the medical principles behind drugs and diseases
Expertise in epidemiology, pharmacovigilance, health economics and outcomes research
Command of advanced statistical techniques for large-scale modeling and exploratory analysis
Informatics experience with ontology management and leveraging standard terminologies for analysis
Technical/programming skills to implement design and develop a scalable solution

Steps:

Train on the OMOP CDM and Vocabulary
Discuss which analyses will be run once the CDM is in place
Assess technical requirements
Discuss the data dictionary and the raw database
Scan the source database
Draft the initial business logic: a. table level b. field level c. vocabulary level d. data lost during record conversion
Create a data sample for initial analysis
Implement only after the design is complete

Lessons learned:

A successful ETL requires a village; don't make one person try to be the hero and do it all themselves (team design, team implementation, team testing)
Document early and often; the more details the better
Data quality checking is required at every step of the process
Don't make assumptions about source data based on documentation; verify by looking at the data
Good design and comprehensive specifications should save unnecessary iterations and thrash during implementation
ETL design/documentation/implementation is a living process. It will never be done and it can always be better. But don't let the perfect be the enemy of the good

5. OHDSI discussion group

wanghaisheng commented 9 years ago

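Analytics tools

(Figure: overview diagram of the tools.)

These tools make full use of advanced visualization, analytical methods, and interactive data exploration. All of the code is published on GitHub.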

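1. ACHILLES: data characterization

Statistical analysis of an OMOP CDM v4 database. The software was released at the 2014 EDM Forum in San Diego. See the Demo link for a demonstration.

ACHILLES supports characterization, quality assessment, and visualization of a database. It lets users interactively assess patient demographics, the prevalence of conditions, drugs, and procedures, and the distribution of values for clinical observations.

Anyone who holds patient-level data can deploy ACHILLES locally.

ACHILLES has two components. The first is an R package that runs locally and does not disclose any personal information. It requires data in the OMOP Common Data Model format, and it generates and exports summary statistics describing the quality and content of the patient-level database. The second is an HTML5/JavaScript front end that provides interactive reports for visualizing and exploring those statistics. A single front end can serve multiple back-end databases.

Part 1: R package that generates the summary statistics (https://github.com/OHDSI/Achilles)

Part 2: web interface for visualizing the summary statistics (https://github.com/OHDSI/AchillesWeb)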
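
The two-component split (compute summary statistics locally, then hand only the aggregates to a front end) can be sketched as follows. This is an illustration in Python, not the Achilles R package or its output format; the `persons` rows and the `summarize` function are hypothetical.

```python
from collections import Counter

# Hypothetical patient-level rows; in this sketch they never leave the machine.
persons = [
    {"person_id": 1, "gender": "F", "year_of_birth": 1950},
    {"person_id": 2, "gender": "M", "year_of_birth": 1962},
    {"person_id": 3, "gender": "F", "year_of_birth": 1958},
    {"person_id": 4, "gender": "F", "year_of_birth": 1987},
]


def summarize(rows):
    """Reduce patient-level rows to aggregate counts that carry no identifiers."""
    by_gender = Counter(r["gender"] for r in rows)
    by_decade = Counter(r["year_of_birth"] // 10 * 10 for r in rows)
    return {
        "person_count": len(rows),
        "by_gender": dict(by_gender),
        "by_birth_decade": dict(sorted(by_decade.items())),
    }


# Only this aggregate dictionary would be exported for visualization.
summary = summarize(persons)
```

Because the exported object holds counts rather than rows, one front end can display summaries from several databases without ever receiving patient-level data.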

2. HOMER: population-level estimation

Observational health data such as electronic health records and insurance claims can bring enormous value to research on health, disease, and drugs. The current paradigm for large-scale analysis in healthcare is still mostly episodic in nature: a researcher poses a hypothesis about some association (for example, between a drug and an outcome), designs an observational analysis to test it, runs the analysis on an available observational database, happens to get p<0.05 and a statistically significant difference, and then tries to disseminate the finding through peer-reviewed publications and at conferences. In general, causal hypothesis testing concentrates on producing an unbiased estimate of the strength of association and on deciding whether the relative risk metric is sufficient to reject the null hypothesis of no effect. This paradigm raises several problems:

The current research process is too inefficient: evidence is generated to support one hypothesis at a time, and the number of questions about disease and medical products that patients and providers deserve reliable evidence about is growing at a pace that outstrips the output of the entire research enterprise. For example, across all pharmaceutical drugs and all health outcomes of interest, only 4% of combinations have evidence in the published literature from randomized clinical trials or observational studies; 96% of the potential questions remain unasked and therefore unanswered.
The evidence produced is unreliable: estimates of strength of association from observational database analyses are subject to systematic error which bedevils the field of epidemiology. Repeated examples illustrate the challenge in proper analyses, as different research groups attempting to answer the same question on the same data generate conflicting results (such as bisphosphonate-esophageal cancer, pioglitazone-bladder cancer), and findings across observational databases on the same issue fail to replicate (such as fluoroquinolone-retinal detachment, dabigatran-bleeding). Issues of data source heterogeneity and method parameter sensitivity make it critically important to explore multiple databases and multiple analysis choices when addressing a particular product-outcome association, but conducting multiple large-scale analyses across disparate sources and synthesizing results across the analyses is difficult.
The evidence is insufficient to settle questions of causality: most observational database analyses provide estimates of strength of association, and when statistically significant findings are observed, offer post-hoc rationalizations for biologic plausibility. Austin Bradford Hill outlined in 1965 [14] many facets that bear consideration when considering a causal effect, including strength of association, plausibility, consistency, temporality, biologic gradient, analogy, specificity, experiment, and coherence. These viewpoints have been applied to specific pharmacovigilance analyses [27], but have not been consistently adopted or systematically applied in the context of observational data studies. An open opportunity for the novel use of observational data involves developing exploratory analyses for each of these causal dimensions, as well as some novel dimensions, to strengthen the interpretation of any purported effect. In addition we propose to develop quantitative metrics associated with each of these dimensions.
The data used and the results obtained are static: patient-level data are summarized in a series of statistics that populate tables in a manuscript. The level of detail provided about the underlying data and analysis methods applied to the data is often not sufficiently transparent to evaluate the integrity of the study, and because the patient-level data are not publicly available, the analyses are often not reproducible. Yet, most study results stimulate more questions than they answer. For example, if we find that dabigatran causes bleeding, the community will immediately want to go further to ask: is the effect observed for all indications of the treatment or for all patient subgroups within each indication? Do other anticoagulants have similar effects? If the drug causes gastrointestinal bleeding, are there other hemorrhagic conditions that it is also associated with? Do observed associations persist as observational data accumulate, health care delivery evolves, and the practice of medicine learns from prior work to develop interventions intended to maximize benefits and minimize risks of treatments?
Solving these problems requires an iterative approach to data analysis, one which facilitates exploration of summary results while protecting patient privacy, through coordination of an observational data network of disparate sources that provide timely access to current summary analysis results on an on-going basis.

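To address these problems, we will design, implement, and deploy the Health Outcomes and Medical Effectiveness Research (HOMER) system. HOMER is an interactive visualization platform that lets researchers explore associations across a network of observational databases. We will provide standardized large-scale analytics tools to extract summary statistics, and a web interface for exploring those statistics in real time.

Consider the four dimensions of big data. Volume: observational health databases keep growing, with multiple databases exceeding 100 million patients and containing billions of clinical observations. Variety: across the data network, the data generated by electronic health records and insurance claims processes are diverse. Veracity depends on our ability to interpret the data: whether we can turn them into accurate predictions of an individual's lived experience and unbiased estimates of a drug's effect in a population. Healthcare data offers substantial velocity, with clinical observations captured every day, and large-scale analyses are expected to be executed on a regular basis, if not real time, to ensure the timeliness of all evidence generated. However, the "big data" problem goes beyond the patient-level sources; for example, with 10,000s of medical interventions and 1000s of health outcomes of interest to patients, a comprehensive summary of all potential effects constitutes "big results"; if we estimate that 1000 summary statistics will be needed to properly characterize a single drug-outcome effect, then the result set to explore all drugs and all outcomes should be expected to exceed 10 billion.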
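
The "big results" estimate can be checked with a few lines of arithmetic. This sketch takes the lower end of each quoted range (10,000 interventions, 1,000 outcomes), which is an assumption; since the text says "10,000s" and "1000s", the real product is larger, which is why the result set is expected to exceed 10 billion.

```python
# Lower-bound figures read off the paragraph above (assumed, not measured).
interventions = 10_000        # "10,000s of medical interventions"
outcomes = 1_000              # "1000s of health outcomes"
stats_per_pair = 1_000        # summary statistics per drug-outcome effect

pairs = interventions * outcomes          # drug-outcome combinations
total_statistics = pairs * stats_per_pair

print(f"{pairs:,} pairs, {total_statistics:,} summary statistics")
```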

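Summary results at this scale therefore call for a new research approach. No one can manually review all of the information to discern potential effects, or weigh its accuracy to dispel spurious ones. What is needed is a large-scale exploration framework built on interactive visualization, in which researchers can filter, zoom, and pan through results and link them with orthogonal analysis components, to arrive at an evidence-based story of a drug's effects or to find the reasons behind an association. Large-scale results also make it possible to evaluate the reliability and performance of research methods at scale, providing further evidence on how to learn from these results and how much confidence they deserve at a given point in time.

The HOMER framework starts from Sir Austin Bradford Hill's causal considerations. We will develop a large-scale analytics solution for each component: strength of association, consistency, temporality, experiment, plausibility, coherence, biologic gradient, specificity, and analogy. Each component has two parts: a method for deriving summary statistics from patient-level databases, and a method for visualizing those statistics. In the tool, you can pick any drug and any outcome and explore all the evidence about that drug-outcome pair. Doing this at scale requires tools with high computational performance: able to handle datasets with millions of patients, to apply complex regularization strategies for confounding adjustment across millions of covariates, and to study millions of drug-outcome pairs.

All HOMER components are open source and open.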

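3. PLATO: patient-level prediction

Person-level assessment of treatment outcomes: predictive models that, given a patient's medical history, estimate the probability that the patient will experience a given outcome after receiving a given intervention.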

Patients seek medical care to diagnose and treat illness. Current medical practice relies on limited aggregate information for prognosis and prediction of a patient’s health. When predictive models are used in healthcare they draw on data from hundreds to thousands of patients and consider small numbers of patient characteristics, often five or fewer. This contrasts sharply with the reality of modern medicine wherein patients generate a rich digital trail, which is well beyond the power of any medical practitioner to fully assimilate. The recent emergence of massive patient-level databases of electronic health records and administrative claims opens up extraordinary opportunities for massive-scale, patient-specific predictive modeling. Such models can inform truly personalized medical care leading hopefully to sharply improved patient outcomes.

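We develop predictive models using longitudinal data accumulated over many years, covering 100 million patients and more than 5 billion clinical observations. The large population provides rich data for building effective predictive models, and lets the models serve more of the patients who need better care in a timely way. Studying the data effectively requires new methodology and interdisciplinary collaboration. We firmly believe that with OHDSI's combined expertise and access to data at this scale, wholehearted collaboration will lead to breakthroughs in this area.

The work centers on models and algorithms that derive clinically relevant predictive models from irregularly-spaced electronic health record data, on algorithms for large-scale multivariable modeling with this information, and on evaluating the accuracy of patient-level outcome predictions. Predictive modeling in databases containing data for upwards of 100 million patients presents non-trivial engineering challenges. Our team is nonetheless very familiar with these data and has already set up a purpose-built computing environment; the main goal is to establish a standardized process for developing accurate personalized predictive models.

The predictive models are based on the OMOP data model, and the tools are all open source. Person-Level Assessment of Treatment Outcomes (PLATO) will be an integrated framework to allow all users to use the library of predictive models developed to produce individualized risk for all medical interventions and all health outcomes of interest, based on personal demographics, medical history, and health behaviors.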
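
To make "individualized risk" concrete, here is a minimal sketch of how one already-fitted model might turn a patient's characteristics into a probability. It assumes a plain logistic model, which the text does not specify, and the feature names and coefficient values are invented for illustration; none of this is PLATO's actual method.

```python
import math

# Made-up coefficients for illustration only; NOT a fitted or published model.
INTERCEPT = -3.0
COEFFICIENTS = {
    "age_over_65": 0.8,
    "prior_bleed": 1.2,
    "on_anticoagulant": 0.5,
}


def predicted_risk(patient: dict) -> float:
    """Logistic model: probability of the outcome given 0/1 patient features."""
    score = INTERCEPT + sum(
        weight * float(patient.get(name, 0))
        for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


baseline = predicted_risk({})
higher = predicted_risk({"age_over_65": 1, "prior_bleed": 1})
```

The population-level question ("does Drug A cause bleeding?") corresponds to a single average effect, while `predicted_risk` returns a different number for each patient depending on their history.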

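4. HERMES: vocabulary tool

Health Entity Relationship and Metadata Exploration System (HERMES)

HERMES is a web tool for searching and browsing the vocabularies stored in the OMOP Common Data Model (CDM). It also supports vocabulary management and export. HERMES consists of an HTML/JavaScript front end and a back-end service that accesses the OMOP CDM vocabulary resources.

5. HERCULES: quality reporting

Health Enterprise Resource, Care, and Utilization Learning Exploration System (HERCULES): standardized descriptive reports that let a healthcare organization see where it can improve and compare itself against baselines from similar organizations.

HERCULES analyzes healthcare quality, cost, and patterns of medical practice, and is likewise based on the OMOP Common Data Model. It provides visualization tools and the ability to drill down into quality measures, and it can be applied across multiple patient cohorts.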

6. WhiteRabbit: ETL design tool

WhiteRabbit is a tool that helps institutions prepare the ETL that converts their data into the OMOP Common Data Model (CDM). Source data can be CSV files or a database (MySQL, SQL Server, Oracle, PostgreSQL); the CDM can be in any of MySQL, SQL Server, or PostgreSQL.

WhiteRabbit's core function is to scan the source data and provide detailed information about every table, field, and value. The scan produces a report that can serve as a reference when designing the ETL, for instance when using the Rabbit-In-a-Hat tool. Rabbit-In-a-Hat uses the scan document and displays source data information through a graphical user interface to allow a user to connect source data structure to the CDM data structure. The function of Rabbit-In-a-Hat is to generate documentation for the ETL process, not generate code to create an ETL.
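
The idea of a source scan (per-field fill rates and value frequencies) can be sketched in a few lines. This is an illustration of the concept only, not WhiteRabbit's implementation or its report format; `scan_table` and the sample CSV are hypothetical.

```python
import csv
import io
from collections import Counter


def scan_table(csv_text: str, max_values: int = 3) -> dict:
    """Profile each column: row count, fraction empty, most frequent values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counters = {name: Counter() for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counters:
            # Treat a missing trailing field the same as an empty one.
            counters[name][row.get(name) or ""] += 1
    report = {}
    for name, counter in counters.items():
        report[name] = {
            "rows": n_rows,
            "fraction_empty": counter.get("", 0) / n_rows if n_rows else 0.0,
            "top_values": counter.most_common(max_values),
        }
    return report


# Hypothetical source extract with one missing diagnosis code.
SAMPLE = "pid,sex,dx\n1,F,250.00\n2,M,\n3,F,250.00\n"
report = scan_table(SAMPLE)
```

A report like this shows which fields are sparsely filled and which values actually occur, which is the information needed before deciding how each source field maps to the CDM.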

Download WhiteRabbit: https://github.com/OHDSI/WhiteRabbit

wanghaisheng commented 9 years ago

Offline vocabulary resources, packaged for download: 1. Google Drive

wanghaisheng / OHDSI-Research

Official website: documentation summary #1
