Challenges in data generation and harmonization

jaybee84 commented 4 years ago

add seminal work references
add description

allaway commented 4 years ago

Techniques to manage disparities in data generation are required to power robust analyses in rare diseases: the rarity of patients leads to heterogeneity in sample collection, which causes disparities in the data. We will discuss how rigorous normalization and methods that capture sample-wise, gene-set-level information can help integrate disparate data points appropriately to power machine learning approaches [11–13].

allaway commented 4 years ago

There's a lot to possibly talk about here, so let's break this down by data type:

Gene expression: assessing heterogeneity

MDS
PCA
tSNE
UMAP
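
As a rough illustration of the PCA option above, here is a minimal sketch that projects samples onto the first principal component and prints the scores next to batch labels. The expression values and batch labels are made-up toy data, and `first_principal_component` is a hypothetical helper written with the standard library only (power iteration); in practice one would more likely use an established PCA implementation.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (loading vector, per-sample scores) for the first PC.

    rows: list of samples, each a list of expression values over the same genes.
    Power iteration is applied to X^T X of the gene-centered matrix X.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(n_iter):
        # scores = X v, then v <- X^T scores, renormalized to unit length
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Toy data (assumption): 6 samples x 3 genes, two batches with shifted genes 1 and 2.
expr = [
    [5.0, 3.1, 7.9],
    [5.2, 2.9, 8.1],
    [4.9, 3.0, 8.0],
    [7.1, 5.0, 8.0],
    [6.9, 5.2, 7.9],
    [7.0, 4.9, 8.2],
]
batch = ["A", "A", "A", "B", "B", "B"]

loading, pc1 = first_principal_component(expr)
for label, score in zip(batch, pc1):
    print(label, round(score, 2))
```

If samples separate along the leading components by batch rather than by biology, that suggests batch-driven heterogeneity worth a closer look.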

Gene expression: correcting heterogeneity

batch effect correction methods (regression-based modeling of batch effects, e.g., ComBat, limma)
methods that estimate unmodeled sources of expression heterogeneity (surrogate variable analysis, sva)
among many others
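
To make the regression-based idea concrete, here is a deliberately simplified sketch: a location-only adjustment that subtracts each gene's per-batch mean and adds back the gene's overall mean. This is not ComBat or limma (ComBat additionally adjusts scale and uses empirical Bayes shrinkage); `center_by_batch` and the toy values are assumptions for illustration only.

```python
from collections import defaultdict


def center_by_batch(expr, batch):
    """Subtract each gene's batch mean and add back the gene's overall mean.

    expr: list of samples (lists of per-gene values); batch: one label per sample.
    """
    n_genes = len(expr[0])
    overall = [sum(s[g] for s in expr) / len(expr) for g in range(n_genes)]

    members = defaultdict(list)
    for sample, label in zip(expr, batch):
        members[label].append(sample)

    batch_mean = {
        label: [sum(s[g] for s in rows) / len(rows) for g in range(n_genes)]
        for label, rows in members.items()
    }
    return [
        [s[g] - batch_mean[label][g] + overall[g] for g in range(n_genes)]
        for s, label in zip(expr, batch)
    ]


# Toy data (assumption): two batches with a shift in the first two genes.
expr = [
    [5.0, 3.1, 7.9],
    [5.2, 2.9, 8.1],
    [7.1, 5.0, 8.0],
    [6.9, 5.2, 7.9],
]
batch = ["A", "A", "B", "B"]

corrected = center_by_batch(expr, batch)
for label, row in zip(batch, corrected):
    print(label, [round(x, 2) for x in row])
```

Note that this kind of centering removes any mean difference between batches, including real biological differences if biology differs between batches.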

Variant data: assessing heterogeneity

dimensionality reduction methods on binarized variant data
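
A minimal sketch of the preprocessing this implies: binarize per-sample variant calls into a sample-by-gene 0/1 matrix and compute pairwise Jaccard distances, which could then be passed to MDS or a similar method. The sample IDs, gene names, and helper functions are hypothetical, and Jaccard is just one reasonable distance for binary data.

```python
def binarize(calls, genes):
    """calls: {sample: set of mutated genes} -> {sample: list of 0/1 per gene}."""
    return {
        sample: [1 if g in mutated else 0 for g in genes]
        for sample, mutated in calls.items()
    }


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two 0/1 vectors (0.0 if both are empty)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


# Toy calls (assumption): placeholder sample IDs and gene names.
calls = {
    "s1": {"GENE_A", "GENE_B"},
    "s2": {"GENE_A", "GENE_B", "GENE_C"},
    "s3": {"GENE_D"},
}
genes = sorted(set().union(*calls.values()))
matrix = binarize(calls, genes)

samples = sorted(matrix)
distances = {
    (i, j): jaccard_distance(matrix[i], matrix[j])
    for i in samples
    for j in samples
}
for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
```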

Variant data: correcting heterogeneity

Filtering to enrich for computationally predicted high-impact mutations (SIFT, PolyPhen, etc.)
Filtering out variants that are common in the population (e.g., keeping only variants with an ExAC or dbSNP allele frequency < 0.01)
Reprocessing raw data with a common pipeline
Comparing against targeted sequencing panels, which generally have a better detection rate
Ensembling multiple variant calling tools. Anecdotally, we observed that a tumor variant detected on a targeted sequencing panel was detectable in the whole exome sequencing data from the same sample only when using DeepVariant. (This manuscript is accepted at Sci Data, so we need to revise the DOI before submitting.)
Methods like SHEAR for structural variant detection
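
Two of the items above (ensembling callers and filtering common variants) can be sketched as follows. Everything here is illustrative: the caller names, variant coordinates, and allele frequencies are made up, and treating variants missing from the frequency table as rare is an assumption rather than a recommendation.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_tools=2):
    """Keep variants reported by at least `min_tools` callers."""
    counts = Counter(v for calls in calls_by_tool.values() for v in set(calls))
    return {v for v, n in counts.items() if n >= min_tools}


def filter_rare(variants, population_af, max_af=0.01):
    """Drop variants whose population allele frequency is >= max_af.

    Variants absent from `population_af` are treated as rare (assumption).
    """
    return {v for v in variants if population_af.get(v, 0.0) < max_af}


# Toy variants (assumption): (chrom, pos, ref, alt) tuples with made-up coordinates.
v1 = ("chr1", 1000, "A", "T")
v2 = ("chr2", 2000, "G", "C")
v3 = ("chr3", 3000, "C", "T")

calls_by_tool = {
    "caller_a": [v1, v2],
    "caller_b": [v1, v3],
    "caller_c": [v1, v2],
}
population_af = {v2: 0.15}  # v2 is common in this toy table

majority = consensus_calls(calls_by_tool, min_tools=2)
union = consensus_calls(calls_by_tool, min_tools=1)
rare_majority = filter_rare(majority, population_af)

print(sorted(majority))
print(sorted(union))
print(sorted(rare_majority))
```

The choice of `min_tools` matters: a variant reported by only one caller, as in the DeepVariant anecdote above, would be dropped by a majority vote but kept by a union (`min_tools=1`).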

jaybee84 commented 4 years ago

Adding this paper for consideration under the high-impact mutation prediction point (along with SIFT and PolyPhen). It seems like a good resource for diseases with multigenic possibilities!

(conflict of interest disclaimer: I may have a soft spot for ensemble RFs :P )

allaway commented 4 years ago

An important thing to acknowledge: batches, processing cores, and institutions are often confounded with biological variables such as tumor type or disease state.
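
One quick way to check for this before applying any correction is to cross-tabulate batch against the biological variable. The sketch below uses made-up labels, and `crosstab` and `batches_with_single_group` are hypothetical helper names.

```python
from collections import defaultdict


def crosstab(batch, biology):
    """Count samples for every (batch, biological group) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for b, group in zip(batch, biology):
        table[b][group] += 1
    return {b: dict(row) for b, row in table.items()}


def batches_with_single_group(batch, biology):
    """Return batches that contain only one biological group."""
    table = crosstab(batch, biology)
    return sorted(b for b, row in table.items() if len(row) == 1)


# Toy labels (assumption): three batches, two tumor types.
batch = ["A", "A", "B", "B", "C", "C"]
tumor_type = ["x", "x", "y", "y", "x", "y"]

table = crosstab(batch, tumor_type)
single = batches_with_single_group(batch, tumor_type)

for b in sorted(table):
    print(b, table[b])
print("batches with a single tumor type:", single)
```

When most batches contain a single biological group, batch correction risks removing the biological signal along with the technical one, so the design is worth inspecting first.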

jaybee84 / ml-in-rd