jaybee84 / ml-in-rd

Manuscript for perspective on machine learning in rare disease

Challenges in data generation and harmonization #4

Closed jaybee84 closed 4 years ago

jaybee84 commented 4 years ago
  1. add seminal work references
  2. add description
allaway commented 4 years ago

Techniques to manage disparities in data generation are required to power robust analyses in rare diseases: The rarity of patients leads to heterogeneity in sample collection, which causes disparities in the data. We will discuss how rigorous normalization and methods that capture sample-wise, gene-set-level information can help integrate disparate data points appropriately to power machine learning approaches [11–13].
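
To make "sample-wise gene-set level information" concrete, here is a minimal sketch, not the method from the cited references. It assumes a deliberately simple score: z-score each gene within its own dataset, then average the z-scores of the genes in a set for each sample. The function names, gene names, and toy values are all hypothetical, and published methods for this task are considerably more involved.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Z-score each gene across the samples of ONE dataset.

    expr maps gene -> {sample: value}. Genes with zero variance get 0.0.
    """
    scaled = {}
    for gene, by_sample in expr.items():
        mu = mean(by_sample.values())
        sd = pstdev(by_sample.values())
        scaled[gene] = {
            s: (v - mu) / sd if sd > 0 else 0.0 for s, v in by_sample.items()
        }
    return scaled


def gene_set_scores(expr, gene_sets):
    """Return {set_name: {sample: mean z-score of the set's genes}}."""
    scaled = zscore_genes(expr)
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in scaled]
        if not present:
            continue  # skip gene sets with no measured genes
        samples = scaled[present[0]].keys()
        scores[name] = {s: mean(scaled[g][s] for g in present) for s in samples}
    return scores


# Toy data (hypothetical): two datasets measured on very different scales.
dataset_a = {"G1": {"a1": 1.0, "a2": 3.0}, "G2": {"a1": 2.0, "a2": 6.0}}
dataset_b = {"G1": {"b1": 100.0, "b2": 300.0}, "G2": {"b1": 200.0, "b2": 600.0}}
gene_sets = {"pathway_x": ["G1", "G2"]}

scores_a = gene_set_scores(dataset_a, gene_sets)
scores_b = gene_set_scores(dataset_b, gene_sets)
```

Because each dataset is scaled on its own, the gene-set scores land on a comparable scale even though the raw values differ by two orders of magnitude. Per-dataset scaling can also erase real biological differences between datasets, so this is an illustration of the idea rather than a recommendation.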

allaway commented 4 years ago

There's a lot to possibly talk about here, so let's break this down by data type:

Gene expression: assessing heterogeneity

Gene expression: correcting heterogeneity

Variant data: assessing heterogeneity

Variant data: correcting heterogeneity

jaybee84 commented 4 years ago

adding this paper for consideration in the high-impact mutation prediction point (along with SIFT and PolyPhen) ... seems like a good resource for diseases with multigenic possibilities!

(conflict of interest disclaimer: I may have a soft spot for ensemble RFs :P )

allaway commented 4 years ago

An important thing to acknowledge: batches/processing cores/institutions are often confounded with the biological variables of interest, like tumor type, disease state, etc.
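
A quick way to spot this before attempting any correction is to cross-tabulate batch against the biological variable. The sketch below uses only the Python standard library; the function names, site labels, and tumor labels are hypothetical, and the "single condition per batch" rule is just one simple heuristic for flagging confounding.

```python
from collections import Counter, defaultdict


def crosstab(batches, conditions):
    """Count samples per (batch, condition) pair."""
    counts = defaultdict(Counter)
    for batch, condition in zip(batches, conditions):
        counts[batch][condition] += 1
    return counts


def single_condition_batches(batches, conditions):
    """Return batches whose samples all share one condition.

    In such batches the batch effect cannot be separated from the
    biological effect using that batch alone.
    """
    table = crosstab(batches, conditions)
    return sorted(b for b, row in table.items() if len(row) == 1)


# Toy sample annotations (hypothetical labels).
batches = ["site1", "site1", "site2", "site2", "site3", "site3"]
conditions = ["tumor_A", "tumor_A", "tumor_B", "tumor_B", "tumor_A", "tumor_B"]

table = crosstab(batches, conditions)
flagged = single_condition_batches(batches, conditions)
print(flagged)
```

Here `site1` and `site2` each contain a single tumor type, so any difference between them could be technical or biological. Only `site3`, which has both types, gives a within-batch comparison.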