sritchie73 / ukbnmr

Tools for processing Nightingale NMR biomarker data in UK Biobank

Row missing-value rates increased after running remove_technical_variation() #5

Open shengzheBian opened 1 year ago

shengzheBian commented 1 year ago

Hello,

After processing my data with the package, I noticed that the missing value rate increased for certain rows (see the boxplots comparing the raw and processed data). I would like to understand the reason for this. Could it be related to a warning message that appeared while running the remove_technical_variation() function (shown below)? I would also appreciate guidance on how to handle samples with a high missing value rate.

Thank you for your assistance.

[image] [image] [image]

sritchie73 commented 1 year ago

Hi,

This is expected; part of the QC pipeline is to remove (by setting to NA) samples that are located on outlier plates of non-biological origin.

The best example of this is given in Figure 5 of our Scientific Data publication: https://www.nature.com/articles/s41597-023-01949-y/figures/5

The Figure shows Albumin as an example, where we can see that when grouping Albumin concentrations by Shipping Plate, some plates have extremely high concentrations which are not reflected in the clinical chemistry measurements (UK Biobank field #30600) for those same samples.

This removal of samples on outlier plates is performed on a per-biomarker basis, i.e. some samples will have NAs for some biomarkers but not others for this reason.

As for handling these samples, I've personally found that a missingness threshold of 5% across the 107 non-derived biomarkers works well. Samples with more than 5% missing (5 or more biomarkers) tend to be located on extreme outlier plates for multiple biomarkers, so in my own work I remove them on the grounds that these plates are potentially unreliable across all biomarkers. Samples with <5% missingness tend to have missing values for other, unrelated reasons, i.e. the values have been set to missing in the data released by UK Biobank.
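A minimal sketch of that per-sample missingness filter in base R. It is an illustration only, not part of ukbnmr: the data frame, the `eid` column, the biomarker names, and both helper functions are hypothetical. The toy data has only four biomarker columns, so the example call uses a relaxed cut-off of 0.5; with the real 107 non-derived biomarkers you would leave `max_missing` at 0.05.

```r
# Toy data: 6 samples x 4 biomarker columns (hypothetical names and values).
# In real data `non_derived` would list the 107 non-derived biomarkers.
biomarkers <- data.frame(
  eid = 1:6,
  Ala = c(0.30, NA, 0.40, 0.35, NA, 0.31),
  Alb = c(38, NA, 40, NA, 39, 41),
  Gln = c(0.50, NA, 0.60, 0.55, 0.52, 0.58),
  His = c(0.060, 0.070, NA, 0.065, 0.061, 0.066)
)
non_derived <- c("Ala", "Alb", "Gln", "His")

# Fraction of the selected biomarker columns that are NA, per sample (row)
sample_missingness <- function(df, cols) {
  rowMeans(is.na(df[, cols, drop = FALSE]))
}

# Keep samples whose missingness does not exceed `max_missing`
filter_by_missingness <- function(df, cols, max_missing = 0.05) {
  miss <- sample_missingness(df, cols)
  df[miss <= max_missing, , drop = FALSE]
}

miss <- sample_missingness(biomarkers, non_derived)

# Relaxed threshold for the 4-column toy data only
kept <- filter_by_missingness(biomarkers, non_derived, max_missing = 0.5)
```

Here sample 2 is missing three of four biomarkers and is dropped, while samples with a single NA are kept. Inspecting `miss` before filtering (for example with `hist(miss)`) can help judge whether the 5% cut-off suits a given dataset.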

If you want to find the reason for any NA value, you can cross-check the biomarker concentrations with the biomarker QC flags returned by the remove_technical_variation function.
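One way this cross-check could look in base R is sketched below. It assumes the function returns a list whose `biomarkers` and `biomarker_qc_flags` elements are tables sharing a sample ID column and biomarker column names; those element names, the `eid` column, the flag text, and the `explain_missing` helper are all assumptions for illustration, so check `names()` on your own result first. The toy list stands in for real output so that the example runs on its own.

```r
# With real data this would come from the package, e.g. (not run here):
# processed <- ukbnmr::remove_technical_variation(decoded)  # `decoded` is hypothetical

# Toy stand-in for the returned list (assumed element names, invented flags)
processed <- list(
  biomarkers = data.frame(
    eid = c(101, 102, 103),
    Alb = c(39.5, NA, 41.2),
    Gln = c(NA, 0.55, 0.61)
  ),
  biomarker_qc_flags = data.frame(
    eid = c(101, 102, 103),
    Alb = c(NA, "hypothetical flag: outlier plate", NA),
    Gln = c("hypothetical flag: other", NA, NA),
    stringsAsFactors = FALSE
  )
)

# For every NA concentration, look up the QC flag recorded for the same
# sample and biomarker
explain_missing <- function(values, flags, id_col = "eid") {
  # Align flag rows to the concentration rows by sample ID
  flags <- flags[match(values[[id_col]], flags[[id_col]]), , drop = FALSE]
  # Biomarker columns present in both tables
  cols <- setdiff(intersect(names(values), names(flags)), id_col)
  out <- lapply(cols, function(b) {
    is_missing <- is.na(values[[b]])
    data.frame(
      id = values[[id_col]][is_missing],
      biomarker = rep(b, sum(is_missing)),
      qc_flag = flags[[b]][is_missing],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, out)
}

reasons <- explain_missing(processed$biomarkers, processed$biomarker_qc_flags)
print(reasons)
```

Each row of `reasons` pairs one missing value with whatever flag was recorded for it. An NA in `qc_flag` would mean that no flag was recorded for that missing value.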

You can also opt not to exclude these outlier plates by passing remove.outlier.plates=FALSE as an argument to remove_technical_variation.

The additional warning message you see is unrelated - I tracked this down to four samples whose well position was not being correctly handled: for some reason the well positions are lowercase for these four samples, but uppercase for the rest. I've fixed this in version 2.2.1 so that these four samples are set to have uppercase well positions like the rest of the samples (see issue #6).