philipdarke / ukbb-ehr-data

Prepare UK Biobank Electronic Health Record data for research
MIT License

Missing deduction dates in gp registration data #1

Closed xiaonanl1996 closed 2 years ago

xiaonanl1996 commented 2 years ago

Hi,

First of all, I want to thank you for creating such a useful tool! Many thanks for providing all the great details on cleaning UKB data. I was studying the R script for cleaning primary care data ("01_prepare_data/02_clean_data.R") and noticed something in the part that cleans the registration data (lines 62-67, copied below):

gp_reg <- gp_reg[!is.na(reg_date) &                  # missing registration dates
                 reg_date <= censor_date &           # starting after the censor date
                 reg_date != na_dates$pre_dob &      # pre-birth registrations
                 reg_date != na_dates$future &       # future registrations
                 deduct_date != na_dates$pre_dob &   # pre-birth deduction dates
                 reg_date < deduct_date]             # registration after deduction

The code above would accidentally remove all records with a missing deduction date, which I think is unnecessary and would end up removing lots of data. Please correct me if I'm wrong, but a missing deduction date could mean the participant did not leave the primary care practice.

My solution is below (Sorry I'm not really a data.table user):

gp_reg <- gp_reg %>%
  dplyr::filter(
    !is.na(reg_date) &
      reg_date != na_dates$pre_dob &
      reg_date != na_dates$future
  ) %>%
  dplyr::filter(
    is.na(deduct_date) |
      (!is.na(deduct_date) &
         deduct_date != na_dates$pre_dob &
         reg_date < deduct_date)
  )

Many thanks again.

philipdarke commented 2 years ago

Thanks for getting in touch.

Periods of registration with missing de-registration/deduction dates were assumed to be ongoing at the date of data extract. Lines 57-59 replace these missing dates with the earlier of a) the estimated end of data collection (first date in table 6 here) and b) date of death (if any) in the linked death registry data:

# End practice registration periods at censor date
gp_reg[is.na(deduct_date) | deduct_date > censor_date,
       deduct_date := censor_date]

Therefore the code you highlight does not remove any records with missing de-registration dates. It would if any remained, but by that point they have all been replaced.
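To make the order of operations concrete, here is a minimal sketch on toy data. The dates, the `eid` values and the single global `censor_date` are made up for illustration and are not taken from the repository; it assumes the data.table package is installed.

```r
library(data.table)

# Toy data: three registration periods, one with a missing deduction date.
# All values are hypothetical.
censor_date <- as.Date("2017-09-01")
gp_reg <- data.table(
  eid         = 1:3,
  reg_date    = as.Date(c("2001-05-01", "2010-03-15", "2012-07-01")),
  deduct_date = as.Date(c("2005-06-30", NA, "2019-01-01"))
)

# Without imputation, the comparison against NA evaluates to NA and
# data.table drops that row, which is the concern raised in the issue.
dropped <- gp_reg[reg_date < deduct_date]

# Imputing first ends open or post-censor periods at the censor date,
# so no NA reaches the filter.
imputed <- copy(gp_reg)
imputed[is.na(deduct_date) | deduct_date > censor_date,
        deduct_date := censor_date]
kept <- imputed[reg_date < deduct_date]

nrow(dropped)  # 2
nrow(kept)     # 3
```

The filter itself is unchanged between the two cases; only whether the replacement step has run beforehand differs.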

Take a look at section 1 of the supplementary material to the JAMIA paper for more information.

Also check out https://dtplyr.tidyverse.org if you want to use data.table with the tidyverse syntax. You might struggle to work with the gp_event and gp_script data using data.frames or tibbles.
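As a rough sketch of what that looks like, the snippet below wraps a toy table with `dtplyr::lazy_dt()`, applies a dplyr-style filter, and collects the result back into a data.table. The table contents are hypothetical and the filter is only an illustration, not the repository's cleaning logic; it assumes data.table, dtplyr and dplyr are installed.

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Hypothetical registration periods (not real UK Biobank data).
gp_reg <- data.table(
  eid         = 1:3,
  reg_date    = as.Date(c("2001-05-01", "2010-03-15", "2012-07-01")),
  deduct_date = as.Date(c("2005-06-30", NA, "2011-01-01"))
)

# lazy_dt() records the dplyr verbs and translates them to data.table
# code; nothing is evaluated until the result is collected.
result <- lazy_dt(gp_reg) %>%
  filter(is.na(deduct_date) | reg_date < deduct_date) %>%
  as.data.table()

result$eid  # 1 2
```

Calling `show_query()` on the lazy pipeline before collecting prints the data.table expression that dtplyr generated.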

Please let me know if you spot anything else - feedback is always welcome. I'll close the ticket if I don't hear from you.

xiaonanl1996 commented 2 years ago

Thank you so much for your prompt reply! Yes, you are right, I missed the part where you set missing deduct_date values to the censor date. I'm happy for you to close the issue, thanks again!