Closed xiaonanl1996 closed 2 years ago
Thanks for getting in touch.
Periods of registration with missing de-registration/deduction dates were assumed to be ongoing at the date of data extract. Lines 57-59 replace these missing dates with the earlier of a) the estimated end of data collection (first date in table 6 here) and b) date of death (if any) in the linked death registry data:
# End practice registration periods at censor date
gp_reg[is.na(deduct_date) | deduct_date > censor_date,
deduct_date := censor_date]
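Therefore the code you highlight does not remove any records with missing de-registration dates. It would drop them if any remained, because data.table treats an NA result in the row filter as FALSE, but by that point there aren't any.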
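To make the ordering point concrete, here is a minimal sketch on made-up data. The column names follow the thread, but the values, the single `censor_date`, and the object names `filtered_first` and `replaced_first` are assumptions for illustration only, not taken from the repository.

```r
library(data.table)

# Hypothetical toy data: the values are invented, only the column names
# (reg_date, deduct_date) and censor_date follow the thread.
censor_date <- as.Date("2017-09-01")
gp_reg <- data.table(
  eid         = 1:4,
  reg_date    = as.Date(c("2001-05-01", "2010-03-15", "2012-07-01", "2015-01-01")),
  deduct_date = as.Date(c("2005-06-30", NA, "2019-01-01", "2014-01-01"))
)

# Filtering BEFORE the replacement: reg_date < NA evaluates to NA, and
# data.table treats NA in the row filter as FALSE, so eid 2 is dropped.
filtered_first <- gp_reg[reg_date < deduct_date]

# Replacing missing or post-censor deduction dates first (as in the snippet
# quoted above), then filtering: eid 2 is now kept, and only the genuinely
# inconsistent record (eid 4, registered after deduction) is removed.
gp_reg[is.na(deduct_date) | deduct_date > censor_date,
       deduct_date := censor_date]
replaced_first <- gp_reg[reg_date < deduct_date]

print(filtered_first$eid)
print(replaced_first$eid)
```

On this toy table the only difference between the two orderings is the record with the missing deduction date, which is the case raised in the question.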
Take a look at section 1 of the supplementary material to the JAMIA paper for more information.
Also check out https://dtplyr.tidyverse.org if you want to use data.table with the tidyverse syntax. You might struggle to work with the gp_event and gp_script data using data.frames or tibbles.
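As a rough illustration of that suggestion, the sketch below writes a dplyr-style filter that dtplyr translates into data.table code. The toy table and the name `kept` are hypothetical, and the filter is a simplified stand-in, not the cleaning rule used in the repository.

```r
library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

# Hypothetical toy data: invented values, column names as in the thread.
gp_reg <- data.table(
  eid         = 1:3,
  reg_date    = as.Date(c("2001-05-01", "2010-03-15", "2015-01-01")),
  deduct_date = as.Date(c("2005-06-30", NA, "2014-01-01"))
)

# lazy_dt() wraps the data.table; the dplyr verbs are only translated and
# executed when the result is collected with as.data.table().
kept <- gp_reg %>%
  lazy_dt() %>%
  filter(!is.na(reg_date),
         is.na(deduct_date) | reg_date < deduct_date) %>%
  as.data.table()

print(kept$eid)
```

Here the explicit `is.na(deduct_date) |` clause is what keeps the open-ended registration (eid 2); without it the NA comparison would drop that row, just as in plain data.table.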
Please let me know if you spot anything else - feedback is always welcome. I'll close the ticket if I don't hear from you.
Thank you so much for your prompt reply! Yes, you are right: I missed the part where you set missing deduct_date values to the censoring date. I'm happy for you to close the issue, thanks again!
Hi,
First of all I want to thank you for creating such a useful tool! Many thanks for providing all the great details on cleaning UKB data. I was studying the R script for cleaning primary care data ("01_prepare_data/02_clean_data.R") and have a question about the part that cleans the registration data (lines 62-67, copied below):
gp_reg <- gp_reg[!is.na(reg_date) &                     # missing registration dates
                 reg_date <= censor_date &              # starting after the censor date
                 reg_date != na_dates$pre_dob &         # pre-birth registrations
                 reg_date != na_dates$future &          # future registrations
                 deduct_date != na_dates$pre_dob &      # pre-birth deduction dates
                 reg_date < deduct_date]                # registration after deduction
The code above would accidentally remove every record with a missing deduction date, which I think is unnecessary and would end up discarding a lot of data. Please correct me if I'm wrong, but a missing deduction date could simply mean that the participant never left the primary care practice.
My solution is below (Sorry I'm not really a data.table user):
gp_reg <- gp_reg %>%
  dplyr::filter(
    !is.na(reg_date) &
      reg_date != na_dates$pre_dob &
      reg_date != na_dates$future
  ) %>%
  dplyr::filter(
    is.na(deduct_date) |
      (!is.na(deduct_date) &
         deduct_date != na_dates$pre_dob &
         reg_date < deduct_date)
  )
Many thanks again.