sharrison5 opened 3 years ago
Difference between full, short & screening assessments
With @kylahorne.
Full assessments (`full_assessment` in the database) contain the full battery of ≈25 neuropsych tests (i.e. those used for `global_z`), as well as all the other measures. However, note that short sessions should still have an associated significant-other session (i.e. NPI etc. should still be present).
Extra information after discussions with Leslie (2020-11-14)
The 2015 cohort was screened, so the baseline sessions are nearly all minimal screening sessions (`study == "Screening PD"` or `session_suffix == "S1"`). These included MoCA, HADS, and a handful of neuropsych tests (but no UPDRS, and no significant-other session).
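In case it's useful, here is a minimal sketch of how these screening sessions could be flagged. It assumes a data frame with the `study` and `session_suffix` columns used in this thread; `flag_screening` is a hypothetical helper, not part of `chchpd`:

```r
library(dplyr)

# Hypothetical helper: flag minimal screening sessions from the 2015 cohort,
# using the two indicators described above.
flag_screening <- function(data) {
  data %>%
    mutate(is_screening = study == "Screening PD" | session_suffix == "S1")
}
```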
Missing NPI data from 2015 / 2019
There are two obvious patterns in the missing NPI sessions: for the cohort enrolled in 2015, nearly all the baseline sessions are missing NPI data, as are nearly all the sessions collected in 2019. However, NPI data is present for 2019 sessions from subjects enrolled between 2007 and 2011, so I don't think this is just an out-of-date spreadsheet issue. For the 2015 sessions, @kylahorne has confirmed the data is missing in raw form (i.e. it was likely never collected).
If this is because these are a different type of session, where NPI data isn't expected, then it would be useful to have that recorded somewhere. Any further information would be much appreciated!
Data source: `npi_2020-07-13.csv` from @zenourn. `NPI_apathy_present` is coded as `TRUE` / `FALSE` for the answer to the apathy screening question, and `NA` indicates missing data.
Example 2019 sessions from 2015 cohort:
```r
full_data %>%
  filter(lubridate::year(date_baseline) == 2015) %>%
  filter(lubridate::year(session_date) == 2019) %>%
  select(session_id, NPI_apathy_present)
# A tibble: 81 x 2
   session_id        NPI_apathy_present
   <chr>             <lgl>
 1 158PDS_2019-02-14 FALSE
 2 158PDS_2019-12-06 NA
 3 163PDS_2019-01-17 FALSE
 4 164PDS_2019-10-08 NA
 5 165PDS_2019-08-01 NA
 6 168PDS_2019-03-19 NA
 7 170PDS_2019-10-07 NA
 8 172PDS_2019-08-01 NA
 9 173PDS_2019-05-17 FALSE
10 177PDS_2019-04-05 TRUE
# … with 71 more rows
```
Example 2019 sessions from pre-2012 cohort:
```r
full_data %>%
  filter(lubridate::year(date_baseline) < 2012) %>%
  filter(lubridate::year(session_date) == 2019) %>%
  select(session_id, NPI_apathy_present)
# A tibble: 70 x 2
   session_id        NPI_apathy_present
   <chr>             <lgl>
 1 004KJS_2019-01-21 FALSE
 2 008BIO_2019-01-12 TRUE
 3 008PHG_2019-08-26 FALSE
 4 009BIO_2019-03-14 FALSE
 5 010MAN_2019-07-16 FALSE
 6 014BIO_2019-03-14 FALSE
 7 014RWW_2019-07-17 FALSE
 8 016BIO_2019-05-23 NA
 9 018BIO_2019-03-07 FALSE
10 021R-J_2019-06-28 TRUE
# … with 60 more rows
```
@sharrison5 I have just looked at the first 5 NA NPI entries from your code, and 4 of them have raw NPI data from their 2019 sessions. @zenourn, did the raw NPI data file you generated include data from both Access and REDCap, or just data from Access?
That NPI file was an export from REDCap and was the exact same one used for your paper. Looking in npi_2020-07-13.csv, for 158PDS_2019-12-06 the apathy-present value (cell AN1267) is 1, not NA. I might be missing something, but is there possibly a code issue here?
This is super weird, you're right. There's something going on with the session_id
in my spreadsheet (see row 10):
```r
read_csv(file.path("..", "Data", "npi_2020-07-13.csv")) %>%
  filter(subject_id == "158PDS") %>%
  select(subject_id, session_id, npi_date, npi_g_present)
[[ Clipped column specification ]]
# A tibble: 10 x 4
   subject_id session_id        npi_date   npi_g_present
   <chr>      <chr>             <date>             <dbl>
 1 158PDS     158PDS_2015-03-06 2015-07-14             0
 2 158PDS     158PDS_2015-10-07 2015-10-07            NA
 3 158PDS     158PDS_2016-04-27 2016-05-11             1
 4 158PDS     158PDS_2016-10-04 2016-10-04             1
 5 158PDS     158PDS_2017-03-23 2017-05-23             0
 6 158PDS     158PDS_2017-09-12 2017-09-12             1
 7 158PDS     158PDS_2018-02-13 2018-02-20             1
 8 158PDS     158PDS_2018-08-22 2018-08-22             0
 9 158PDS     158PDS_2019-02-14 2019-02-14             0
10 158PDS     158PDS_2015-02-10 2019-12-06             1
```
In case it helps:
```
samh@apollo:Data $ md5sum npi_2020-07-13.csv
b784ad6bf2cefe227afa776263a30aba npi_2020-07-13.csv
```
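For anyone checking their own copy, a rough sketch of how to surface these mis-linked rows by comparing the date embedded in `session_id` against `npi_date`. This assumes the `<subject>_<YYYY-MM-DD>` format shown above; `find_mislinked` and the 120-day threshold are my own illustration, not part of the export code:

```r
library(dplyr)
library(lubridate)

# Hypothetical check: flag rows where the date embedded in session_id is
# far from npi_date (the signature of the mis-linked row 10 above).
find_mislinked <- function(npi, max_days = 120) {
  npi %>%
    mutate(id_date = ymd(sub("^[^_]+_", "", session_id))) %>%
    filter(abs(as.numeric(npi_date - id_date, units = "days")) > max_days) %>%
    select(subject_id, session_id, npi_date)
}
```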
Yes, the same issues are present in my copy of the file; it is a code issue, but not your code! I've fixed the issue in the code from Kyla's paper that was used to generate this file. A few of us have worked on that code, but it isn't in GitHub :-( so it's hard to know when the bug was introduced.
There is a case in REDCap where record instances can become out of sync between instruments and the correct way to do the linkage back to sessions is shown here: https://github.com/nzbri/redcap/blob/master/R/import_example.Rmd
I've put the updated file in a Google Shared Drive that you should have access to - if you txt me your cell (0272222585) I'll txt you back the password for the encrypted archive.
Excellent detective work!!
I've just been going through the new spreadsheet, thank you!! While I think the issue is fixed, unfortunately it's not the case that we end up with more sessions:
```r
> mean(is.na(old_data$NPI_total))
[1] 0.253689
> mean(is.na(new_data$NPI_total))
[1] 0.2548241
```
Rather, the updated spreadsheet pushes the 'missingness' into the 2015 baselines. However, the sessions do now appear to be correctly matched:
```r
> read_csv(file.path("..", "Data", "npi_2020-07-13_v2.csv")) %>%
+   filter(subject_id == "158PDS" & lubridate::year(session_date) == 2015) %>%
+   select(subject_id, session_id, npi_date, npi_g_present)
[[ Clipped column specification ]]
# A tibble: 2 x 4
  subject_id session_id        npi_date   npi_g_present
  <chr>      <chr>             <date>             <dbl>
1 158PDS     158PDS_2015-03-06 2015-07-14             0
2 158PDS     158PDS_2015-10-07 2015-10-07            NA
> new_data %>%
+   filter(subject_id == "158PDS" & lubridate::year(session_date) == 2015) %>%
+   select(subject_id, session_id, NPI_date, NPI_apathy_present)
# A tibble: 3 x 4
  subject_id session_id        NPI_date   NPI_apathy_present
  <chr>      <chr>             <date>     <lgl>
1 158PDS     158PDS_2015-02-10 NA         NA
2 158PDS     158PDS_2015-03-06 2015-07-14 FALSE
3 158PDS     158PDS_2015-10-07 2015-10-07 NA
> old_data %>%
+   filter(subject_id == "158PDS" & lubridate::year(session_date) == 2015) %>%
+   select(subject_id, session_id, NPI_date, NPI_apathy_present)
# A tibble: 3 x 4
  subject_id session_id        NPI_date   NPI_apathy_present
  <chr>      <chr>             <date>     <lgl>
1 158PDS     158PDS_2015-02-10 2019-12-06 TRUE
2 158PDS     158PDS_2015-03-06 2015-07-14 FALSE
3 158PDS     158PDS_2015-10-07 2015-10-07 NA
```
There are still a few date mismatch errors, but these are sporadic enough that I'm not worried (though I will add an exclusion criterion):
```r
> npi %>%
+   filter(abs(as.numeric(npi_date - session_date, units = "days")) > 120) %>%
+   select(session_id, session_date, npi_date)
# A tibble: 19 x 3
   session_id          session_date npi_date
   <chr>               <date>       <date>
 1 006BIO_2016-02-05   2016-02-05   2015-03-09 <- Fixed, date wrong in raw data
 2 027LPR-C_2009-01-06 2009-01-06   2018-01-31 <- Fixed, date wrong in raw data
 3 028BIO_2017-06-01   2017-06-01   2011-07-11 <- Fixed, date wrong in raw data
 4 045JYH_2008-07-09   2008-07-09   2008-12-12 <- S.O. session was completed on 12/12/2008
 5 047BIO_2017-02-16   2017-02-16   2007-03-02 <- Fixed, date wrong in raw data
 6 056BIO_2009-06-25   2009-06-25   2008-07-02 <- Raw data dated as the day it was entered (i.e. different days for each measure). Have reverted to the date collected from the Alice record (i.e. 16-07-2009).
 7 074BIO_2018-02-14   2018-02-14   2015-03-14 <- Fixed, date wrong in raw data. Multiple erroneous dates in file
 8 080BIO_2017-05-10   2017-05-10   2017-01-09 <- Fixed, date wrong in raw data. Multiple erroneous dates in file
 9 137ADL_2016-09-30   2016-09-30   2016-03-30 <- Fixed, date wrong in raw data.
10 158PDS_2015-03-06   2015-03-06   2015-07-14 <- Fixed, date wrong in raw data + Alice
11 166PDS_2015-04-01   2015-04-01   2014-04-20 <- S.O. session was completed on 20/04/2015
12 195PDS_2015-11-02   2015-11-02   2015-05-09 <- Fixed, date wrong in raw data.
13 210PDS_2017-10-27   2017-10-27   2017-01-11 <- Fixed, date wrong in raw data.
14 258PDS_2017-09-01   2017-09-01   2018-09-01 <- Fixed, date wrong in raw data.
15 275PDS_2017-11-01   2017-11-01   2018-03-02 <- S.O. session was completed on 02/03/2018
16 303PDS_2018-01-19   2018-01-19   2001-01-19 <- Fixed, date wrong in raw data.
17 342PDS_2016-09-19   2016-09-19   2017-09-12 <- CANNOT LOCATE FILE. Assume that Alice is correct and have fixed, date wrong in raw data.
18 352PDS_2016-11-23   2016-11-23   2017-11-23 <- CANNOT LOCATE FILE. Assume that Alice is correct and have fixed, date wrong in raw data.
19 366PDS_2020-03-03   2020-03-03   2020-10-10 <- Fixed, date wrong in raw data.
```
Edit: full list added @zenourn.
Outliers in LED
Just noticed the following potential issue with LED based on the outlier in #16! This is for the subject below:
```r
> full_data %>% filter(subject_id == "097BIO") %>% select(session_id, LED)
# A tibble: 6 x 2
  session_id          LED
  <chr>             <dbl>
1 097BIO_2010-02-09     0
2 097BIO_2012-03-06   440
3 097BIO_2014-01-28  1066
4 097BIO_2016-02-22   950.
5 097BIO_2018-02-01  1074.
6 097BIO_2020-01-14  8250.
```
Going forwards, what's the best way of dealing with things like this cropping up? Would it be easier for everyone else if I could modify these things myself?
Thanks! 😄
Thanks! In this case I think it is due to pramipexole being entered as 75 mg rather than 0.75 mg. I've fixed this and it should update after the next export (~10 mins).
This is a great way of dealing with things like this; I handle a lot of data quality issues and can generally fix them quickly.
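In case it helps with spotting these before they reach an analysis, here is a sketch of a simple within-subject check. The `flag_led_jumps` helper and the ×4 threshold are illustrative assumptions, using the `subject_id`, `session_id`, and `LED` columns as above:

```r
library(dplyr)

# Illustrative check: flag sessions where LED jumps to more than
# `ratio` times the previous session's value for the same subject.
flag_led_jumps <- function(data, ratio = 4) {
  data %>%
    group_by(subject_id) %>%
    arrange(session_id, .by_group = TRUE) %>%
    mutate(prev_LED = lag(LED)) %>%
    ungroup() %>%
    filter(!is.na(prev_LED), prev_LED > 0, LED > ratio * prev_LED) %>%
    select(session_id, prev_LED, LED)
}
```

On the 097BIO data above this would flag only the 2020 session, where LED jumps from ≈1074 to ≈8250.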
Duplicate scan numbers
Some sessions have duplicate scan numbers (which seem to correspond to 2015 AnxS0 or CE1 sessions). Not a huge issue as the sessions are often close together, but it would be useful to know at which session the scan actually happened. Thanks!
```r
sessions <- chchpd::import_sessions(exclude = FALSE)

sessions %>%
  # Propagate scan numbers to same session_id
  mutate(mri_scan_no = na_if(mri_scan_no, "None")) %>%
  group_by(session_id) %>%
  fill(mri_scan_no, .direction = "downup") %>%
  mutate(studies = paste0(session_suffix, collapse = ", ")) %>%
  ungroup() %>%
  # Remove duplicate sessions
  distinct(session_id, mri_scan_no, studies) %>%
  # Extract duplicate scan numbers
  filter(!is.na(mri_scan_no)) %>%
  group_by(mri_scan_no) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  print(n = Inf)
# A tibble: 18 x 3
   session_id        mri_scan_no studies
   <chr>             <chr>       <chr>
 1 194PDS_2015-07-09 41098       AnxS0
 2 194PDS_2015-09-09 41098       F0
 3 204PDS_2015-05-21 40319       AnxS0
 4 204PDS_2015-07-01 40319       F0
 5 246PDS_2015-07-11 41099       AnxS0
 6 246PDS_2015-11-03 41099       F0
 7 260PDS_2015-10-21 42544       F0, PET0
 8 260PDS_2015-12-04 42544       CE1
 9 261PDS_2015-08-11 41682       AnxS0
10 261PDS_2015-10-20 41682       F0, PET0
11 271PDS_2015-07-21 41345       AnxS0
12 271PDS_2015-08-28 41345       F0
13 280PDS_2015-11-12 43122       F0
14 280PDS_2015-12-08 43122       CE1
15 289PDS_2015-08-06 41582       AnxS0
16 289PDS_2015-11-03 41582       F0, PET0
17 320PDS_2015-08-31 42046       AnxS0
18 320PDS_2015-10-23 42046       F0
```
CC: @tracymelzer
So this happened when people had been in the Anxiety study and had an MRI scan, but then had a full assessment a couple of months later. There was no point in re-scanning, so the scan from the previous session was used. The mri_scan_no here is the MRI scan number field in Alice for each session, but we actually have a much more complete source for MRI data.
If you use `mri = import_MRI()` you'll get a dataframe with `session_id`, `scan_no`, `scan_date`, etc. This should not have any duplicates; however, in cases where it links the MRI scan to the AnxS0 session, you might actually want to link it to the F0 session instead, because the F0 session has much more complete data. Things have become even more complicated lately, with MRI scans and cognitive assessments often separated by a huge gap due to COVID-19, when we could scan people but not assess them. Given the MRI data, you almost want to manually optimise which session it is best to link each scan to, given the assessment data you require and a time-delta penalty.
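To make the time-delta-penalty idea concrete, here is a rough sketch. The column names, the AnxS0-specific penalty, and the 90-day value are illustrative assumptions, not part of the `chchpd` API:

```r
library(dplyr)

# Illustrative linkage: for each scan, pick the candidate session that
# minimises |scan_date - session_date|, with a fixed penalty for the
# less-complete AnxS0 sessions so F0 wins unless it is much further away.
link_scans <- function(mri, sessions, penalty_days = 90) {
  mri %>%
    inner_join(sessions, by = "subject_id", relationship = "many-to-many") %>%
    mutate(gap  = abs(as.numeric(session_date - scan_date, units = "days")),
           cost = gap + if_else(session_suffix == "AnxS0", penalty_days, 0)) %>%
    group_by(scan_no) %>%
    slice_min(cost, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(scan_no, session_id, gap)
}
```

With these toy settings, scan 41098 above would link to the 194PDS F0 session despite the two-month gap, matching the preference for the more complete assessment.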
Awesome, thanks! That's great to know how it all fits together, and as you say it sounds like taking `scan_date` and a bit of custom linking code is the way to go 👍
As suggested by @zenourn, this issue is a dynamic list of quick questions about the different study protocols and database itself. Some previous questions can be found in #1 and #2.
More detail on individual items is given in the comments.