nzbri / pd-apathy


General protocol / database questions #8

Open sharrison5 opened 3 years ago

sharrison5 commented 3 years ago

As suggested by @zenourn, this issue is a dynamic list of quick questions about the different study protocols and database itself. Some previous questions can be found in #1 and #2.

More detail on individual items is given in the comments below.

sharrison5 commented 3 years ago

Difference between full, short & screening assessments

With @kylahorne.

Note that short sessions should still have an associated significant other session (i.e. the NPI etc. should still be present).

Extra information after discussions with Leslie (2020-11-14)

The 2015 cohort was screened, so the baseline sessions are nearly all minimal screening sessions (study == "Screening PD" or session_suffix == "S1"). These included MoCA, HADS, a handful of neuropsych tests (but no UPDRS, and no significant other session).
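
A minimal sketch of pulling these out, assuming full_data carries the study and session_suffix columns mentioned above:

library(dplyr)

# Flag the 2015 screening baselines using the markers described above
full_data %>%
  filter(study == "Screening PD" | session_suffix == "S1") %>%
  count(study, session_suffix)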

sharrison5 commented 3 years ago

Missing NPI data from 2015 / 2019

There are two obvious patterns in the missing NPI sessions: for the cohort enrolled in 2015, nearly all the baseline sessions are missing NPI data, as are nearly all of that cohort's sessions collected in 2019. However, NPI data is present for 2019 sessions from subjects enrolled between 2007 and 2011, so I don't think this is just an out-of-date spreadsheet issue. For the 2015 sessions, @kylahorne has confirmed the data is missing in raw form (i.e. it was likely never collected).

If this was due to these being from a different type of session, where NPI data isn't to be expected, then it would be useful to have that recorded somewhere. Any further information much appreciated!

Data source: npi_2020-07-13.csv from @zenourn. NPI_apathy_present is coded as TRUE / FALSE for the answer to the apathy screening question, and NA indicates missing data.
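
A quick cross-tabulation along these lines would make both patterns explicit. This is a sketch, assuming the full_data columns used elsewhere in this thread:

library(dplyr)

# Proportion of sessions missing NPI data, by enrolment year x session year
full_data %>%
  group_by(enrolled = lubridate::year(date_baseline),
           assessed = lubridate::year(session_date)) %>%
  summarise(prop_NPI_missing = mean(is.na(NPI_apathy_present)), .groups = "drop")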

Example 2019 sessions from 2015 cohort:

full_data %>%
  filter(lubridate::year(date_baseline) == 2015) %>%
  filter(lubridate::year(session_date) == 2019) %>%
  select(session_id, NPI_apathy_present)
# A tibble: 81 x 2
   session_id        NPI_apathy_present
   <chr>             <lgl>             
 1 158PDS_2019-02-14 FALSE             
 2 158PDS_2019-12-06 NA                
 3 163PDS_2019-01-17 FALSE             
 4 164PDS_2019-10-08 NA                
 5 165PDS_2019-08-01 NA                
 6 168PDS_2019-03-19 NA                
 7 170PDS_2019-10-07 NA                
 8 172PDS_2019-08-01 NA                
 9 173PDS_2019-05-17 FALSE             
10 177PDS_2019-04-05 TRUE              
# … with 71 more rows

Example 2019 sessions from pre-2012 cohort:

full_data %>%
  filter(lubridate::year(date_baseline) < 2012) %>%
  filter(lubridate::year(session_date) == 2019) %>%
  select(session_id, NPI_apathy_present)
# A tibble: 70 x 2
   session_id        NPI_apathy_present
   <chr>             <lgl>             
 1 004KJS_2019-01-21 FALSE             
 2 008BIO_2019-01-12 TRUE              
 3 008PHG_2019-08-26 FALSE             
 4 009BIO_2019-03-14 FALSE             
 5 010MAN_2019-07-16 FALSE             
 6 014BIO_2019-03-14 FALSE             
 7 014RWW_2019-07-17 FALSE             
 8 016BIO_2019-05-23 NA                
 9 018BIO_2019-03-07 FALSE             
10 021R-J_2019-06-28 TRUE              
# … with 60 more rows
kylahorne commented 3 years ago

@sharrison5 I have just looked at the first 5 NA NPI entries from your code and 4 of them have raw NPI data from their 2019 sessions. @zenourn did the raw NPI data file you generated include data from both Access and REDCap, or just from Access?

zenourn commented 3 years ago

That NPI file was an export from REDCap and was the exact same one used for your paper. Looking in npi_2020-07-13.csv, for 158PDS_2019-12-06 the apathy-present value (cell AN1267) is 1, not NA. I might be missing something, but is there possibly a code issue here?

sharrison5 commented 3 years ago

This is super weird, you're right. There's something going on with the session_id in my spreadsheet (see row 10):

read_csv(file.path("..", "Data", "npi_2020-07-13.csv")) %>%
  filter(subject_id == "158PDS") %>%
  select(subject_id, session_id, npi_date, npi_g_present)

    [[ Clipped column specification ]]

# A tibble: 10 x 4
   subject_id session_id        npi_date   npi_g_present
   <chr>      <chr>             <date>             <dbl>
 1 158PDS     158PDS_2015-03-06 2015-07-14             0
 2 158PDS     158PDS_2015-10-07 2015-10-07            NA
 3 158PDS     158PDS_2016-04-27 2016-05-11             1
 4 158PDS     158PDS_2016-10-04 2016-10-04             1
 5 158PDS     158PDS_2017-03-23 2017-05-23             0
 6 158PDS     158PDS_2017-09-12 2017-09-12             1
 7 158PDS     158PDS_2018-02-13 2018-02-20             1
 8 158PDS     158PDS_2018-08-22 2018-08-22             0
 9 158PDS     158PDS_2019-02-14 2019-02-14             0
10 158PDS     158PDS_2015-02-10 2019-12-06             1

In case it helps:

samh@apollo:Data $ md5sum npi_2020-07-13.csv 
b784ad6bf2cefe227afa776263a30aba  npi_2020-07-13.csv
zenourn commented 3 years ago

Yes, I have the same issue present in my copy of the file; it is a code issue, but not your code! I've fixed the issue in the code from Kyla's paper that was used to generate this file. A few of us have worked on that code, but it isn't in GitHub :-( so it's hard to know when the bug was introduced.

There is a case in REDCap where record instances can become out of sync between instruments; the correct way to do the linkage back to sessions is shown here: https://github.com/nzbri/redcap/blob/master/R/import_example.Rmd
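
For illustration only (the Rmd linked above is the authoritative approach), the underlying idea is to link each instrument record to its subject's nearest session by date, rather than trusting the repeat-instance numbering. A sketch, with assumed input names (npi_raw, sessions):

library(dplyr)

npi_linked <- npi_raw %>%
  # All candidate sessions for each subject...
  inner_join(sessions, by = "subject_id") %>%
  # ...then keep the session closest in time to each NPI record
  mutate(gap = abs(as.numeric(npi_date - session_date, units = "days"))) %>%
  group_by(subject_id, npi_date) %>%
  slice_min(gap, n = 1, with_ties = FALSE) %>%
  ungroup()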

I've put the updated file in a Google Shared Drive that you should have access to - if you txt me your cell (0272222585) I'll txt you back the password for the encrypted archive.

Excellent detective work!!

sharrison5 commented 3 years ago

I've just been going through the new spreadsheet — thank you!! While I think the issue is fixed, unfortunately it's not the case that we end up with more sessions:

> mean(is.na(old_data$NPI_total))
[1] 0.253689
> mean(is.na(new_data$NPI_total))
[1] 0.2548241

Rather, the updated spreadsheet pushes the 'missingness' into the 2015 baselines. However, the sessions do now appear to be correctly matched:

> read_csv(file.path("..", "Data", "npi_2020-07-13_v2.csv")) %>%
+   filter(subject_id == "158PDS" & lubridate::year(session_date) == 2015) %>%
+   select(subject_id, session_id, npi_date, npi_g_present)

    [[ Clipped column specification ]]

# A tibble: 2 x 4
  subject_id session_id        npi_date   npi_g_present
  <chr>      <chr>             <date>             <dbl>
1 158PDS     158PDS_2015-03-06 2015-07-14             0
2 158PDS     158PDS_2015-10-07 2015-10-07            NA

> new_data %>%
+   filter(subject_id == "158PDS" & lubridate::year(session_date) == 2015) %>%
+   select(subject_id, session_id, NPI_date, NPI_apathy_present)

# A tibble: 3 x 4
  subject_id session_id        NPI_date   NPI_apathy_present
  <chr>      <chr>             <date>     <lgl>             
1 158PDS     158PDS_2015-02-10 NA         NA                
2 158PDS     158PDS_2015-03-06 2015-07-14 FALSE             
3 158PDS     158PDS_2015-10-07 2015-10-07 NA                

> old_data %>%
+   filter(subject_id == "158PDS" & lubridate::year(session_date) == 2015) %>%
+   select(subject_id, session_id, NPI_date, NPI_apathy_present)

# A tibble: 3 x 4
  subject_id session_id        NPI_date   NPI_apathy_present
  <chr>      <chr>             <date>     <lgl>             
1 158PDS     158PDS_2015-02-10 2019-12-06 TRUE              
2 158PDS     158PDS_2015-03-06 2015-07-14 FALSE             
3 158PDS     158PDS_2015-10-07 2015-10-07 NA                

There are still a few date mismatch errors, but these are sporadic enough that I'm not worried (though I will add an exclusion criterion):

> npi %>%
+   filter(abs(as.numeric(npi_date - session_date, units = "days")) > 120) %>%
+   select(session_id, session_date, npi_date)

# A tibble: 19 x 3
   session_id          session_date npi_date  
   <chr>               <date>       <date>    
 1 006BIO_2016-02-05   2016-02-05   2015-03-09 <- Fixed, date wrong in raw data
 2 027LPR-C_2009-01-06 2009-01-06   2018-01-31 <- Fixed, date wrong in raw data
 3 028BIO_2017-06-01   2017-06-01   2011-07-11 <- Fixed, date wrong in raw data
 4 045JYH_2008-07-09   2008-07-09   2008-12-12 <- S.O. session was completed on 12/12/2008
 5 047BIO_2017-02-16   2017-02-16   2007-03-02 <- Fixed, date wrong in raw data
 6 056BIO_2009-06-25   2009-06-25   2008-07-02 <- Raw data dated as the day it was entered (i.e. different days for each measure). Have reverted to the date collected from the Alice record (i.e. 16-07-2009).
 7 074BIO_2018-02-14   2018-02-14   2015-03-14 <- Fixed, date wrong in raw data. Multiple erroneous dates in file
 8 080BIO_2017-05-10   2017-05-10   2017-01-09 <- Fixed, date wrong in raw data. Multiple erroneous dates in file
 9 137ADL_2016-09-30   2016-09-30   2016-03-30 <- Fixed, date wrong in raw data.
10 158PDS_2015-03-06   2015-03-06   2015-07-14 <- Fixed, date wrong in raw data + Alice
11 166PDS_2015-04-01   2015-04-01   2014-04-20 <- S.O. session was completed on 20/04/2015
12 195PDS_2015-11-02   2015-11-02   2015-05-09 <- Fixed, date wrong in raw data.
13 210PDS_2017-10-27   2017-10-27   2017-01-11 <- Fixed, date wrong in raw data.
14 258PDS_2017-09-01   2017-09-01   2018-09-01 <- Fixed, date wrong in raw data.
15 275PDS_2017-11-01   2017-11-01   2018-03-02 <- S.O. session was completed on 02/03/2018
16 303PDS_2018-01-19   2018-01-19   2001-01-19 <- Fixed, date wrong in raw data.
17 342PDS_2016-09-19   2016-09-19   2017-09-12 <- CANNOT LOCATE FILE. Assume that Alice is correct and have fixed, date wrong in raw data.
18 352PDS_2016-11-23   2016-11-23   2017-11-23 <- CANNOT LOCATE FILE. Assume that Alice is correct and have fixed, date wrong in raw data.
19 366PDS_2020-03-03   2020-03-03   2020-10-10 <- Fixed, date wrong in raw data.
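
The exclusion criterion could simply drop NPI values collected more than the 120-day threshold used above from the main session. A minimal sketch:

# Keep rows with no NPI date, or an NPI date within 120 days of the session
npi %>%
  filter(is.na(npi_date) |
           abs(as.numeric(npi_date - session_date, units = "days")) <= 120)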

Edit: full list added @zenourn.

sharrison5 commented 3 years ago

Outliers in LED

Just noticed the following potential issue with LED based on the outlier in #16! This is for the subject below:

> full_data %>% filter(subject_id == "097BIO") %>% select(session_id, LED)
# A tibble: 6 x 2
  session_id          LED
  <chr>             <dbl>
1 097BIO_2010-02-09    0 
2 097BIO_2012-03-06  440 
3 097BIO_2014-01-28 1066 
4 097BIO_2016-02-22  950.
5 097BIO_2018-02-01 1074.
6 097BIO_2020-01-14 8250.
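
A cheap guard against similar entry errors would be to flag implausible within-subject jumps. A sketch, assuming session_date is available in full_data (the fivefold threshold is arbitrary):

library(dplyr)

# Flag sessions where LED jumps to more than 5x the subject's previous value
full_data %>%
  group_by(subject_id) %>%
  arrange(session_date, .by_group = TRUE) %>%
  mutate(prev_LED = lag(LED)) %>%
  ungroup() %>%
  filter(prev_LED > 0, LED / prev_LED > 5) %>%
  select(session_id, prev_LED, LED)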

Going forwards, what's the best way of dealing with things like this cropping up? Would it be easier for everyone else if I could modify these things myself?

Thanks! 😄

zenourn commented 3 years ago

Thanks! In this case I think it is due to pramipexole being entered as 75 mg rather than 0.75 mg. With the usual ×100 LED conversion factor for pramipexole, that error alone adds roughly 7,500 to the total, which is about the size of the jump. I've fixed this and it should update after the next export (~10 mins).

Raising these here is a great way of dealing with them: I deal with a lot of data quality issues and can generally fix them quite quickly.

sharrison5 commented 3 years ago

Duplicate scan numbers

Some sessions share duplicate scan numbers (these seem to correspond to 2015 AnxS0 or CE1 sessions). Not a huge issue, as the sessions are usually close together, but it would be useful to know at which session the scan actually happened. Thanks!

library(dplyr)
library(tidyr)  # fill() comes from tidyr

sessions <- chchpd::import_sessions(exclude = FALSE)
sessions %>%
  # Propagate scan numbers to same session_id
  mutate(mri_scan_no = na_if(mri_scan_no, "None")) %>%
  group_by(session_id) %>%
  fill(mri_scan_no, .direction = "downup") %>%
  mutate(studies = paste0(session_suffix, collapse = ", ")) %>%
  ungroup() %>%
  # Remove duplicate sessions
  distinct(session_id, mri_scan_no, studies) %>%
  # Extract duplicate scan numbers
  filter(!is.na(mri_scan_no)) %>%
  group_by(mri_scan_no) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  print(n = Inf)
# A tibble: 18 x 3
   session_id        mri_scan_no studies 
   <chr>             <chr>       <chr>   
 1 194PDS_2015-07-09 41098       AnxS0   
 2 194PDS_2015-09-09 41098       F0      
 3 204PDS_2015-05-21 40319       AnxS0   
 4 204PDS_2015-07-01 40319       F0      
 5 246PDS_2015-07-11 41099       AnxS0   
 6 246PDS_2015-11-03 41099       F0      
 7 260PDS_2015-10-21 42544       F0, PET0
 8 260PDS_2015-12-04 42544       CE1     
 9 261PDS_2015-08-11 41682       AnxS0   
10 261PDS_2015-10-20 41682       F0, PET0
11 271PDS_2015-07-21 41345       AnxS0   
12 271PDS_2015-08-28 41345       F0      
13 280PDS_2015-11-12 43122       F0      
14 280PDS_2015-12-08 43122       CE1     
15 289PDS_2015-08-06 41582       AnxS0   
16 289PDS_2015-11-03 41582       F0, PET0
17 320PDS_2015-08-31 42046       AnxS0   
18 320PDS_2015-10-23 42046       F0        

CC: @tracymelzer

zenourn commented 3 years ago

So this happened when people were in the Anxiety study and had an MRI scan, but then had a full assessment a couple of months later. There was no point in re-scanning, so the scan from the previous session was used. The mri_scan_no here is the MRI scan number field in Alice for each session, but we actually have a much more complete source for MRI data.

If you use mri = import_MRI() you'll get a dataframe with the session_id, scan_no, scan_date, etc. This should not have any duplicates; however, in cases where it links the MRI scan to the AnxS0 session, you might actually want to link it to the F0 session instead, because the F0 session has much more complete data. Things have become even more complicated lately, with MRI scans and cognitive assessments often separated by a huge gap due to COVID-19, when we could scan but not assess people. Given that, you almost want to manually optimise which session each scan links to, based on what assessment data you require and a time-delta penalty.
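
A rough sketch of that kind of custom linking: join each scan to all of its subject's sessions, then keep the closest one, as a stand-in for a proper penalty that would also weight which assessments the candidate session actually has. The subject_id derivation and column names are assumptions about the import_MRI() output, not its documented schema:

library(dplyr)

mri <- import_MRI()

mri %>%
  # Derive the subject from the session_id prefix (assumed format)
  mutate(subject_id = sub("_.*$", "", session_id)) %>%
  inner_join(sessions %>%
               mutate(subject_id = sub("_.*$", "", session_id)),
             by = "subject_id", suffix = c(".scan", "")) %>%
  # Link each scan to the candidate session closest in time
  mutate(gap = abs(as.numeric(scan_date - session_date, units = "days"))) %>%
  group_by(scan_no) %>%
  slice_min(gap, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(scan_no, scan_date, session_id, gap)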

sharrison5 commented 3 years ago

Awesome, thanks! That's great to know how it all fits together, and as you say it sounds like taking scan_date and a bit of custom linking code is the way to go 👍