wadpac / GGIR

Code corresponding to R package GGIR
https://wadpac.github.io/GGIR/
Apache License 2.0

Discrepancy in MVPA between different input methods #1178

Open pinweichen opened 1 month ago

pinweichen commented 1 month ago

Hi there, I was testing the read.myacc.csv functionality of GGIR and noticed a discrepancy in the results for the same file depending on whether I input it as gt3x or as a customized csv. I am building a customized pipeline to organize data from different actigraphy devices with similar preprocessing steps (e.g., resampling, imputation, or calibration).

Here are my two comparisons: (1) I input gt3x files directly into GGIR and get one result. (2) I converted a gt3x file into a CSV using read.gt3x::read.gt3x, then added a one-line header with the information needed by the "rmc." arguments. GGIR ran through on this file and produced a result.

However, I noticed a discrepancy in MVPA values between the two input methods. The intensity is larger when I specify rmc.doresample = T and rmc.check4timegaps = T. If I use the approx function to impute all time gaps myself and turn off rmc.doresample and rmc.check4timegaps, the intensity becomes so small that part 2 reports barely any MVPA.
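For readers following along, custom gap imputation with approx could look roughly like the sketch below. It is only an illustration in base R: the helper name `impute_gaps_linear`, the column names, and the toy data are hypothetical, and it is not necessarily how GGIR handles time gaps internally.

```r
# Hypothetical helper (not part of GGIR): linearly interpolate irregularly
# sampled accelerometer data onto a regular time grid.
impute_gaps_linear <- function(time, acc, sf) {
  # time: numeric UNIX timestamps in seconds
  # acc:  numeric matrix with 3 columns (x, y, z) in g
  # sf:   target sampling frequency in Hz
  grid <- seq(from = time[1], to = time[length(time)], by = 1 / sf)
  out <- sapply(seq_len(ncol(acc)), function(j) {
    # rule = 2 avoids NA at the edges if the grid slightly exceeds the range
    approx(x = time, y = acc[, j], xout = grid, method = "linear", rule = 2)$y
  })
  data.frame(time = grid, x = out[, 1], y = out[, 2], z = out[, 3])
}

# Toy example: 10 Hz signal at rest with a 1-second gap between t = 1 and t = 2
time <- c(seq(0, 1, by = 0.1), seq(2, 3, by = 0.1))
acc <- cbind(x = rep(0, length(time)),
             y = rep(0, length(time)),
             z = rep(1, length(time)))
filled <- impute_gaps_linear(time, acc, sf = 10)
nrow(filled)  # number of samples after filling the gap
```

Linear interpolation across a long gap produces a smooth ramp between the last sample before the gap and the first sample after it, so the imputed stretch contains no movement signal of its own. This is one way the choice of imputation can change epoch-level intensity.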

My question is if there are any calibration or normalization steps that I'm missing that exist for gt3x but not for the customized CSV files. I know there is a CSV input for ActiGraph data. However, I would like to standardize some preprocessing for all data from different actigraphy brands. I want to use the custom CSV input for the GGIR and produce similar results as if I input gt3x directly.

I borrowed the debug issue format here.

To Reproduce: GGIR version 3.0-0. We started the project when this version was available.

  1. Sensor brand: ActiGraph

  2. Data format: customized CSV with imputation steps using approx function

  3. Approximate recording duration 7 days

  4. Are you using a sleep diary to guide sleep detection: NO

  5. Copy of R command used:

  6. I customized some parts of the rmc. functions to fit the header names of my customized csv:

       rmc.firstrow.header = 1,
       rmc.header.length = 1,
       rmc.firstrow.acc = 2, # first row is header
       rmc.col.time = 1,
       rmc.col.acc = 2:4,
       rmc.unit.time = "UNIXsec",
       rmc.headername.sf = "Sampling_frequency",
       rmc.headername.sn = "sensor_type",
       rmc.headername.recordingid = "filename",
       rmc.header.structure = "std",
       rmc.doresample = T,
       rmc.check4timegaps = T

  7. Have you tried processing your data based on GGIR's default argument values? Does the issue you report still appear? Yes, I have. The file runs, but the results are different.

  8. Expected behavior: I'm hoping to get the same GGIR results from the direct gt3x input and the customized csv of the same file.

  9. I provide config files if that helps:

    • The original gt3x input: config_direct_input.csv
    • The customized csv with all time gaps imputed: config_custom_impute.csv
    • The customized csv without custom imputation but with rmc.doresample and rmc.check4timegaps turned on: config_custom_csv_resample_on_timegap_on.csv

I can also provide example data and output folders if you need them.

Thank you very much.

vincentvanhees commented 1 month ago

I assumed that GGIR uses the same code for both data formats.

  1. I am only wondering now whether this recent commit is causing trouble by attempting to look for and impute gaps twice when rmc.imputegaps = TRUE.

  2. Could it be that the difference is in how you created the csv file? In GGIR the gt3x file is read with default read.gt3x arguments https://github.com/wadpac/GGIR/blob/master/R/g.readaccfile.R#L275-L276 and without using the imputezeros or clean options that read.gt3x offers.

  3. The MVPA extraction happens much further down the line and is the same code for all data formats. So, I think that differences must be explained by the raw data itself or by how it is read. It may be good to compare the output of read.myacc.csv() to read.gt3x() output:

    • Do you see any obvious synchronisation problems when you plot the signals on top of each other, e.g. time zone difference?
    • Are time series the same length?

Note that saving numeric data to csv can in itself introduce some tiny rounding errors. If that does not lead to a clear explanation then we may need to try comparing the epoch-level metrics produced by GGIR part 1. To get these, load the .RData file from the meta/basic subfolder of the output folder and inspect the object M$metashort. These are the metric values.
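The epoch-level comparison suggested above could be sketched as follows. This is a minimal base R illustration, not GGIR code: the helper names `load_metashort` and `compare_metric` are hypothetical, the file paths in the comments are placeholders, and it assumes that M$metashort is a data frame with a `timestamp` column and one column per metric (for example `ENMO`).

```r
# Hypothetical helper: load object M from a GGIR part 1 milestone file into
# its own environment so that two files do not overwrite each other.
load_metashort <- function(rdata_path) {
  env <- new.env()
  load(rdata_path, envir = env)
  env$M$metashort
}

# Hypothetical helper: compare one metric between two metashort data frames.
compare_metric <- function(ms_a, ms_b, metric = "ENMO") {
  n <- min(nrow(ms_a), nrow(ms_b))
  d <- ms_a[[metric]][seq_len(n)] - ms_b[[metric]][seq_len(n)]
  list(
    same_length   = nrow(ms_a) == nrow(ms_b),
    same_start    = identical(as.character(ms_a$timestamp[1]),
                              as.character(ms_b$timestamp[1])),
    mean_abs_diff = mean(abs(d), na.rm = TRUE),
    max_abs_diff  = max(abs(d), na.rm = TRUE)
  )
}

# With real output one would call, for example (placeholder paths):
# ms_gt3x <- load_metashort("<outputdir_gt3x>/meta/basic/<file>.RData")
# ms_csv  <- load_metashort("<outputdir_csv>/meta/basic/<file>.RData")

# Toy data standing in for the two metashort objects
ms_gt3x <- data.frame(
  timestamp = c("2024-08-01T00:00:00-0400", "2024-08-01T00:00:05-0400",
                "2024-08-01T00:00:10-0400"),
  ENMO = c(0.010, 0.120, 0.030)
)
ms_csv <- data.frame(
  timestamp = ms_gt3x$timestamp,
  ENMO = c(0.012, 0.125, 0.030)
)
res <- compare_metric(ms_gt3x, ms_csv)
str(res)
```

A different first timestamp would point to a time zone or synchronisation problem, while equal timestamps with a systematic difference in the metric would point to calibration or resampling.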

l-k- commented 1 month ago

@pinweichen have you tried running one of the data files where you see this issue through the latest version of GGIR?

I know you said you need GGIR 3.0-0 for your project, but there have been a few changes to raw data handling since last October when that version came out.

pinweichen commented 1 month ago

Thank you Lena and Vincent for responding. To Vincent's questions:

  1. The duplicated time gap imputation step does not affect the result. The discrepancy still exists when both rmc.imputegaps = FALSE and rmc.doresample = FALSE. I was also testing on GGIR 3.0-0.

2 & 3. I've loaded the raw data in three ways and compared them row by row: (1) the gt3x file using the default gt3x load method, (2) my customized csv using fread, and (3) the csv file using the read.myacc.csv function from GGIR. All three data sets are identical when loaded, and the time series have the same length.

I also compared the part 1 meta results by loading the meta/basic file. I noticed that in the C variable the spheredata are different, and so are the scale and offset. (Original_C = gt3x result. std_C = custom csv result.)

Screenshot 2024-08-08 at 2 56 22 PM Screenshot 2024-08-08 at 2 56 37 PM

I've also tested the version with rmc.imputegaps = TRUE and rmc.doresample = TRUE turned on.

Screenshot 2024-08-08 at 3 07 03 PM

It seems that something in the calibration steps creates these differences. Are the calibration steps for custom csv handled differently than for gt3x? Where can I dig further into the potential cause? Some header information is missing in my csv because it was not read directly from the gt3x file. What information in the header could be important for the calibration steps?
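To illustrate why differing scale and offset values matter for MVPA, here is a small base R sketch. ENMO is the Euclidean norm of the three axes minus 1 g with negative values set to zero, and the thresholds in this thread are expressed in mg. The correction form `(acc + offset) * scale`, the helper names, and all numbers below are assumptions made for illustration and are not taken from the screenshots or confirmed against GGIR's code.

```r
# ENMO in g: Euclidean norm minus one, negative values set to zero
enmo <- function(acc) pmax(sqrt(rowSums(acc^2)) - 1, 0)

# Assumed linear per-axis correction (illustrative form only)
apply_cal <- function(acc, scale, offset) {
  sweep(sweep(acc, 2, offset, "+"), 2, scale, "*")
}

# Three toy samples in g: at rest, slight movement, moderate movement
acc <- matrix(c(0.0, 0.0, 1.00,
                0.0, 0.0, 1.05,
                0.3, 0.4, 1.00), ncol = 3, byrow = TRUE)

# Two made-up sets of calibration coefficients
cal_a <- apply_cal(acc, scale = c(1, 1, 1),    offset = c(0, 0, 0))
cal_b <- apply_cal(acc, scale = c(1, 1, 0.98), offset = c(0, 0, -0.01))

enmo_a_mg <- enmo(cal_a) * 1000
enmo_b_mg <- enmo(cal_b) * 1000

# Which samples exceed a 100 mg MVPA threshold under each calibration?
above_a <- enmo_a_mg > 100
above_b <- enmo_b_mg > 100
print(data.frame(enmo_a_mg, enmo_b_mg, above_a, above_b))
```

In this toy case a change of a few percent in one axis is enough to move a sample from above to below the 100 mg threshold, so different calibration coefficients for the same raw data can plausibly explain a large difference in MVPA.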

To Lena: I haven't adapted to GGIR 3.1-2 yet. I made some changes to the header reading portion of read.myacc.csv in my current pipeline, and I will need to port those changes to version 3.1-2 before I can test the custom csv. However, I have tested the output of read.myacc.csv, and it matches what I get when reading the gt3x file.

Thank you for your time and patience.

vincentvanhees commented 1 month ago

If you insist on using an older version of GGIR then you also have to live with all the bugs and inconsistencies it had.

If you want the issue to be fixed use the latest GGIR version.

If the issue is still present in the latest GGIR version then please clarify as this is currently unclear to me.

pinweichen commented 1 month ago

Hi Vincent and Lena, Thank you for your patience and help.

I adapted my custom csv file to the current version of the read.myacc.csv function in GGIR (v. 3.1-2). With no modification to the package, I ran my csv through GGIR and compared it with the gt3x run. However, the part 1 run became much slower and the same problem persists: the sphere data calibration used much more data than when reading gt3x, and the MVPA is incorrect compared to the gt3x version.

In addition, the new result contains data that extends beyond the length of the original data. The results show that imputation was done even though rmc.imputegaps = FALSE and rmc.doresample = FALSE. This additional data error only occurs in GGIR 3.1-2 and not in GGIR 3.0-0 when the same data is run.

Here is an example of the data (attached) that I fed into GGIR. The data is usually 7 days or 14 days long. I would like to check with you if these header settings and the format are correct. I'm happy to provide the original data if that helps debug this. example_data.csv

Here are the parameter settings:

    GGIR::GGIR(
      verbose = T, nonwear_approach = "2023", mode = c(1),
      datadir = datadir, outputdir = outputdir,
      do.report = c(2,4,5), HASIB.algo = algo_name, Sadeh_axis = Sadeh_axis,

      do.cal = T, do.imp = T, do.enmo = T, do.anglez = T,
      chunksize = 1, do.parallel = T,

      # ------------
      # Custom csv settings
      # ------------
      rmc.firstrow.header = 1, rmc.header.length = 8, rmc.firstrow.acc = 10,
      rmc.col.time = 1, rmc.col.acc = 2:4, rmc.unit.time = "UNIXsec",
      rmc.headername.sf = "sample_frequency",
      rmc.headername.sn = "device_serial_number",
      rmc.headername.recordingid = "subjectID",
      rmc.desiredtz = "EST5EDT",
      rmc.doresample = F, rmc.check4timegaps = F,
      # ------------

      strategy = 1, hrs.del.start = 0, hrs.del.end = 0, maxdur = 0,
      includedaycrit = 1, qwindow = c(0,24), mvpathreshold = c(100),
      bout.metric = 6, excludefirstlast = FALSE, includenightcrit = 1,
      cosinor = TRUE,

      def.noc.sleep = 1, outliers.only = TRUE, criterror = 4, do.visual = T,

      threshold.lig = c(30), threshold.mod = c(100), threshold.vig = c(400),
      boutcriter = 0.8, boutcriter.in = 0.9, boutcriter.lig = 0.8,
      boutcriter.mvpa = 0.8,
      boutdur.in = c(1,10,30), boutdur.lig = c(1,10, 30),
      boutdur.mvpa = c(1, 5, 10),
      includedaycrit.part5 = 1/3,
      iglevels = c(seq(0,4000,by=25),8000),
      qlevels = c(c(1380/1440),c(1410/1440),c(1430/1440)),

      # =====================
      # Visual report
      # =====================
      timewindow = c("WW"), visualreport = T
    )

Here is the report from GGIR 3.1-2 run on the original gt3x file: Report_ggir312_original.gt3x copy.pdf

Here is the report from GGIR 3.1-2 run on the csv version of the same gt3x file: Report_ggir312_no_timegap_noresample.csv.pdf

Thank you again for your help.

Best Regards, Benny

pinweichen commented 1 month ago

Here are the M variables from meta/basic for the original gt3x load-in and the custom csv load-in.

Original load-in: Screenshot 2024-08-11 at 3 51 14 PM

Custom csv load-in: Screenshot 2024-08-11 at 3 51 04 PM