How is the statistical analysis conducted?

jcohenadad commented 4 months ago

While working on #41, I stumbled across the question of how to format the output CSVs. More specifically: which columns of the CSV will be used, and how are the statistics computed: across subjects, or across levels?

For example, the following CSV file DWI_FA_51_aggregated.csv (resulting from a few subjects), for the FA in the WM (label 51) can produce the following violin plot:

[Figure: violin_plot]

In this CSV file, the column "Filename" was replaced by the column "Subject", because chunks have to be aggregated per subject (#41). Should all CSV files (i.e., including those for the non-DWI metrics) follow the same rule?

More details needed from @Kaonashi22

Kaonashi22 commented 4 months ago

Thanks, @jcohenadad! I will do statistical analyses between groups (patients and healthy controls) first using the average across all vertebral levels, and then specifically by vertebral level. The output csv file looks good. I would also keep the "label" column and add a row with the average across all vertebral levels for each subject. We can follow the same rule for all output files (including the non-DWI).

jcohenadad commented 4 months ago

I would also keep the "label" column

That's easy to do.

and add a row with the average across all vertebral levels for each subject.

That's more work and conflicts with the current logic of the CSV file: each row currently represents a vertebral level. Adding a column for the average across levels means that

* the column 'vertebral level' is undetermined for the row=average
* the column 'average across level' is undetermined for the row=not_average

Moreover, it is not common practice to insert redundant information within a CSV file. A better practice is to write code that reads the source CSV file and generates the desired statistics (e.g., average across levels, STD, median, min/max).

jcohenadad commented 4 months ago

df02c23afd4bb8273cc8cb551dbc16940e30ac3d now outputs the following table for DWI scans:

Subject	VertLevel	Label	Size [vox]	MAP()	STD()	Timestamp	Filename	SCT Version
sub-BB277	2	white matter	80.68	0.6483	0.0993	2024-06-18 13:13:20	/path/to/data/sub-BB277_chunk-1_DWI_moco_FA	git-jca/4527-register-template-step0-517cc
sub-BB277	3	white matter	93.12	0.6232	0.1262	2024-06-18 13:13:20	/path/to/data/sub-BB277_chunk-1_DWI_moco_FA	git-jca/4527-register-template-step0-517cc
sub-BB277	4	white matter	100.70	0.6736	0.1203	2024-06-18 13:13:20	/path/to/data/sub-BB277_chunk-1_DWI_moco_FA	git-jca/4527-register-template-step0-517cc
sub-BB277	5	white matter	108.99	0.5660	0.1235	2024-06-18 13:13:20	/path/to/data/sub-BB277_chunk-1_DWI_moco_FA	git-jca/4527-register-template-step0-517cc
sub-BB277	6	white matter	65.68	0.5304	0.1313	2024-06-18 13:13:20	/path/to/data/sub-BB277_chunk-1_DWI_moco_FA	git-jca/4527-register-template-step0-517cc

@Kaonashi22 would you like to apply this format to the other metrics as well? If so:

* should we overwrite the original CSV files
* or should we create new CSV files with another suffix (e.g. `_formatted`)

jcohenadad commented 4 months ago

Idea proposed in https://github.com/sct-pipeline/spine-park/issues/42#issuecomment-2200754999 is implemented in 14f074b2cb64027037a8c47d0f05bc2eb0f93354.

Currently testing... will upload the output results/ folder for your approbation @Kaonashi22

here it is: results.zip

Kaonashi22 commented 4 months ago

That's more work and conflicts with the current logic of the CSV file: each row currently represents a vertebral level. Adding a column for the average across levels means that

* the column 'vertebral level' is undetermined for the row=average
* the column 'average across level' is undetermined for the row=not_average

Moreover, it is not common practice to insert redundant information within a CSV file. A better practice is to write code that reads the source CSV file and generates the desired statistics (e.g., average across levels, STD, median, min/max).

OK, then I will compute the mean across levels separately
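
For reference, here is a minimal sketch of what that separate computation could look like. It assumes the output is tab-separated and uses the column names from the table above (`Subject`, `Label`, `MAP()`); the function name `summarize_across_levels` is made up for illustration and is not part of the pipeline. The sample rows are copied from that table. Only the standard library is used, so this is not necessarily how the final analysis script will do it.

```python
import csv
import io
import statistics
from collections import defaultdict

# Rows copied from the table posted above (only the columns needed here).
# Assumption: the real file is tab-separated with these exact column names.
SAMPLE_TSV = (
    "Subject\tVertLevel\tLabel\tMAP()\n"
    "sub-BB277\t2\twhite matter\t0.6483\n"
    "sub-BB277\t3\twhite matter\t0.6232\n"
    "sub-BB277\t4\twhite matter\t0.6736\n"
    "sub-BB277\t5\twhite matter\t0.5660\n"
    "sub-BB277\t6\twhite matter\t0.5304\n"
)


def summarize_across_levels(handle, metric_col="MAP()"):
    """Return {(subject, label): stats} computed across vertebral levels.

    Hypothetical helper: each input row is one vertebral level, and the
    statistics are unweighted (every level counts equally).
    """
    values = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        values[(row["Subject"], row["Label"])].append(float(row[metric_col]))

    summary = {}
    for key, vals in values.items():
        summary[key] = {
            "n_levels": len(vals),
            "mean": statistics.mean(vals),
            # Sample standard deviation needs at least two levels.
            "std": statistics.stdev(vals) if len(vals) > 1 else float("nan"),
            "median": statistics.median(vals),
            "min": min(vals),
            "max": max(vals),
        }
    return summary


summary = summarize_across_levels(io.StringIO(SAMPLE_TSV))
for (subject, label), stats in summary.items():
    print(subject, label, {name: round(value, 4) for name, value in stats.items()})
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` wrapper. Note that the mean here is a plain average over levels; whether it should instead be weighted by `Size [vox]` is a separate decision that this sketch does not make.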

Kaonashi22 commented 4 months ago

* should we overwrite the original CSV files

* or should we create new CSV file with another suffix (eg `_formatted`)

We can overwrite the previous files and only keep the formatted ones

Kaonashi22 commented 4 months ago

Idea proposed in #42 (comment) is implemented in 14f074b.

Currently testing... will upload the output results/ folder for your approbation @Kaonashi22

here it is: results.zip

The presentation is good, thanks a lot!

jcohenadad commented 4 months ago

feature implemented

jcohenadad commented 4 months ago

We can overwrite the previous files and only keep the formatted ones

Sorry, I missed that. Do you still need it, or can your statistics script handle the `_formatted` and `_aggregated` suffixes?
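
In case it helps, picking up both kinds of files from an analysis script could look roughly like the sketch below. The two suffixes are the ones mentioned in this thread; the helper name `find_metric_csvs` and the idea of scanning a results folder recursively are assumptions for illustration only.

```python
from pathlib import Path

# Suffixes mentioned in this thread; the folder layout is an assumption.
SUFFIXES = ("_formatted", "_aggregated")


def find_metric_csvs(results_dir):
    """Map each suffix to the sorted list of CSV files whose name ends with it.

    Hypothetical helper: CSV files without one of the known suffixes
    (e.g. the original, unformatted outputs) are ignored.
    """
    found = {suffix: [] for suffix in SUFFIXES}
    for path in sorted(Path(results_dir).rglob("*.csv")):
        for suffix in SUFFIXES:
            # path.stem is the file name without the ".csv" extension.
            if path.stem.endswith(suffix):
                found[suffix].append(path)
    return found
```

Calling `find_metric_csvs("results")` would then return, for example, `DWI_FA_51_aggregated.csv` under the `_aggregated` key if that file is present in the folder.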

Kaonashi22 commented 4 months ago

I actually don't need that because I'll rerun the analysis from scratch

