Create a git-annex repo for data from Karolinska

plbenveniste commented 1 month ago

The data is stored in duke:mri/karo/20200612_longitudinal. The steps are the following:

[x] A git-annex repo should be created.
[x] The data should be BIDSified.
[x] The data should be pushed to the repo.
[x] A PR should be opened.
[ ] The PR should be merged.

@jcohenadad What name should be used for the git-annex repo? ms-karolinska?

@mguaypaq Could you create the corresponding git-annex repo?

This is related to issue 76. I am creating this issue here to centralize the work on the MS dataset.

jcohenadad commented 1 month ago

> What name should be used for the git-annex repo? ms-karolinska?

Given that Karolinska has contributed or will contribute to multiple datasets from different studies, I think the name needs to identify the study. E.g., this one could be called ms-karolinska-2020.

plbenveniste commented 1 month ago

Thanks @jcohenadad!

I am currently working on the bidsification of the data and facing a few issues. I am following the dcm2bids tutorial and ran the following commands:

conda activate dcm2bids
mkdir bids_karo
dcm2bids_scaffold -o bids_karo

The config file I created is the following (feedback on the chosen suffixes is welcome). It is stored in bids_karo/code.

{
  "descriptions": [
    {
      "datatype": "anat",
      "suffix": "acq-MPRsag_T1w",
      "criteria": {
        "SeriesDescription": "t1_mpr_ns_sag_1mm_iso"
      } 
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagDF_T2w",
      "criteria": {
        "SeriesDescription": "t2_space_dark-fluid_sag_REK_tra_3mm"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagDF_T2w",
      "criteria": {
        "SeriesDescription": "t2_space_dark-fluid_sag_REK_3mm_tra"
      }
    },

    {
      "datatype": "anat",
      "suffix": "acq-tseSag_T2w",
      "criteria": {
        "SeriesDescription": "t2_tse_sag_MS"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagP2_T2w",
      "criteria": {
        "SeriesDescription": "t2_space_sag_p2_iso_REK_tra_3mm"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagP2_T2w",
      "criteria": {
        "SeriesDescription": "t2_space_sag_p2_iso_REK_3mm_tra"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-me2d_T2w",
      "criteria": {
        "SeriesDescription": "t2_me2d_tra_p2_3mm"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-tse_T2w",
      "criteria": {
        "SeriesDescription": "t2_tse_tra"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-MPRsag_T1w",
      "criteria": {
        "SeriesDescription": "t1_mpr_ns_sag_1mm_iso_REK_1mm_tra"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-MPRsag_T1w",
      "criteria": {
        "SeriesDescription": "t1_mpr_ns_sag_1mm_iso_MPR_3mm_tra"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-MPRsagDF_T2w",
      "criteria": {
        "SeriesDescription": "t2_space_dark-fluid_sag_MPR_3mm_tra"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-MPRsagP2_T2w",
      "criteria": {
        "SeriesDescription": "t2_space_sag_p2_iso_MPR_3mm_tra"
      }
    }    
  ]
}

I created the following bash script to convert the dcm images to BIDS:

#!/bin/bash

# Check if the correct number of arguments is provided
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 path/to/config.json path/to/output_dir path/to/dicom"
    exit 1
fi

# Get the config file, output directory, and DICOM directory from the command line arguments
config_file="$1"
output_dir="$2"
dicom_path="$3"

# Iterate over each folder in the DICOM directory
for folder in "$dicom_path"/SW1-*; do
  # Check if it is a directory
  if [ -d "$folder" ]; then
    # Extract participant and session info from the folder name
    # The folder name is */SW1-1773_M0: the participant should be 1773 and the session M0
    subfolder="${folder##*/}"
    # Participant is the number after SW1- and before _
    participant="${subfolder#SW1-}"
    participant="${participant%%_*}"
    # Session is everything after the last _ (e.g. M0 or M12)
    session="${subfolder##*_}"

    echo "Converting participant $participant session $session"

    # Define the DICOM directory
    dicom_dir="$folder"

    # Run dcm2bids
    dcm2bids -d "$dicom_dir" -p "$participant" -s "$session" -c "$config_file" -o "$output_dir" --bids_validate
  fi
done

echo "All conversions are done."
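The folder-name parsing in the script above can be checked separately from the conversion itself. The following is a minimal Python sketch, assuming every folder follows the SW1-<participant>_<session> pattern described in the script's comments; parse_folder_name is a hypothetical helper, not part of dcm2bids. Unlike the bash version, it returns None for names that do not fit, so unexpected folders can be listed before a long conversion run.

```python
def parse_folder_name(name):
    """Split a folder name such as 'SW1-1773_M0' into (participant, session).

    Mirrors the bash parameter expansions: the participant is the text between
    'SW1-' and the first underscore, the session is the text after the last
    underscore. Returns None when the name does not fit (an added safeguard).
    """
    prefix = "SW1-"
    if not name.startswith(prefix) or "_" not in name:
        return None
    participant = name[len(prefix):].split("_", 1)[0]
    session = name.rsplit("_", 1)[1]
    # BIDS entity labels must be alphanumeric, so reject anything else
    if not (participant.isalnum() and session.isalnum()):
        return None
    return participant, session


if __name__ == "__main__":
    for folder in ["SW1-1773_M0", "SW1-1875_M12", "notes"]:
        print(folder, "->", parse_folder_name(folder))
```

Running it over the folder names before calling dcm2bids gives a quick list of any folder that would produce an unexpected participant or session label.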

The script was run using the following command:

bids_karo/code/convert_dcm2bids.sh bids_karo/code/dcm2bids_config.json  bids_karo/ 20200612_longitudinal/Karolinska_data.1/

However, some files lack the SeriesDescription field, so none of the descriptions matched them, as the output shows:

INFO    | SIDECAR PAIRING
INFO    | No Pairing  <-  001_SW1-1875_M12_0_i00001
INFO    | No Pairing  <-  001_SW1-1875_M12_0_i00004
INFO    | No Pairing  <-  002_SW1-1875_M12_0
INFO    | No Pairing  <-  002_SW1-1875_M12_0a
INFO    | No Pairing  <-  003_SW1-1875_M12_0
INFO    | No Pairing  <-  003_SW1-1875_M12_0a
INFO    | No Pairing  <-  004_SW1-1875_M12_0
WARNING | NO PAIRING WAS FOUND. BIDS FOLDER "BIDS_KARO/SUB-1875/SES-M12" WON'T BE CREATED. CHECK YOUR CONFIG FILE.

You can find these files and logs in the following folder: duke/temp/plben/create_karo_gitannex/bids_karo/tmp_dcm2bids

What should I do? How should I modify my config file to work with these files? (The output above is just one example; other sessions failed in the same way.)

@jcohenadad @valosekj @NathanMolinier Any ideas?
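One way to see which fields could serve as criteria is to tabulate what the sidecars actually contain. The following is a minimal Python sketch, assuming the JSON sidecars written during conversion sit somewhere under the temporary folder mentioned above; survey_sidecars and its default key list are hypothetical choices for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def survey_sidecars(sidecar_dir, keys=("SeriesDescription", "SequenceName", "ProtocolName")):
    """Count the values of the selected fields across all JSON sidecars under sidecar_dir."""
    counts = {key: Counter() for key in keys}
    for path in sorted(Path(sidecar_dir).rglob("*.json")):
        with open(path) as handle:
            sidecar = json.load(handle)
        for key in keys:
            # Record absent fields explicitly so gaps are visible in the tally
            counts[key][str(sidecar.get(key, "<missing>"))] += 1
    return counts


if __name__ == "__main__":
    # Path is an assumption; point it at the folder holding the sidecars
    for key, counter in survey_sidecars("bids_karo/tmp_dcm2bids").items():
        print(key)
        for value, n in counter.most_common():
            print(f"  {n:4d}  {value}")
```

A field that is present in every sidecar and takes a distinct value per sequence is a reasonable candidate for the "criteria" entries.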

plbenveniste commented 1 month ago

After some investigation, I found that the "SequenceName" field can be used as the matching criterion instead. The updated config, with the resulting suffixes, is the following:

{
  "descriptions": [
    {
      "datatype": "anat",
      "suffix": "acq-sagMprage_T1w",
      "criteria": {
        "SequenceName": "*tfl3d1_16ns"
      } 
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagTse_T2w",
      "criteria": {
        "SequenceName": "*tseR2d1rr19"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-me2d_T2w",
      "criteria": {
        "SequenceName": "*me2d1r4"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-Tse_T2w",
      "criteria": {
        "SequenceName": "*tseR2d1rs17"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagMprageDf_T2w",
      "criteria": {
        "SequenceName": "*spcir_278ns"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sagMprageP2_T2w",
      "criteria": {
        "SequenceName": "*spcR_282ns"
      }
    },
    {
      "datatype": "anat",
      "suffix": "localiser",
      "criteria": {
        "SequenceName": "*fl2d1"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sag_T1w",
      "criteria": {
        "SequenceName": "*spcir_257ns"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-epB0",
      "criteria": {
        "SequenceName": "*ep_b0"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-epB01000",
      "criteria": {
        "SequenceName": "*ep_b0_1000"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-epB1000t",
      "criteria": {
        "SequenceName": "*ep_b1000t"
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-cor_T1w",
      "criteria": {
        "SequenceName": "*h2d1_205",
        "ImageOrientationPatientDICOM": [1,0,0,0,0,-1]
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-sag_T1w",
      "criteria": {
        "SequenceName": "*h2d1_205",
        "ImageOrientationPatientDICOM": [0,1,0,0,0,-1]
      }
    },
    {
      "datatype": "anat",
      "suffix": "acq-ax_T1w",
      "criteria": {
        "SequenceName": "*h2d1_205",
        "ImageOrientationPatientDICOM": [1,0,0,0,1,0]
      }
    }
  ]
}

This should cover every case in the dataset.

Only the following files were not transferred, because they did not look relevant:
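Whether the config covers every case can be checked against the sidecars before rerunning the conversion. The following is a rough Python sketch, assuming string criteria are compared with shell-style wildcards (as the dcm2bids documentation describes) and all other values by plain equality; matching_descriptions and coverage_report are hypothetical helpers and only approximate what dcm2bids does internally.

```python
import json
from fnmatch import fnmatchcase
from pathlib import Path


def criterion_matches(expected, actual):
    """Strings use shell-style wildcards; other values use equality (an assumption)."""
    if isinstance(expected, str):
        return isinstance(actual, str) and fnmatchcase(actual, expected)
    return expected == actual


def matching_descriptions(sidecar, descriptions):
    """Return the descriptions whose criteria are all satisfied by the sidecar."""
    return [
        description
        for description in descriptions
        if all(
            criterion_matches(expected, sidecar.get(key))
            for key, expected in description["criteria"].items()
        )
    ]


def coverage_report(config_path, sidecar_dir):
    """List sidecars matched by no description, or by more than one."""
    descriptions = json.loads(Path(config_path).read_text())["descriptions"]
    report = {"unmatched": [], "ambiguous": []}
    for path in sorted(Path(sidecar_dir).rglob("*.json")):
        sidecar = json.loads(path.read_text())
        hits = matching_descriptions(sidecar, descriptions)
        if not hits:
            report["unmatched"].append(path.name)
        elif len(hits) > 1:
            report["ambiguous"].append(path.name)
    return report
```

Calling coverage_report on the config file and the sidecar folder gives two lists to inspect: sidecars that no description matches and sidecars that several descriptions match. Orientation values stored as floats may differ slightly from the integers in the config, so an "unmatched" entry for the *h2d1_205 sequences would need a manual check.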

100_SW1-2128_M12_0a_ROI1.nii.gz
101_SW1-2128_M12_0_ROI1.nii.gz
103_SW1-2128_M12_0a_ROI1.nii.gz
104_SW1-2128_M12_0_ROI1.nii.gz
106_SW1-2128_M12_0a_ROI1.nii.gz
107_SW1-2128_M12_0_ROI1.nii.gz
100_SW1-2128_M12_0_ROI1.nii.gz
102_SW1-2128_M12_0a_ROI1.nii.gz
103_SW1-2128_M12_0_ROI1.nii.gz
105_SW1-2128_M12_0a_ROI1.nii.gz
106_SW1-2128_M12_0_ROI1.nii.gz
108_SW1-2128_M12_0a_ROI1.nii.gz
101_SW1-2128_M12_0a_ROI1.nii.gz
102_SW1-2128_M12_0_ROI1.nii.gz
104_SW1-2128_M12_0a_ROI1.nii.gz
105_SW1-2128_M12_0_ROI1.nii.gz
107_SW1-2128_M12_0a_ROI1.nii.gz
108_SW1-2128_M12_0_ROI1.nii.gz

Feedback on the chosen conventions would be appreciated.

plbenveniste commented 1 month ago

The files contained in the 4 folders (Karolinska_data.1, Karolinska_data.2, Karolinska_data.3 and Karolinska_data.4) were BIDSified using the following commands:

bids_karo/code/convert_dcm2bids.sh bids_karo/code/dcm2bids_config.json  bids_karo/ 20200612_longitudinal/Karolinska_data.1
bids_karo/code/convert_dcm2bids.sh bids_karo/code/dcm2bids_config.json  bids_karo/ 20200612_longitudinal/Karolinska_data.2
bids_karo/code/convert_dcm2bids.sh bids_karo/code/dcm2bids_config.json  bids_karo/ 20200612_longitudinal/Karolinska_data.3
bids_karo/code/convert_dcm2bids.sh bids_karo/code/dcm2bids_config.json  bids_karo/ 20200612_longitudinal/Karolinska_data.4

The metadata was added using the script code/add_dataset_metadata.py, which reads it from the file 20200612_longitudinal/Karolinska_data_exported_2020.06.12\ .xlsx.

Everything is done and stored in /home/GRAMES.POLYMTL.CA/p119007/create_karo_gitannex/bids_karo.

Waiting for review of the conventions and the creation of the git-annex repo.

mguaypaq commented 3 weeks ago

I created the repo and gave @plbenveniste write access: https://data.neuro.polymtl.ca/datasets/ms-karolinska-2020

plbenveniste commented 3 weeks ago

The .json config file was modified so that DWI files are stored under /dwi rather than /anat. Also, the localizers were acquired as T1w images, so the contrast was added to their file names.

plbenveniste commented 3 weeks ago

The data was copied from the folder on romane to the git-annex folder using the following command:

cp -a bids-karo/. ms-karolinska-2020/

Unneeded files (such as tmp_dcm2bids) were removed. The changes were then committed and pushed to the remote branch.

Now ready for review!

mguaypaq commented 1 week ago

I left some review comments on the pull request: https://data.neuro.polymtl.ca/datasets/ms-karolinska-2020/pulls/1

ivadomed / ms-lesion-agnostic

Create a git-annex repo for data from Karolinska #16