nipy / heudiconv

Flexible DICOM conversion into structured directory layouts
https://heudiconv.readthedocs.io
Other
235 stars 125 forks source link

acq_time filled with n/a when dicoms missing AcquisitionDate (other fields appear valid) #612

Closed psadil closed 1 year ago

psadil commented 1 year ago

Summary

I'm trying to use heudiconv to produce a bids folder from scans collected on a Siemens scanner that was recently upgraded to XA30. It's my impression that, to fill the acq_time column in the *scans.tsv files, heudiconv uses the dicom tag AcquisitionDate (here?). In the scans that I have, that tag isn't available.

Here's a snippet from the logs, which shows a warning about the missing AcquisitionDate

INFO: Doing conversion using dcm2niix
INFO: Converting /corral-secure/projects/A2CPS/shared/psadil/jobs/job-33c8b78b-d685-4d2a-a73c-9dacc991edd4-007-heudiconv-sh20198v1/SH_spectrum_health/bids/SH20198V1/sub-20198/ses-V1/anat/sub-20198_ses-V1_T1w (1 DICOMs) -> /corral-secure/projects/A2CPS/shared/psadil/jobs/job-33c8b78b-d685-4d2a-a73c-9dacc991edd4-007-heudiconv-sh20198v1/SH_spectrum_health/bids/SH20198V1/sub-20198/ses-V1/anat . Converter: dcm2niix . Output types: ('nii.gz', 'dicom')
INFO: [Node] Setting-up "convert" in "/tmp/dcm2niixjn_m8lh7/convert".
INFO: [Node] Executing "convert" <nipype.interfaces.dcm2nii.Dcm2niix>
INFO: stdout 2022-11-21T17:41:42.750467:Chris Rorden's dcm2niiX version v1.0.20220720  GCC10.4.0 x86-64 (64-bit Linux)
INFO: stdout 2022-11-21T17:41:42.750467:Found 1 DICOM file(s)
INFO: stdout 2022-11-21T17:41:42.750467:Convert 1 DICOM as /corral-secure/projects/A2CPS/shared/psadil/jobs/job-33c8b78b-d685-4d2a-a73c-9dacc991edd4-007-heudiconv-sh20198v1/SH_spectrum_health/bids/SH20198V1/sub-20198/ses-V1/anat/sub-20198_ses-V1_T1w_heudiconv941 (256x256x176x1)
INFO: stdout 2022-11-21T17:41:44.310920:Compress: "/opt/conda/envs/v1.0.20220720/bin/pigz" -b 960 -n -f -6 "/corral-secure/projects/A2CPS/shared/psadil/jobs/job-33c8b78b-d685-4d2a-a73c-9dacc991edd4-007-heudiconv-sh20198v1/SH_spectrum_health/bids/SH20198V1/sub-20198/ses-V1/anat/sub-20198_ses-V1_T1w_heudiconv941.nii"
INFO: stdout 2022-11-21T17:41:44.310920:Conversion required 1.592072 seconds (0.079203 for core code).
INFO: [Node] Finished "convert", elapsed time 1.622277s.
WARNING: Failed to get date/time for the content: 'FileDataset' object has no attribute 'AcquisitionDate'
WARNING: Failed to find task field in /corral-secure/projects/A2CPS/shared/psadil/jobs/job-33c8b78b-d685-4d2a-a73c-9dacc991edd4-007-heudiconv-sh20198v1/SH_spectrum_health/bids/SH20198V1/sub-20198/ses-V1/anat/sub-20198_ses-V1_T1w.json.
INFO: Post-treating /corral-secure/projects/A2CPS/shared/psadil/jobs/job-33c8b78b-d685-4d2a-a73c-9dacc991edd4-007-heudiconv-sh20198v1/SH_spectrum_health/bids/SH20198V1/sub-20198/ses-V1/anat/sub-20198_ses-V1_T1w.json file

This is only a warning, and conversion proceeds. The main issue is that the scans.tsv file no longer has usable entries in the acq_time column (rows filled with n/a).

Fortunately, there is an AcquisitionDateTime that could presumably fill the same role, although the formatting may need work.

The fields used in get_dicom_series_time would also work (SeriesDate / SeriesTime).

Here is an example phantom scan with AcquisitionDateTime but not AcquisitionDate: sub-sh_ses-221108_T1w.zip

>>> import pydicom
>>> dcm = pydicom.dcmread('sub-sh_ses-221108_T1w/Physicist_ICCB.MR.Physicist_ABCD.2.1.2022.11.08.13.02.36.481.79844170.dcm')
>>> dcm.get('AcquisitionDateTime')

'20221108115315.110000'

>>> dcm.SeriesDate
'20221108'

>>> dcm.SeriesTime
'115903.545000'

This is perhaps related to #537, although again I'm only seeing a warning and not an error/crash

Platform details:

Choose one:

name: v1.0.20220720
channels:
  - conda-forge
dependencies:
  - python
  - traits>=4.6
  - scipy
  - numpy
  - pandas
  - nomkl
  - nilearn
  - pydicom
  - openpyxl 
  - deepdiff
  - pigz
  - dcm2niix=1.0.20220720
  - git-annex
  - nipype
  - heudiconv=0.11.6 
  - dcmstack
  - sqlalchemy<1.4.0
  - cmake
  - compilers
  - make
  # pybids suff
  - docopt
  - num2words
  - wrapt>=1.0
  - astor>=0.8
  - formulaic<0.4
  - nodejs # bids-validator

psadil/heudiconv:221121 https://hub.docker.com/repository/docker/psadil/heudiconv

yarikoptic commented 1 year ago

Thanks for the thorough report! indeed #537 is related but different (Acquisition vs Series) ;) So, Siemens decided to change to use DateTime for Acquisition but leave as is separate for Series... uff...

I need to find some quality time for heudiconv to address both and possibly look at other locations where access AcquisitionDate. I feel like we likely need some generic def get_datetime_from_dcm(dcm_data, context: str) -> str: where context could be Series or Acquisition and it would check if values available separately from f{context}Date and f{context}Time or from f{context}DateTime and process them accordingly.

May be you would be interested to prepare PR? ;)

psadil commented 1 year ago

Sure, happy to take a look later this week. Though, as a warning, I'll probably need to ping you with design questions :)

psadil commented 1 year ago

For testing, should I include a phantom dicom that was collected with XA30 (or more specifically, one with the AcquisitionDate field)?

Also, I'm wondering about use of the timestamp when storing compressed dicoms. In cases where all datetime information has been stripped from the header, could ti.mtime be left alone? e.g., something that allows for https://github.com/nipy/heudiconv/blob/a5e6718af00dadcd5f055f1a1997aa17ad61b362/heudiconv/dicoms.py#L369-L373 to instead be

def _assign_dicom_time(ti):
    # Reset the date to match the one of the last commit, not from the
    # filesystem since git doesn't track those at all
    if dcm_time is not None:
         ti.mtime = dcm_time
    return ti
yarikoptic commented 1 year ago

For testing, should I include a phantom dicom that was collected with XA30 (or more specifically, one with the AcquisitionDate field)?

I just put a datalad dataset from @jmatthews-rotman (ref: https://github.com/ReproNim/reproin/issues/58) at http://datasets.datalad.org/?dir=/dicoms/jmatthews-rotman/siemens-VA30A-1 . But those seems to have DateTime:

$> dcmdump MRI_102TD_PHA_S.MR.Chen_Matthews_1.15.2.2022.11.16.15.50.20.357.31204783.dcm | grep '\<AcquisitionDate'
(0008,002a) DT [20221116142556.992500]                  #  22, 1 AcquisitionDateTime

do you have some shareable dicoms -- I could also share them from datasets.datalad.org (we don't want to "abuse" git in this repo)?

for the 2nd question -- we need to figure out smth to make it reproducible, most likely some other date/time from dicoms. Anything in mind?

psadil commented 1 year ago

I'm not following why files from that dataset wouldn't work -- don't they match this issue of not having the AcquisitionDate field?

$ dcmdump MRI_102TD_PHA_S.MR.Chen_Matthews_1.15.2.2022.11.16.15.50.20.357.31204783.dcm | grep -P '(?<!Frame)*(Date|Time)$'
(0008,0012) DA [20221116]                               #   8, 1 InstanceCreationDate
(0008,0013) TM [141629.732500]                          #  14, 1 InstanceCreationTime
(0008,0020) DA [20221116]                               #   8, 1 StudyDate
(0008,0021) DA [20221116]                               #   8, 1 SeriesDate
(0008,0023) DA [20221116]                               #   8, 1 ContentDate
(0008,002a) DT [20221116141629.732500]                  #  22, 1 AcquisitionDateTime
(0008,0030) TM [141602.100000]                          #  14, 1 StudyTime
(0008,0031) TM [141647.350000]                          #  14, 1 SeriesTime
(0008,0033) TM [141648.517000]                          #  14, 1 ContentTime
(0010,0030) DA [19000101]                               #   8, 1 PatientBirthDate
(0040,0244) DA [20221116]                               #   8, 1 PerformedProcedureStepStartDate
(0040,0245) TM [141602.100000]                          #  14, 1 PerformedProcedureStepStartTime
(0040,0250) DA [20221116]                               #   8, 1 PerformedProcedureStepEndDate
        (0018,0080) DS [3.15]                                   #   4, 1 RepetitionTime
        (0018,9082) FD 1.3700000000000001                       #   8, 1 EffectiveEchoTime
        (0018,9074) DT [20221116141629.732500]                  #  22, 1 FrameAcquisitionDateTime
        (0018,9151) DT [20221116141629.732500]                  #  22, 1 FrameReferenceDateTime
        (0018,9082) FD 1.3700000000000001                       #   8, 1 EffectiveEchoTime
        (0018,9074) DT [20221116141629.732500]                  #  22, 1 FrameAcquisitionDateTime
        (0018,9151) DT [20221116141629.732500]                  #  22, 1 FrameReferenceDateTime
        (0018,9082) FD 1.3700000000000001                       #   8, 1 EffectiveEchoTime
        (0018,9074) DT [20221116141629.732500]                  #  22, 1 FrameAcquisitionDateTime
        (0018,9151) DT [20221116141629.732500]                  #  22, 1 FrameReferenceDateTime

This seems comparable to the case that I was working with, which is one where the files don't have AcquisitionDate but do have AcquisitionDateTime (and also SeriesDate + SeriesTime).

If that's true, then how about one of the small ones, like MRI_102TD_PHA_S.MR.Chen_Matthews_1.4.1.2022.11.16.15.50.20.357.31204552.dcm ?

For assigning mtime reproducibly, does the time need to be plausible? My understanding of #537 was a case where the dicoms were missing all dates/times due to stripping during anonymization (e.g., https://github.com/nipy/heudiconv/issues/537#issuecomment-1045640009). Reproducibility could still be achieved by assigning mtime to some constant (e.g., 1970, the date of the first release of datalad, etc), or some wacky value that's specific to the dicom like a hash of a UID interpreted as seconds

yarikoptic commented 1 year ago

I think I am in rush mixed up what was missing ;) smaller one is good.

re mtime -- we would be sorting by it , so must not be the same value since can't sort them reliably if they all have the same value. Could be some UUID I guess.

psadil commented 1 year ago

I'm seeing

% grep -ri mtime heudiconv
heudiconv/dicoms.py:    """Get integer relating to dicom_list for setting mtimes reproducibly 
heudiconv/dicoms.py:        ti.mtime = dcm_time
heudiconv/cli/monitor.py:    # time.time() - os.path.getmtime(paths2process[0]) > WAIT_TIME:

That lack of many instances of mtime has me wondering where it would be used. Could you expand on the concern about sorting?

yarikoptic commented 1 year ago

from a quick grep for sorted with 47 hits, none look related besides "similar in spirit" but not directly affected https://github.com/nipy/heudiconv/blob/HEAD/heudiconv/heuristics/multires_7Tbold.py#L28 I guess. But overall - pretending to have time, which now sorts in some other order, seems very misleading to me - might be quite a rabbit hole someone would need to track down happen to run into something odd with such approach in their case.

psadil commented 1 year ago

That does sound like confusion worth avoiding.

I'd suggest pulling something like AcquisitionDateTime or AcquisitionTime, but the fields that I'm seeing are only "Conditionally Required (empty if unknown)" at best, or "Optional" at worst.

Am I correct in thinking that mtime is (re)set in the first place for reasons related to datalad? I think knowing why it needs to be set (and reproducibly so) could be helpful. Could reproducible mtime in any way be designated as a "feature" that relies on the existence of a particular set of fields in the DICOM header (specifically, fields that might not be present in every valid DICOM)?

yarikoptic commented 1 year ago

Am I correct in thinking that mtime is (re)set in the first place for reasons related to datalad?

AFAIK not at all. If anything, datalad with its --fake-dates fakes commit dates ;)

I think knowing why it needs to be set (and reproducibly so) could be helpful.

So that rerunning conversion doesn't produce unexpected difference, especially for large files such as dicom tarballs we are storing within sourcedata/. That is the code you saw in https://github.com/nipy/heudiconv/issues/612#issuecomment-1428732715 and the code comment says it

    dcm_time = get_dicom_series_time(dicom_list)

    def _assign_dicom_time(ti):
        # Reset the date to match the one from the dicom, not from the
        # filesystem so we could sort reproducibly
        ti.mtime = dcm_time
        return ti

Another place where date/timestamps are used -- are populated in _scans.tsv files. There I guess a lesser issue, in particular since we already add random string there.

Could reproducible mtime in any way be designated as a "feature" that relies on the existence of a particular set of fields in the DICOM header (specifically, fields that might not be present in every valid DICOM)?

may be, may be ... but again -- where do we have a use case where there is completely no date and time available - do you have an example? above we talked about where it just becomes two separate fields (for Date and Time). If no Date available, we can just choose date which BIDS assumes for annonymization if dates are prior 1925 so year could be produced reproducibly from smth like Jan 1 1900 + hash(UID) % (24*365) days . Indeed we would loose sortability (no longer sure if it is valuable) but date would be clearly anonymized, and we use correct time. But then I am not sure if tar would not start puking on those dates since that is before the epoch. So when storing to tar we need to offset by those 70 years, uff.

psadil commented 1 year ago

Indeed we would lose sortability (no longer sure if it is valuable)

Allowing loss of sortability would be easier. Though, ofc, I don't want to accidentally cause some unexpected issue in code/workflows that rely on this behavior.

where do we have a use case where there is completely no date and time available - do you have an example?

That was my understanding of the related case, #537 (specifically, this comment)

Though, I suppose, even in cases of anonymization, AcquisitionNumber (0020,0012) is likely to be preserved. If mtime is allowed to be based on something that is a bit arbitrary, then perhaps it would be sensible to set mtime based on AcquisitionNumber?

To avoid issues with epoch/tar that you mention, the proposal would be

  1. In cases where a date/time/datetime is available, use that (current behavior)
  2. In cases where a date/time/datetime is not available, use (something like) Jan 1 1970 + AcquisitionNumber (in years) (try to anticipate cases of anonymization removing dates)
  3. If 1 and 2 are not possible, then error and provide message about missing information (alternatively, as a last resort rely on setting a totally fake date based on hash(UID) ?)

Another place where date/timestamps are used -- are populated in _scans.tsv files. There I guess a lesser issue, in particular since we already add random string there.

Thanks for bringing this up. I had been assuming that setting mtime would be separate from values in acq_time -- that when no date/datetime is available from the DICOM header then acq_time would go to n/a, as happens now. Just to be clear, do you think it would be better for the value in mtime to match the value in theacq_time column?

yarikoptic commented 1 year ago
  1. If 1 and 2 are not possible, then error and provide message about missing information (alternatively, as a last resort rely on setting a totally fake date based on hash(UID) ?)

ok

Just to be clear, do you think it would be better for the value in mtime to match the value in theacq_time column?

if it is now -- then there is slight preference to keep it as that, but i do not really see such a need. those tar mtimes AFAIK are really just for reproducible tar generation.

psadil commented 1 year ago

Great, thanks for working through all this. I should be able to update the pull request by the end of the week, including elements in this discussion and the other requested revisions (e.g., python version, docstring style, etc)

github-actions[bot] commented 1 year ago

:rocket: Issue was released in v0.13.0 :rocket: