Closed psadil closed 1 year ago
Thanks for the thorough report! indeed #537 is related but different (Acquisition
vs Series
) ;) So, Siemens decided to change to use DateTime for Acquisition but leave as is separate for Series... uff...
I need to find some quality time for heudiconv to address both and possibly look at other locations where access AcquisitionDate
. I feel like we likely need some generic def get_datetime_from_dcm(dcm_data, context: str) -> str:
where context
could be Series
or Acquisition
and it would check if values available separately from f{context}Date
and f{context}Time
or from f{context}DateTime
and process them accordingly.
May be you would be interested to prepare PR? ;)
Sure, happy to take a look later this week. Though, as a warning, I'll probably need to ping you with design questions :)
For testing, should I include a phantom dicom that was collected with XA30 (or more specifically, one with the AcquisitionDate
field)?
Also, I'm wondering about use of the timestamp when storing compressed dicoms. In cases where all datetime information has been stripped from the header, could ti.mtime
be left alone? e.g., something that allows for https://github.com/nipy/heudiconv/blob/a5e6718af00dadcd5f055f1a1997aa17ad61b362/heudiconv/dicoms.py#L369-L373 to instead be
def _assign_dicom_time(ti):
# Reset the date to match the one of the last commit, not from the
# filesystem since git doesn't track those at all
if dcm_time is not None:
ti.mtime = dcm_time
return ti
For testing, should I include a phantom dicom that was collected with XA30 (or more specifically, one with the
AcquisitionDate
field)?
I just put a datalad dataset from @jmatthews-rotman (ref: https://github.com/ReproNim/reproin/issues/58) at http://datasets.datalad.org/?dir=/dicoms/jmatthews-rotman/siemens-VA30A-1 . But those seems to have DateTime
:
$> dcmdump MRI_102TD_PHA_S.MR.Chen_Matthews_1.15.2.2022.11.16.15.50.20.357.31204783.dcm | grep '\<AcquisitionDate'
(0008,002a) DT [20221116142556.992500] # 22, 1 AcquisitionDateTime
do you have some shareable dicoms -- I could also share them from datasets.datalad.org (we don't want to "abuse" git in this repo)?
for the 2nd question -- we need to figure out smth to make it reproducible, most likely some other date/time from dicoms. Anything in mind?
I'm not following why files from that dataset wouldn't work -- don't they match this issue of not having the AcquisitionDate
field?
$ dcmdump MRI_102TD_PHA_S.MR.Chen_Matthews_1.15.2.2022.11.16.15.50.20.357.31204783.dcm | grep -P '(?<!Frame)*(Date|Time)$'
(0008,0012) DA [20221116] # 8, 1 InstanceCreationDate
(0008,0013) TM [141629.732500] # 14, 1 InstanceCreationTime
(0008,0020) DA [20221116] # 8, 1 StudyDate
(0008,0021) DA [20221116] # 8, 1 SeriesDate
(0008,0023) DA [20221116] # 8, 1 ContentDate
(0008,002a) DT [20221116141629.732500] # 22, 1 AcquisitionDateTime
(0008,0030) TM [141602.100000] # 14, 1 StudyTime
(0008,0031) TM [141647.350000] # 14, 1 SeriesTime
(0008,0033) TM [141648.517000] # 14, 1 ContentTime
(0010,0030) DA [19000101] # 8, 1 PatientBirthDate
(0040,0244) DA [20221116] # 8, 1 PerformedProcedureStepStartDate
(0040,0245) TM [141602.100000] # 14, 1 PerformedProcedureStepStartTime
(0040,0250) DA [20221116] # 8, 1 PerformedProcedureStepEndDate
(0018,0080) DS [3.15] # 4, 1 RepetitionTime
(0018,9082) FD 1.3700000000000001 # 8, 1 EffectiveEchoTime
(0018,9074) DT [20221116141629.732500] # 22, 1 FrameAcquisitionDateTime
(0018,9151) DT [20221116141629.732500] # 22, 1 FrameReferenceDateTime
(0018,9082) FD 1.3700000000000001 # 8, 1 EffectiveEchoTime
(0018,9074) DT [20221116141629.732500] # 22, 1 FrameAcquisitionDateTime
(0018,9151) DT [20221116141629.732500] # 22, 1 FrameReferenceDateTime
(0018,9082) FD 1.3700000000000001 # 8, 1 EffectiveEchoTime
(0018,9074) DT [20221116141629.732500] # 22, 1 FrameAcquisitionDateTime
(0018,9151) DT [20221116141629.732500] # 22, 1 FrameReferenceDateTime
This seems comparable to the case that I was working with, which is one where the files don't have AcquisitionDate
but do have AcquisitionDateTime
(and also SeriesDate
+ SeriesTime
).
If that's true, then how about one of the small ones, like MRI_102TD_PHA_S.MR.Chen_Matthews_1.4.1.2022.11.16.15.50.20.357.31204552.dcm ?
For assigning mtime
reproducibly, does the time need to be plausible? My understanding of #537 was a case where the dicoms were missing all dates/times due to stripping during anonymization (e.g., https://github.com/nipy/heudiconv/issues/537#issuecomment-1045640009). Reproducibility could still be achieved by assigning mtime
to some constant (e.g., 1970, the date of the first release of datalad, etc), or some wacky value that's specific to the dicom like a hash of a UID interpreted as seconds
I think I am in rush mixed up what was missing ;) smaller one is good.
re mtime
-- we would be sorting by it , so must not be the same value since can't sort them reliably if they all have the same value. Could be some UUID I guess.
I'm seeing
% grep -ri mtime heudiconv
heudiconv/dicoms.py: """Get integer relating to dicom_list for setting mtimes reproducibly
heudiconv/dicoms.py: ti.mtime = dcm_time
heudiconv/cli/monitor.py: # time.time() - os.path.getmtime(paths2process[0]) > WAIT_TIME:
That lack of many instances of mtime
has me wondering where it would be used. Could you expand on the concern about sorting?
from a quick grep for sorted
with 47 hits, none look related besides "similar in spirit" but not directly affected https://github.com/nipy/heudiconv/blob/HEAD/heudiconv/heuristics/multires_7Tbold.py#L28 I guess. But overall - pretending to have time, which now sorts in some other order, seems very misleading to me - might be quite a rabbit hole someone would need to track down happen to run into something odd with such approach in their case.
That does sound like confusion worth avoiding.
I'd suggest pulling something like AcquisitionDateTime
or AcquisitionTime
, but the fields that I'm seeing are only "Conditionally Required (empty if unknown)" at best, or "Optional" at worst.
Am I correct in thinking that mtime
is (re)set in the first place for reasons related to datalad
? I think knowing why it needs to be set (and reproducibly so) could be helpful. Could reproducible mtime
in any way be designated as a "feature" that relies on the existence of a particular set of fields in the DICOM header (specifically, fields that might not be present in every valid DICOM)?
Am I correct in thinking that
mtime
is (re)set in the first place for reasons related todatalad
?
AFAIK not at all. If anything, datalad with its --fake-dates
fakes commit dates ;)
I think knowing why it needs to be set (and reproducibly so) could be helpful.
So that rerunning conversion doesn't produce unexpected difference, especially for large files such as dicom tarballs we are storing within sourcedata/
. That is the code you saw in https://github.com/nipy/heudiconv/issues/612#issuecomment-1428732715 and the code comment says it
dcm_time = get_dicom_series_time(dicom_list)
def _assign_dicom_time(ti):
# Reset the date to match the one from the dicom, not from the
# filesystem so we could sort reproducibly
ti.mtime = dcm_time
return ti
Another place where date/timestamps are used -- are populated in _scans.tsv
files. There I guess a lesser issue, in particular since we already add random string there.
Could reproducible
mtime
in any way be designated as a "feature" that relies on the existence of a particular set of fields in the DICOM header (specifically, fields that might not be present in every valid DICOM)?
may be, may be ... but again -- where do we have a use case where there is completely no date and time available - do you have an example? above we talked about where it just becomes two separate fields (for Date and Time). If no Date available, we can just choose date which BIDS assumes for annonymization if dates are prior 1925 so year could be produced reproducibly from smth like Jan 1 1900 + hash(UID) % (24*365) days
. Indeed we would loose sortability (no longer sure if it is valuable) but date would be clearly anonymized, and we use correct time. But then I am not sure if tar
would not start puking on those dates since that is before the epoch. So when storing to tar we need to offset by those 70 years, uff.
Indeed we would lose sortability (no longer sure if it is valuable)
Allowing loss of sortability would be easier. Though, ofc, I don't want to accidentally cause some unexpected issue in code/workflows that rely on this behavior.
where do we have a use case where there is completely no date and time available - do you have an example?
That was my understanding of the related case, #537 (specifically, this comment)
Though, I suppose, even in cases of anonymization, AcquisitionNumber (0020,0012)
is likely to be preserved. If mtime
is allowed to be based on something that is a bit arbitrary, then perhaps it would be sensible to set mtime
based on AcquisitionNumber
?
To avoid issues with epoch/tar
that you mention, the proposal would be
Jan 1 1970 + AcquisitionNumber (in years)
(try to anticipate cases of anonymization removing dates)hash(UID)
?)Another place where date/timestamps are used -- are populated in _scans.tsv files. There I guess a lesser issue, in particular since we already add random string there.
Thanks for bringing this up. I had been assuming that setting mtime
would be separate from values in acq_time
-- that when no date/datetime is available from the DICOM header then acq_time
would go to n/a
, as happens now. Just to be clear, do you think it would be better for the value in mtime
to match the value in theacq_time
column?
- If 1 and 2 are not possible, then error and provide message about missing information (alternatively, as a last resort rely on setting a totally fake date based on
hash(UID)
?)
ok
Just to be clear, do you think it would be better for the value in
mtime
to match the value in theacq_time
column?
if it is now -- then there is slight preference to keep it as that, but i do not really see such a need. those tar mtimes AFAIK are really just for reproducible tar generation.
Great, thanks for working through all this. I should be able to update the pull request by the end of the week, including elements in this discussion and the other requested revisions (e.g., python version, docstring style, etc)
:rocket: Issue was released in v0.13.0
:rocket:
Summary
I'm trying to use heudiconv to produce a bids folder from scans collected on a Siemens scanner that was recently upgraded to XA30. It's my impression that, to fill the
acq_time
column in the*scans.tsv
files, heudiconv uses the dicom tagAcquisitionDate
(here?). In the scans that I have, that tag isn't available.Here's a snippet from the logs, which shows a warning about the missing
AcquisitionDate
This is only a warning, and conversion proceeds. The main issue is that the
scans.tsv
file no longer has usable entries in theacq_time
column (rows filled withn/a
).Fortunately, there is an
AcquisitionDateTime
that could presumably fill the same role, although the formatting may need work.The fields used in get_dicom_series_time would also work (
SeriesDate
/SeriesTime
).Here is an example phantom scan with
AcquisitionDateTime
but notAcquisitionDate
: sub-sh_ses-221108_T1w.zipThis is perhaps related to #537, although again I'm only seeing a warning and not an error/crash
Platform details:
Choose one:
env.yml
psadil/heudiconv:221121 https://hub.docker.com/repository/docker/psadil/heudiconv
0.11.6 heuristic: reproin.py + a few entries in
protocols2fix
and hits fromfilter_dicom