Arbitrary missing dicom files

markloyman commented 7 years ago

This issue is sort of a follow-up to the problem that was mentioned in pull request https://github.com/pylidc/pylidc/pull/5 (similar to issue https://github.com/pylidc/pylidc/issues/2)

In short: Some of the .dcm files are missing (not only 0000.dcm), which results in an error.

I checked with the LIDC online access to make sure that the problem isn't the result of failed downloads:

Some cases are like patient 1009, where all files are present, but the first index is 3. Or patient 0777, where all 187 files are present but start at number 5.
In patient 0048, the missing file (298) is the last one, but the total number of files is again consistent with the online dataset (297).

At the time, it seemed to me that some of the missing files were in the middle. However, I don't see any such cases in my log.
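
A quick way to check a scan folder for this kind of numbering problem is sketched below. It is only an illustration, not part of pylidc: the helper name `report_slice_numbering` is hypothetical, and it assumes the files are named with zero-padded numbers such as `000298.dcm`, as in the paths listed in this issue.

```python
import os
import re
import tempfile


def report_slice_numbering(dicom_dir):
    """Summarize the numeric .dcm file names (e.g. 000298.dcm) in dicom_dir."""
    pattern = re.compile(r"^(\d+)\.dcm$")
    matches = (pattern.match(name) for name in os.listdir(dicom_dir))
    numbers = sorted(int(m.group(1)) for m in matches if m)
    if not numbers:
        return {"count": 0, "first": None, "last": None, "gaps": []}
    present = set(numbers)
    # Any number between the first and last file that has no file on disk.
    gaps = [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
    return {
        "count": len(numbers),
        "first": numbers[0],
        "last": numbers[-1],
        "gaps": gaps,
    }


# Demo on a throwaway directory whose numbering starts at 3 and skips 5.
with tempfile.TemporaryDirectory() as demo_dir:
    for n in (3, 4, 6):
        open(os.path.join(demo_dir, "%06d.dcm" % n), "wb").close()
    summary = report_slice_numbering(demo_dir)

print(summary)
```

A `first` value greater than 1 corresponds to cases like patient 1009, and a non-empty `gaps` list would indicate files missing from the middle of a series.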

Code:

scan = pl.query(pl.Scan).filter(pl.Scan.series_instance_uid == '1.3.6.1.4.1.14519.5.2.1.6279.6001.250863365157630276148828903732').first()
scan.annotations[0].uniform_cubic_resample(side_length = 100)

Error:

Loading dicom files ... This may take a moment.

FileNotFoundError Traceback (most recent call last)
in ()
----> 1 scan.annotations[0].uniform_cubic_resample(side_length = 100)

E:\Anaconda2\envs\LIDC\lib\site-packages\pylidc\Annotation.py in uniform_cubic_resample(self, side_length, resample_vol, irp_pts, return_irp_pts, verbose)
    768
    769         # Load the images. Get the z positions.
--> 770         images = self.scan.load_all_dicom_images(verbose=verbose)
    771         img_zs = [float(img.ImagePositionPatient[-1]) for img in images]
    772         img_zs = np.unique(img_zs)

E:\Anaconda2\envs\LIDC\lib\site-packages\pylidc\Scan.py in load_all_dicom_images(self, verbose)
    315         images = []
    316         for dicom_file_name in sorted_fnames:
--> 317             with open(os.path.join(path, dicom_file_name), 'rb') as f:
    318                 images.append( dicom.read_file(f) )
    319         return images

FileNotFoundError: [Errno 2] No such file or directory: 'E:/Library/Datasets/DOI\\LIDC-IDRI-0048\\1.3.6.1.4.1.14519.5.2.1.6279.6001.208177797605474151106520124306\\1.3.6.1.4.1.14519.5.2.1.6279.6001.250863365157630276148828903732\\000298.dcm'

Full list of missing files that I encountered:

LIDC-IDRI-0048\1.3.6.1.4.1.14519.5.2.1.6279.6001.208177797605474151106520124306\1.3.6.1.4.1.14519.5.2.1.6279.6001.250863365157630276148828903732\000298.dcm
LIDC-IDRI-1011\1.3.6.1.4.1.14519.5.2.1.6279.6001.287560874054243719452635194040\1.3.6.1.4.1.14519.5.2.1.6279.6001.272123398257168239653655006815\000001.dcm
LIDC-IDRI-1011\1.3.6.1.4.1.14519.5.2.1.6279.6001.287560874054243719452635194040\1.3.6.1.4.1.14519.5.2.1.6279.6001.272123398257168239653655006815\000003.dcm
LIDC-IDRI-1011\1.3.6.1.4.1.14519.5.2.1.6279.6001.287560874054243719452635194040\1.3.6.1.4.1.14519.5.2.1.6279.6001.272123398257168239653655006815\000002.dcm
LIDC-IDRI-1010\1.3.6.1.4.1.14519.5.2.1.6279.6001.145373944605191222309393681361\1.3.6.1.4.1.14519.5.2.1.6279.6001.550599855064600241623943717588\000001.dcm
LIDC-IDRI-1010\1.3.6.1.4.1.14519.5.2.1.6279.6001.145373944605191222309393681361\1.3.6.1.4.1.14519.5.2.1.6279.6001.550599855064600241623943717588\000002.dcm
LIDC-IDRI-1009\1.3.6.1.4.1.14519.5.2.1.6279.6001.849069697860879761549990488101\1.3.6.1.4.1.14519.5.2.1.6279.6001.855232435861303786204450738044\000001.dcm
LIDC-IDRI-1009\1.3.6.1.4.1.14519.5.2.1.6279.6001.849069697860879761549990488101\1.3.6.1.4.1.14519.5.2.1.6279.6001.855232435861303786204450738044\000002.dcm
LIDC-IDRI-0777\1.3.6.1.4.1.14519.5.2.1.6279.6001.226719444846209417020566423366\1.3.6.1.4.1.14519.5.2.1.6279.6001.192256506776434538421891524301\000004.dcm
LIDC-IDRI-0777\1.3.6.1.4.1.14519.5.2.1.6279.6001.226719444846209417020566423366\1.3.6.1.4.1.14519.5.2.1.6279.6001.192256506776434538421891524301\000001.dcm
LIDC-IDRI-0777\1.3.6.1.4.1.14519.5.2.1.6279.6001.226719444846209417020566423366\1.3.6.1.4.1.14519.5.2.1.6279.6001.192256506776434538421891524301\000005.dcm
LIDC-IDRI-0777\1.3.6.1.4.1.14519.5.2.1.6279.6001.226719444846209417020566423366\1.3.6.1.4.1.14519.5.2.1.6279.6001.192256506776434538421891524301\000002.dcm
LIDC-IDRI-0777\1.3.6.1.4.1.14519.5.2.1.6279.6001.226719444846209417020566423366\1.3.6.1.4.1.14519.5.2.1.6279.6001.192256506776434538421891524301\000003.dcm
LIDC-IDRI-0127\1.3.6.1.4.1.14519.5.2.1.6279.6001.195975724868929317649402600442\1.3.6.1.4.1.14519.5.2.1.6279.6001.229343399861261429237689489892\000001.dcm

notmatthancock commented 7 years ago

Thanks for the bug report.

When I wrote the code to populate the sqlite database for this library, I assumed that the file names would always be the same. Under this assumption, I hard-coded an attribute on the Scan object, sorted_dicom_file_names, in order to eliminate the sort step from the DICOM loading function. Also, some scans are weird in that they have what appear to be duplicate slices with the same z-index. So the hard-coded attribute eliminated the need to sort the data every time, as well as the need to "prune" the duplicate slices if they exist.

In retrospect, it looks like hard-coding this was a bad idea, but I think we can fix it by making the load_all_dicom_images function more general, so that it loads and sorts the files on the fly.

Will you replace the load_all_dicom_images function with the following (in Scan.py) and let me know how it affects your issue?

def load_all_dicom_images(self, verbose=True):
    """
    ....
    """
    if verbose: print("Loading dicom files ... This may take a moment.")

    path = self.get_path_to_dicom_files()
    fnames = [fname for fname in os.listdir(path)
                        if fname.endswith('.dcm')]
    images = []
    for fname in fnames:
        with open(os.path.join(path, fname), 'rb') as f:
            image = dicom.read_file(f)
            images.append(image)

    # ##############################################
    # Clean multiple z scans.
    #
    # Some scans contain multiple slices with the same `z` coordinate 
    # from the `ImagePositionPatient` tag.
    # The arbitrary choice to take the slice with lesser 
    # `InstanceNumber` tag is made.
    # This takes some work to accomplish...
    zs    = [float(img.ImagePositionPatient[-1]) for img in images]
    inums = [float(img.InstanceNumber) for img in images]
    inds = range(len(zs))
    while np.unique(zs).shape[0] != len(inds):
        for i in inds:
            for j in inds:
                if i!=j and zs[i] == zs[j]:
                    k = i if inums[i] > inums[j] else j
                    inds.pop(inds.index(k))

    # Prune the duplicates found in the loops above.
    zs             = [zs[i]     for i in range(len(zs))     if i in inds]
    dcm_file_paths = [fnames[i] for i in range(len(fnames)) if i in inds]
    dcm_imgs       = [images[i] for i in range(len(images)) if i in inds]

    # Sort everything by (now unique) ImagePositionPatient z coordinate.
    sort_inds = np.argsort(zs)
    images    = [images[s] for s in sort_inds]
    # End multiple z clean.
    # ##############################################

    return images

markloyman commented 7 years ago

Hi, thanks for the quick solution. :)

I've tested it on a couple of instances, and it seems to work great. Now I'm launching my original code, which cycles through all annotations.

I will update later on whether there were any unexpected complications.

markloyman commented 7 years ago

Successfully read all nodule data.

Thank you. pylidc has been a tremendous help for me.

notmatthancock commented 7 years ago

Ok, glad to hear the fix appears to be working and that the library has been useful to you.

I think there's still a bug in the code above, in the part that deals with the case where there may be duplicate z-index slices. Specifically, the line,

dcm_imgs = [images[i] for i in range(len(images)) if i in inds]

should be changed to,

images = [images[i] for i in range(len(images)) if i in inds]

and the line preceding it can be removed.

These lines deal with the scans that contain duplicate z-slices. As you found, the code won't raise an error without this change, but you might (or might not) get weird results.

I'll have to double check by visual inspection that the code is working correctly for the "duplicate z" cases. If this code handles those cases correctly, I will add this fix to the next version to be released on pip.
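
For readers who want to see the intended behavior in isolation, here is a minimal sketch of the "keep the slice with the lesser InstanceNumber per z, then sort by z" rule. It is not the code shipped in pylidc: it uses a single dictionary pass instead of the nested loops above, and the `Slice` tuple and `prune_and_sort` name are hypothetical stand-ins for loaded DICOM images.

```python
from collections import namedtuple

# Stand-in for a loaded DICOM slice. Real code would take `z` from the last
# component of the ImagePositionPatient tag and `instance_number` from the
# InstanceNumber tag.
Slice = namedtuple("Slice", ["z", "instance_number"])


def prune_and_sort(slices):
    """Keep one slice per z (the one with the lowest instance_number),
    and return the survivors sorted by z."""
    best = {}
    for s in slices:
        kept = best.get(s.z)
        if kept is None or s.instance_number < kept.instance_number:
            best[s.z] = s
    return [best[z] for z in sorted(best)]


demo = [Slice(-10.0, 3), Slice(-12.5, 1), Slice(-10.0, 2), Slice(-7.5, 4)]
result = prune_and_sort(demo)
print(result)
```

In the demo, the two slices at z = -10.0 collapse to the one with instance number 2, and the output is ordered by z. Note that the pruned list is what gets sorted, which is the point of the `dcm_imgs` to `images` correction described above.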

notmatthancock commented 7 years ago

Ok Mark, the fix is in the latest pip version, so you can grab it with pip install --upgrade pylidc.

markloyman commented 7 years ago

Well, apparently I didn't read all nodule data. Just tried to re-run my code and I encountered a problem with duplicates pruning:

in load_all_dicom_images
inds.pop(inds.index(k))
AttributeError: 'range' object has no attribute 'pop'

inds is a range, which in Python 3 is an immutable sequence rather than a list, so you can't modify it. The simple fix is to change the initialization to inds = list(range(len(zs))).
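
The difference can be reproduced outside pylidc with a few lines. This is just a small illustration with made-up `zs` values, not code from the library:

```python
zs = [1.0, 2.0, 2.0]  # made-up z coordinates with one duplicate

# Python 3: range objects have no pop(), so this raises AttributeError.
inds = range(len(zs))
try:
    inds.pop(0)
    range_error = None
except AttributeError as exc:
    range_error = str(exc)

# Wrapping the range in list() gives a mutable copy that supports pop().
inds = list(range(len(zs)))
inds.pop(inds.index(2))

print(range_error)
print(inds)
```

Under Python 2, `range` returned a list, which is why the original loop worked there.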

notmatthancock commented 7 years ago

Thanks. I've added the fix to latest pip version.

notmatthancock / pylidc

Arbitrary missing dicom files #7
