notmatthancock / pylidc

An object relational mapping for the LIDC dataset using sqlalchemy.
https://pylidc.github.io

Potential ambiguities in clustering of annotations #18

Closed fedorov closed 5 years ago

fedorov commented 5 years ago

@notmatthancock I was (again) reading this paper, and came across the example shown below (Fig.8):

Armato et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38, 915–931 (2011). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041807/

image

What is expected from the clustering of annotations into nodules, as implemented in pylidc, in this situation? I do not know which case was used in that illustration, so cannot quickly check what is actually happening.

notmatthancock commented 5 years ago

It depends on how the latter annotation is stored. If it is stored as separate annotations, then the clustering routine would yield a cluster with more than 4 annotations. A warning is printed when this occurs (see the docs of the cluster function, in particular the min_tol parameter).
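As a rough illustration of that check, the sketch below flags clusters that hold more annotations than there were readers. It works on plain lists rather than pylidc objects, and the helper name `oversized_clusters` and its `max_readers` parameter are made up for this example (they are not part of pylidc). The limit of 4 reflects the at most four radiologists who annotated each LIDC scan.

```python
def oversized_clusters(clusters, max_readers=4):
    """Return (index, size) for every cluster larger than max_readers.

    `clusters` is a list of lists, shaped like the output of
    scan.cluster_annotations(); the elements can be any objects.
    """
    return [(i, len(c)) for i, c in enumerate(clusters) if len(c) > max_readers]


# Hypothetical clustering result: the second "nodule" collected 6 annotations,
# e.g. because some readers outlined it in pieces and others as a whole.
example = [["a1", "a2", "a3"], ["b1", "b2", "b3", "b4", "b5", "b6"], ["c1"]]
flagged = oversized_clusters(example)
print(flagged)
```

With real data, the list returned by `scan.cluster_annotations()` could presumably be passed in directly, since the helper only looks at the length of each group.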

Yes, it's difficult to hunt this down without some identifier. Here are the scan IDs for which the clustering routine yields clusters with more than 4 annotations:

[66, 100, 140, 205, 252, 334, 342, 370, 408, 613, 713, 790, 792, 840]

If you'd like, you can hunt this down further with the IDs above like so:

from pylidc import *
scan = query(Scan).get( 66 )
nodules = scan.cluster_annotations()
scan.visualize(annotation_groups=nodules)

I don't have the dataset on my laptop to confirm which of the IDs above corresponds to the scan containing the case you've pointed out in your comment, but I strongly suspect it is one of them. I would also suspect that, for this particular nodule, the clustering routine would yield a group of 6 annotations, but I don't have the time at the moment to provide visual confirmation.

fedorov commented 5 years ago

This is very helpful. I should be able to quickly check those tomorrow. Thanks!

fedorov commented 5 years ago

55 is a good example case, shown below (I reformatted the slice to show an oblique plane, since otherwise there is no single slice showing the lesion annotated both in pieces and as a whole).

Two sub-groups of nodules:

image

Single nodule:

image

I believe saving those components as annotations belonging to the same nodule is fine, since due to the way the study was set up, we cannot determine which of the sub-nodules belong to which readers. And even if we knew that, we would not know if the reader was annotating the same nodule, or interpreted that as different nodules, since annotations do not have nodule identifiers. We also will declare in the documentation that assignment of annotations to nodules was done algorithmically in pylidc.

I think what pylidc is doing makes sense. I will examine other cases visually as well.

For the sake of completeness, here is the list of cases (generated with this script) where either warnings were raised, or there was a mismatch between the total number of annotations and the number of annotations included in pylidc-defined annotations clusters:

fedorov commented 5 years ago

@notmatthancock I was investigating what is going on in the situations where some annotations were not assigned to any cluster, and I see something I cannot explain.

In 132 I see the following:

image

Note that the annotations denoted by the red arrow are all assigned to the same nodule cluster (nodule 4), while the annotations seem to be in a location that does not correspond to the actual nodule.

Annotations with the white arrow have not been assigned to any cluster, but seem to line up with the nodule in the image.

Any thoughts what might be going on?

I am now going to check whether I see the same in the pylidc-generated visualizations, but I had trouble getting those to work in the past.

fedorov commented 5 years ago

It looks like somewhere somehow some annotations were flipped.

image image

fedorov commented 5 years ago

Here's the code I used to debug this issue in pylidc proper:

import pylidc as pl
pid = 'LIDC-IDRI-0132'
scan = pl.query(pl.Scan).filter(pl.Scan.patient_id == pid).first()

annotations = pl.query(pl.Annotation).join(pl.Scan).filter(pl.Scan.patient_id == pid)
nodules = scan.cluster_annotations()

annotationsInNodulesList = []
# Defined up front so the name exists even when no annotations are missing.
annotationsNotInNodules = []
for nCount,nodule in enumerate(nodules):
  print("  Nodule %d has %d annotations" % (nCount+1, len(nodule)))
  for a in nodule:
    annotationsInNodulesList.append(a.id)
if len(annotationsInNodulesList) != annotations.count():
  print("   WARNING: %d annotations unaccounted for!" % (annotations.count()-len(annotationsInNodulesList)))
  for a in annotations:
    if a.id not in annotationsInNodulesList:
      annotationsNotInNodules.append(a.id)
      print("%d (%s) not assigned to a nodule" % (a.id, a._nodule_id))
    else:
      print("%d (%s) assigned to a nodule" % (a.id, a._nodule_id))
#  print("IDs of annotations not in nodules: "+str(annotationsNotInNodules))

outliers = [ [i] for i in annotations if i.id in annotationsNotInNodules]

scan.visualize(annotation_groups=outliers)

Looks like something is not right here, since the annotation arrows do not seem to point to anything in the image that looks like a nodule:

image

I then tried to visualize in pylidc the same nodule I looked at in Slicer, and it shows up in a different location than what I see in Slicer:

outliers = [ [i] for i in annotations if i.id == 45] # _nodule_id 18366
scan.visualize(annotation_groups=outliers)

pylidc:

image

Slicer:

image

BUT if I then take a nodule that shows up lining up in Slicer, it does not seem to correspond to a nodule in pylidc!

outliers = [ [i] for i in annotations if i._nodule_id == "13112"]
scan.visualize(annotation_groups=outliers)

pylidc:

image

Slicer:

image

@notmatthancock can you explain what is going on with those nodules not assigned to clusters, and why they seem to correspond to locations in the image where there are no nodules? Is this some geometry transformation issue?

Once I understand what is going on on the pylidc side, I will be in a better position to investigate what's going on in my conversion process.

notmatthancock commented 5 years ago

Thanks for all the details @fedorov. I'll look into this.

fedorov commented 5 years ago

I made a helper utility to examine cluster assignment and visualize specific annotations/groups of annotations here, if this helps: https://github.com/QIICR/lidc2dicom/blob/master/checkClusters.py

fedorov commented 5 years ago

I think I figured it out (unless I am missing something). All of those subjects that have annotations "unaccounted for" have more than 1 scan, and the annotations that have not been assigned to any cluster are the ones belonging to the other scan. That makes sense: cluster_annotations is a method of Scan, so annotations from another scan are left out! Duh.

https://github.com/QIICR/lidc2dicom/blob/master/scansPerSubject.py

$ python scansPerSubject.py  
LIDC-IDRI-0132 has 2 scans
  Scan 1 has 12 annotations
  Scan 2 has 17 annotations
LIDC-IDRI-0151 has 2 scans
  Scan 1 has 3 annotations
  Scan 2 has 4 annotations
LIDC-IDRI-0315 has 2 scans
  Scan 1 has 18 annotations
  Scan 2 has 19 annotations
LIDC-IDRI-0332 has 2 scans
  Scan 1 has 8 annotations
  Scan 2 has 9 annotations
LIDC-IDRI-0355 has 2 scans
  Scan 1 has 1 annotations
  Scan 2 has 3 annotations
LIDC-IDRI-0365 has 2 scans
  Scan 1 has 4 annotations
  Scan 2 has 4 annotations
LIDC-IDRI-0442 has 2 scans
  Scan 1 has 10 annotations
  Scan 2 has 9 annotations
LIDC-IDRI-0484 has 2 scans
  Scan 1 has 5 annotations
  Scan 2 has 5 annotations

It was my fault for not reading the documentation. Sorry for bothering you with this!
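For readers without the dataset at hand, a tally like the one above can be sketched in plain Python. This is not the linked scansPerSubject.py script, just a minimal stand-in under stated assumptions: the rows are hard-coded tuples, the annotation counts for the two LIDC subjects are copied from the output above, the scan numbers are placeholders rather than real pylidc IDs, and `SUBJECT-X` is an invented single-scan subject added to show the filtering.

```python
from collections import defaultdict

# (patient_id, scan_number, annotation_count) rows.
rows = [
    ("LIDC-IDRI-0132", 1, 12),
    ("LIDC-IDRI-0132", 2, 17),
    ("LIDC-IDRI-0151", 1, 3),
    ("LIDC-IDRI-0151", 2, 4),
    ("SUBJECT-X", 1, 4),  # hypothetical subject with a single scan
]


def subjects_with_multiple_scans(rows):
    """Group rows by patient ID and keep only subjects with more than one scan."""
    by_subject = defaultdict(list)
    for patient_id, scan_number, n_annotations in rows:
        by_subject[patient_id].append((scan_number, n_annotations))
    return {pid: scans for pid, scans in by_subject.items() if len(scans) > 1}


multi = subjects_with_multiple_scans(rows)
for pid, scans in multi.items():
    print("%s has %d scans" % (pid, len(scans)))
    for scan_number, n in scans:
        print("  Scan %d has %d annotations" % (scan_number, n))
```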

notmatthancock commented 5 years ago

In your code, the line:

annotations = pl.query(pl.Annotation).join(pl.Scan).filter(pl.Scan.patient_id == pid)

retrieves all annotations for all scans having the given patient ID, which is not necessarily equivalent to retrieving all annotations for a given scan.

To retrieve all annotations for a given scan, we can simply do:

annotations = scan.annotations

which is syntactic sugar for the equivalent means of filtering on the foreign key:

annotations = pl.query(pl.Annotation).filter(pl.Annotation.scan_id == scan.id)
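The difference between the two filters can be reproduced without pylidc or the dataset. The sketch below uses the standard-library `sqlite3` module with a two-table layout that only imitates the scan/annotation relationship; the table layout and the scan IDs 1 and 2 are assumptions for illustration, not pylidc's actual schema. The counts of 12 and 17 annotations are taken from the LIDC-IDRI-0132 tally earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scans (id INTEGER PRIMARY KEY, patient_id TEXT);
    CREATE TABLE annotations (
        id INTEGER PRIMARY KEY,
        scan_id INTEGER REFERENCES scans(id)
    );
""")

# One subject with two scans holding 12 and 17 annotations respectively.
conn.executemany("INSERT INTO scans VALUES (?, ?)",
                 [(1, "LIDC-IDRI-0132"), (2, "LIDC-IDRI-0132")])
conn.executemany("INSERT INTO annotations (scan_id) VALUES (?)",
                 [(1,)] * 12 + [(2,)] * 17)

# Filtering on the patient ID (through a join) spans both scans.
by_patient = conn.execute(
    "SELECT COUNT(*) FROM annotations a JOIN scans s ON a.scan_id = s.id "
    "WHERE s.patient_id = ?", ("LIDC-IDRI-0132",)).fetchone()[0]

# Filtering on the foreign key returns the annotations of one scan only.
by_scan = conn.execute(
    "SELECT COUNT(*) FROM annotations WHERE scan_id = ?", (1,)).fetchone()[0]

print(by_patient, by_scan)
```

The gap between the two counts is the set of annotations that belong to the subject's other scan, which is what showed up as "unaccounted for".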

The id attribute is the unique primary key for every scan object while patient_id is not unique, which ties into your last comment. With the primary key, you can retrieve using the more concise syntax:

scan = pl.query(pl.Scan).get(scan_id)

with the drawback being of course that scan_id is just some arbitrary integer specific to pylidc.

The annotation_groups argument of the scan.visualize function is intended to accept only the list of lists returned by calling cluster_annotations for the respective scan. In fact, this argument should really be a bool like indicate_annotations=True rather than accepting an arbitrary list in order to avoid potential funny business.
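In the meantime, a caller could guard against passing in annotations from the wrong scan. The following is only a sketch of such a check: `Ann` is a stand-in class and `foreign_annotations` is a hypothetical helper, neither of which exists in pylidc. The one thing it relies on is that each annotation carries a `scan_id`, as in the foreign-key filter above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    """Stand-in for an annotation: only the fields this check needs."""
    id: int
    scan_id: int


def foreign_annotations(scan_id, annotation_groups):
    """Return the annotations in the groups that do not belong to scan_id."""
    return [a for group in annotation_groups for a in group
            if a.scan_id != scan_id]


# Hypothetical IDs: annotation 3 belongs to scan 8, not scan 7.
groups = [[Ann(1, 7), Ann(2, 7)], [Ann(3, 8)]]
strays = foreign_annotations(7, groups)
print([a.id for a in strays])
```

An empty result would mean every annotation in the groups belongs to the scan being visualized.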

fedorov commented 5 years ago

Yes, thanks - I made the incorrect assumption initially that each subject has just one scan, and that's where it started.

notmatthancock commented 5 years ago

I see. In retrospect, a Subject class with a one-to-many relationship to the Scan class would have made this more explicit.

notmatthancock commented 5 years ago

Closing this -- appears to be resolved via discussions above.