HTTP Errors for Imaging Data Commons

psavery commented 11 months ago

The NCI's Imaging Data Commons is a big repository (>38k studies) for cancer research.

Their studies are available via DICOMweb. See here for an example of viewing one of them.

I tried to access that same example, but I get a couple of HTTP errors. Try out the following code:

from wsidicom import WsiDicom, WsiDicomWebClient

# For this one: https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.227261840503961430496812955999336758586/series/1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0
url = 'https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb'
study_uid = '2.25.227261840503961430496812955999336758586'
series_uid = '1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0'

client = WsiDicomWebClient.create_client(url)

slide = WsiDicom.open_web(client, study_uid, series_uid)

It produces the following exception:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb/studies/2.25.227261840503961430496812955999336758586/instances?00080016=1.2.840.10008.5.1.4.1.1.77.1.6&0020000E=1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0&includefield=AvailableTransferSyntaxUID
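
For readers less familiar with QIDO-RS, the query string in that URL can be decoded with the standard library. This is only a minimal sketch: the `TAG_KEYWORDS` table and the `describe_qido_query` helper are hand-written for this example (they are not part of wsidicom) and cover just the two tags that appear in the request.

```python
from urllib.parse import parse_qs, urlsplit

# Hand-written lookup for the tags used in this query (illustrative only).
TAG_KEYWORDS = {
    "00080016": "SOPClassUID",
    "0020000E": "SeriesInstanceUID",
}


def describe_qido_query(url: str) -> dict:
    """Return the query parameters with known DICOM tags replaced by keywords."""
    params = parse_qs(urlsplit(url).query)
    return {TAG_KEYWORDS.get(key, key): values for key, values in params.items()}


failing_url = (
    "https://proxy.imaging.datacommons.cancer.gov/current/"
    "viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb/"
    "studies/2.25.227261840503961430496812955999336758586/instances"
    "?00080016=1.2.840.10008.5.1.4.1.1.77.1.6"
    "&0020000E=1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0"
    "&includefield=AvailableTransferSyntaxUID"
)

# Print each parameter by keyword so the three parts of the query are visible.
for name, values in describe_qido_query(failing_url).items():
    print(name, "=", values[0])
```

So the request filters on SOPClassUID and SeriesInstanceUID, and additionally asks the server to include AvailableTransferSyntaxUID in each result.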

I played around with that url with curl and found a couple of errors. Accessing that full URL via this command:

curl 'https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb/studies/2.25.227261840503961430496812955999336758586/instances?00080016=1.2.840.10008.5.1.4.1.1.77.1.6&0020000E=1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0&includefield=AvailableTransferSyntaxUID'

Produces:

[{
  "error": {
    "code": 400,
    "message": "invalid QIDO-RS query: unknown/unsupported QIDO attribute: AvailableTransferSyntaxUID",
    "status": "INVALID_ARGUMENT"
  }
}
]

So it is raising an exception because we asked for the AvailableTransferSyntaxUID. However, if I remove that part of the url:

curl 'https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb/studies/2.25.227261840503961430496812955999336758586/instances?00080016=1.2.840.10008.5.1.4.1.1.77.1.6&0020000E=1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0'

It raises another error:

[{
  "error": {
    "code": 400,
    "message": "generic::invalid_argument: SOPClassUID is not a supported instance or series level attribute",
    "status": "INVALID_ARGUMENT"
  }
}
]

So it is also complaining that we are asking for a SOPClassUID of WSI_SOP_CLASS_UID in the search filters.

If I remove that part of the url also:

curl 'https://proxy.imaging.datacommons.cancer.gov/current/viewer-only-no-downloads-see-tinyurl-dot-com-slash-3j3d9jyp/dicomWeb/studies/2.25.227261840503961430496812955999336758586/instances?0020000E=1.3.6.1.4.1.5962.99.1.1334438926.1589741711.1637717011470.2.0'

It works fine.

It's kind of annoying that an error is being raised for AvailableTransferSyntaxUID. It would be nice if it just returned an empty field if it was not available.

However, it would be really nice if we could support interacting with this DICOMweb server.
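
One way a client could cope with a server like this is to retry with progressively simpler queries and do the SOPClassUID filtering itself. The sketch below is only an illustration of that idea, not how wsidicom handles it: `QueryRejected`, `sop_class_of`, `search_wsi_instances` and `fake_fetch` are all hypothetical names, and the fake server simply mimics the three curl results above with made-up instance UIDs.

```python
from typing import Callable, Dict, List, Optional

# VL Whole Slide Microscopy Image Storage, the SOP class used in the query above.
WSI_SOP_CLASS_UID = "1.2.840.10008.5.1.4.1.1.77.1.6"


class QueryRejected(Exception):
    """Hypothetical error a fetch callable raises when the server answers 400."""


def sop_class_of(instance: dict) -> Optional[str]:
    """Read SOPClassUID (0008,0016) from a DICOM JSON instance, if present."""
    values = instance.get("00080016", {}).get("Value") or [None]
    return values[0]


def search_wsi_instances(
    fetch: Callable[[Dict[str, str]], List[dict]], series_uid: str
) -> List[dict]:
    """Try the full query first, then drop parameters the server rejects."""
    full = {
        "00080016": WSI_SOP_CLASS_UID,
        "0020000E": series_uid,
        "includefield": "AvailableTransferSyntaxUID",
    }
    attempts = [
        full,
        {k: v for k, v in full.items() if k != "includefield"},
        {"0020000E": series_uid},
    ]
    for params in attempts:
        try:
            instances = fetch(params)
        except QueryRejected:
            continue  # try the next, less demanding variant
        if "00080016" not in params:
            # The server did not filter for us, so filter client-side.
            # Instances that do not report a SOPClassUID are kept.
            instances = [
                i for i in instances
                if sop_class_of(i) in (None, WSI_SOP_CLASS_UID)
            ]
        return instances
    raise QueryRejected("server rejected every query variant")


calls: List[Dict[str, str]] = []


def fake_fetch(params: Dict[str, str]) -> List[dict]:
    """Stand-in for the proxy: rejects the two parameters seen failing above."""
    calls.append(params)
    if "includefield" in params or "00080016" in params:
        raise QueryRejected("400 Bad Request")
    return [  # made-up SOPInstanceUIDs, one WSI and one secondary capture
        {"00080016": {"vr": "UI", "Value": [WSI_SOP_CLASS_UID]},
         "00080018": {"vr": "UI", "Value": ["1.2.3"]}},
        {"00080016": {"vr": "UI", "Value": ["1.2.840.10008.5.1.4.1.1.7"]},
         "00080018": {"vr": "UI", "Value": ["1.2.4"]}},
    ]


found = search_wsi_instances(fake_fetch, "1.2.3.4")
print(len(calls), "requests,", len(found), "WSI instance(s)")
```

In real use, `fetch` would wrap an HTTP GET against `{url}/studies/{study_uid}/instances` and raise `QueryRejected` on a 400 response. The cost of this approach is up to two wasted round trips per search, so a client would probably want to remember which variant a given server accepted.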

Let me know what your thoughts are, @erikogabrielsson.

erikogabrielsson commented 10 months ago

Hi @psavery, do you know what DICOM server they are running? SOPClassUID should be possible to use as a matching attribute; see Table 10.6.1-5, Required Matching Attributes.

psavery commented 10 months ago

I see that... hmm.

My reading from their forums seems to indicate they are using a "Google Cloud Healthcare DICOMWeb service". They put it behind a proxy to prevent full downloads. See here.

psavery commented 10 months ago

I am able to view an example dataset using our viewer if I make the changes here (although I know those are not changes that would be merged).

fedorov commented 9 months ago

@psavery I am Andrey Fedorov, one of the leads of Imaging Data Commons.

First of all, thank you for using IDC; it is great to hear that you found it useful! Feedback from users like you and others in the wsidicom community is what we need and is very welcome as we work on developing this resource.

Second, let me use this opportunity to explain how IDC curates data. We make all of our data available in public storage buckets. IDC data is replicated across Google Storage and AWS S3 buckets, so you can choose where to download it from. You do not need to log in, and there is no egress charge. To help you navigate and search the files IDC is sharing, we provide a searchable metadata index, available via a BigQuery SQL interface. The procedures for downloading data from IDC are described in this documentation page: https://learn.canceridc.dev/data/downloading-data.

More recently we have been working on the idc-index Python package to further simplify the process of searching and downloading IDC data: https://github.com/ImagingDataCommons/idc-index. We also have a 3D Slicer extension that provides an interactive interface for accessing IDC data from the desktop: https://github.com/ImagingDataCommons/SlicerIDCBrowser (I have yet to update our documentation pages with these new tools!)

In addition to having our data in public storage buckets, we also ingest it into a DICOM store provisioned via Google Healthcare API. That DICOM store is behind the proxy mentioned earlier in the thread, and the primary purpose of that DICOM store is to support visualization of IDC data using viewers integrated with IDC Portal - OHIF (for radiology) and Slim (for slide microscopy).

There are two main reasons why the DICOM store sits behind the proxy rather than being accessed directly. First, due to limitations of the Google Healthcare API, non-authenticated access to that store is impossible, and we want IDC users to be able to view images without logging in. The proxy routes the data without requiring a login. Second, unlike egress from the storage buckets, which is free, egress of data out of the cloud via the DICOMweb interface has to be paid from the IDC budget. To control costs we need to limit access to IDC data via DICOMweb, so the proxy implements daily IP-based egress quotas.

It's a lot of content, but I wanted to try to give you that as a background before the following.

Catch and fix Google Healthcare API errors #149

Since you were accessing DICOM stores via the proxy, you should not extrapolate and assume that the limitations you are observing are inherent to the Google Healthcare API rather than the proxy. I encourage you to experiment with direct access to Google DICOM stores by setting up your own (it is very easy: https://cloud.google.com/healthcare-api/docs/how-tos/dicom), and I am happy to help you set it up.

access to IDC data via DICOMweb

We really want you to use IDC data! But unfortunately, due to the reasons above, IDC is currently not designed to enable DICOMweb access to its content. Download from storage buckets is the intended pathway for data access. I understand this is rather suboptimal for digital pathology workflows. We are looking for ways to enable unrestricted and unlimited access via DICOMweb, and there is some hope we will be able to do it, but it is difficult at this point to estimate when this would be completed.

Finally, I encourage anyone using IDC to reach out to us via IDC forum. We came across this discussion by a fortunate accident, but we are here to help and are very interested in user feedback. It is very important for me to know that there is interest in the community in DICOMweb access to IDC data.

@psavery I would be very interested to have a meeting with you to discuss the related topics! We have been working on several joint projects with Kitware with @aylward @jcfr @thewtex among others, and I would love to learn more about your use cases. Please reach out via andrey dot fedorov at gmail to coordinate the time!

Sorry for the long post, but I hope this helps!

psavery commented 9 months ago

Hi @fedorov,

Nice to meet you!

This work was done in support of large_image, which includes a DICOMweb viewer.

I will defer this discussion to the project lead, @manthey.

Thank you! Patrick

psavery commented 9 months ago

By the way, for the record, I do believe #149 was fixing an issue specifically for Google Healthcare API (not something specific to the IDC proxy), because I saw the same issue mentioned in a few places on their GitHub repositories (one of which was 4 years ago here).

fedorov commented 8 months ago

@psavery to be clear, I just wanted to suggest that if you want to investigate a suspected bug in Google Healthcare, it is advisable to do this by interacting directly with a GHC DICOM store, without a proxy in the middle.

Also, we discussed this with David Clunie @dclunie and here is his perspective on the actual issue. Would be good if you could comment on item 3!

1. AvailableTransferSyntaxUID is an optional parameter that indicates what the server might be able to supply. There should be no expectation that it is supported (since it is relatively new) and no dependency on its value(s).

2. TransferSyntaxUID is not an appropriate surrogate, since it is part of the PS3.10 meta information about a particular returned dataset and not necessarily reflective of what the server has or might be able to transform it into. It might or might not be returned, and might or might not be what the caller wants to know.

3. Why is wsidicom asking for this and what behavior depends on it?
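
To illustrate the first point above, here is a minimal sketch of a client treating AvailableTransferSyntaxUID purely as a hint. It assumes the QIDO-RS results are in the DICOM JSON model and that the attribute's tag is (0008,3002), which should be checked against PS3.6. The helper names are hypothetical and are not part of wsidicom.

```python
from typing import List, Optional

# Assumed tag for AvailableTransferSyntaxUID in the DICOM JSON model.
AVAILABLE_TRANSFER_SYNTAX_UID = "00083002"


def available_transfer_syntaxes(instance: dict) -> Optional[List[str]]:
    """Return the advertised transfer syntaxes, or None if the server gave none."""
    element = instance.get(AVAILABLE_TRANSFER_SYNTAX_UID)
    if not element:
        return None
    values = element.get("Value")
    return list(values) if values else None


def acceptable_transfer_syntaxes(instance: dict, supported: List[str]) -> List[str]:
    """Narrow the client's supported list by the hint, but never depend on it.

    If the hint is missing, or shares nothing with what the client supports,
    the full supported list is returned unchanged and the server decides.
    """
    offered = available_transfer_syntaxes(instance)
    if offered is None:
        return list(supported)
    narrowed = [uid for uid in supported if uid in offered]
    return narrowed or list(supported)


# Client preferences: JPEG 2000 lossless, JPEG baseline, explicit VR little endian.
supported = [
    "1.2.840.10008.1.2.4.90",
    "1.2.840.10008.1.2.4.50",
    "1.2.840.10008.1.2.1",
]
with_hint = {"00083002": {"vr": "UI", "Value": ["1.2.840.10008.1.2.4.50"]}}
without_hint = {}

print(acceptable_transfer_syntaxes(with_hint, supported))
print(acceptable_transfer_syntaxes(without_hint, supported))
```

With this shape, a server that omits the attribute costs the client nothing beyond a less specific request.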

Finally, partly in response to your use case, we amended the IDC proxy policy to allow egress without restricting it to the IDC viewer. The per-IP daily quota still applies. Please see the updated proxy policy here: https://learn.canceridc.dev/portal/proxy-policy. I hope this helps you and other users interested in using DICOMweb for accessing IDC data!

imi-bigpicture / wsidicom

HTTP Errors for Imaging Data Commons #141