suyashkumar commented 3 years ago

This is an issue to track various investigations and changes to optimize readNativeFrames implementation, which I noticed in pprof profiles seemed to take up a majority of the CPU time in parsing DICOMs with Native pixel data.

More comprehensive tests and benchmarks were added in #160 to aid with this.

suyashkumar commented 3 years ago

157 led to some significant performance improvements by reducing slice allocations.

suyashkumar commented 3 years ago

Also, on this subject, @kaxap has observed that eliminating calls to binary.Read may reduce slice allocations and improve performance a bit, so that's something we should try. In particular, doing something like the following to read uint8, uint16, uint32s, etc:

buf := make([]byte, bitsAllocated/8)
for pixel := 0; pixel < pixelsPerFrame*samplesPerPixel; pixel++ {
  _ := io.ReadFull(d, buf)
  val := int(binary.LittleEndian.Uint16(buf))
}

The main advantage here is reusing the buffer slice (since bitsAllocated should be constant), instead of binary.Read creating one within.

kaxap commented 3 years ago

it is also possible to lazily load frames from file which will result in smaller memory footprint (some dicom files can be gigabytes in size)

suyashkumar commented 3 years ago

This is somewhat along the lines of #162--which is allowing interested parties to not read the pixel data, or perhaps, to pause reading before actually parsing the PixelData (which can be done with the Parser API, but not at the frame by frame level, but at the tag by tag level). We can also read in the bytes and do nothing with unpacking them until a user wants the image.Image, which is almost what we do today. I'm leaning towards something similar to current behavior + an option to not read PixelData (or any blocklist of tags potentially). We also already have a "streaming" option where frames can be consumed by a consumer as they are parsed, instead of having to wait for the full set of frames (or the whole dicom) to be parsed.

What do you think? Also, would love to learn more about your use cases if you have any particular ones. Cheers!

kaxap commented 3 years ago

Most of the cases all I want to do is to get some tags and a single frame rendered (for a thumbnail or a preview). Lazy mode (when we save offsets in file for each frame instead of reading all of them into RAM) looks very appealing in this case.

bbeckrest commented 3 years ago

Regarding memory usage, we've recently had our server (32GB RAM) run out of memory while attempting to parse a 1GB zipfile containing dicoms.

Using pprof, I've benchmarked parsing a 55MB zipfile containing 222 dicom images (each image 526KB, total ~117MB unzipped) and noticed readNativeFrames() ballooning memory usage to ~1.74GB.

Is this high memory usage (15x) expected?

Screen Shot 2021-05-18 at 12 28 45 PM

suyashkumar commented 3 years ago

Interesting, thank you for reporting! I don't think this is expected. Out of curiosity, what version of the library are you using? Does it include some of the recent optimizations included in this issue?

I'll take a closer look tonight and dive deep if needed this weekend.

bbeckrest commented 3 years ago

Thanks for looking into this! I don't have much experience profiling performance so I appreciate the help.

Our prod server that ran out of memory is using v0.4.5, which obviously does not have the recent optimizations.

However, my benchmark is using v1.0.3. We were hoping to see that upgrading the library would solve the issue.

suyashkumar commented 3 years ago

Thanks! So, something that sticks out to me is that we are packing smaller integers into the possibly larger go int type in the NativeFrame. This is done in order to have a clean API with the same static types for the caller no matter what the underlying integer type might be (e.g. 8-bit, 16-bit unsigned integers). I think this is useful from the user's perspective, so they have one NativeFrame type to deal with, and they can use regular 'ol ints.

However, this could easily cause large increases in memory compared to the size of a DICOM on disk--for example an underlying set of 8-bit integers in the DICOM would be read into a slice of normal Go ints--which might be 32 or 64bits wide (8 times more memory).

This is a situation in which generics in Go would be very helpful, so we could have a NativeFrame<intLikeType>.

Simple option?

One temporary and somewhat simple solution could be using uint32 in this API instead of int, if I can verify that DICOMs don't ever have 64 bits of data in them for NativeFrames (this saves 32 bits per int on 64-bit wide systems).

More involved option

Another option is exploring having a different NativeFrame type per size of integer. While it's not the cleanest as just being able to access the underlying data directly, I could envision something like the below, where we make NativeFrame an interface and not a struct. And, we have 3 structs that implement the interface that wrap the underlying int types.


type NativeFrame interface {
    Rows() int
    Cols() int
    BitsPerSample int
    GetPixel(x, y int) int
}

type nativeFrame8 struct {
    underlyingData uint8
}
...
type nativeFrame16 struct {
    underlyingData uint16
}
...

In the absense of generics, this will allow our memory representation to be as small as possible, while still exposing a consistent integer API for any entities that want to use the NativeFrame data. That being said, if a caller is copying out the data from the underlying NativeFrame, and grabbing it as ints, we'll run into the same problem of expanding it in memory anyway, just downstream from here (e.g. when turning this into an image later for example, if it goes through an intermediate int stage)...so might need to look at a couple of our use cases to see.

Let me know if folks have any thoughts or if you can shed light on any use case specific details.

Also, this might not explain all of the memory overhead--will look into other possible options soon too.

We should be able to use the BenchmarkReadNativeFrames to see the impact of various possible changes on the memory profile at least in that fn.

bbeckrest commented 3 years ago

Reading through the DICOM spec, it appears that allocating more than 32 bits is rare. If I'm understanding correctly, 64 bits are only used for Double Float Pixel Data.

FWIW, I've looked at ~90 chest CT scans that have been uploaded to our web app by doctors and they all allocate 16 bits.

Our use case involves determining all viable series & a single "best" series within a CT scan with 1-N series, which is eventually segmented. For each series, we parse the header to immediately filter out series that don't meet our criteria (not enough slices, too large slice thickness, etc.). After getting a list of viable series, we parse & store the pixel data for every slice in these series (this can take up a lot of memory if the scan has many viable series). After creating a 3D volume from the series' pixel data, we save jpg images as slices through the volume for each anatomical plane (coronal, axial, sagittal). Saving the images doesn't require much memory since we write each pixel sequentially, converting it from int to uint16 (similar to NativeFrame.GetImage()).

We are in the process of refactoring code for our use case described above so that we're only storing pixel data for one series at a time, which will cut down our memory usage. Obviously, it'd still be nice to reduce footprint while parsing.

If you don't mind, I'm going to take a stab at implementing the NativeFrame interface. I don't think we should worry too much about downstream memory usage with this implementation. The user can drop bits by converting the int returned by GetPixel() to whatever uint type they wish so int isn't stored long-term. However, we should still be mindful of the performance implications from this conversion.

suyashkumar commented 3 years ago

@bbeckrest That is my understanding as well re: the size of the pixel data.

Regarding the NativeFrame interface I proposed in https://github.com/suyashkumar/dicom/issues/161#issuecomment-846706345 I think it's reasonable but keeping in mind my performance comment.

With my performance comment, what I meant was calling GetPixel() int in certain circumstances will allocate a fullsize integer. In some cases, that integer may only be temporary, and be sent for garbage collection--like if iterating and processing pixel by pixel--this is the scenario in which we will see a drastic reduction in memory. In other scenarios we will not see a reduction in memory footprint, at the cost of a slightly more API. That might be an okay tradeoff, but it is probably still a narrow set of use cases. On the otherhand, not using int in the API, and instead using int32 always will always help in 64 bit systems--but the user being able to ask for any size int from the API probably helps here, but we need to be careful about overflow. Maybe we should have APIs for other int sizes, but if there's overflow, we reutrn an error. e.g. GetAsUInt16() (uint16, error)

Overall I think my proposal is ok https://github.com/suyashkumar/dicom/issues/161#issuecomment-846706345, and I think it should be a good tradeoff in the interim, espicially if it will help you with your usecase. I'm not 100% sure if it will solve all of your problems, if during the creation of the 3D Volume you ever hold the whole series in memory as full ints, but you of course have a better sense of that than I! I think that generics in Go will help a lot here so we can have PixelData<IntType> types (or something like that).

Please let me know if you'd like to send a PR for this, and let me know if there are any questions! No worries if you don't have time, I can knock this out at some point soon too. 😄

bbeckrest commented 3 years ago

@suyashkumar Yeah the proposal in https://github.com/suyashkumar/dicom/issues/161#issuecomment-846706345 would be helpful for my use-case since the volume could be stored as uint16.

Yesterday, I attempted to implement it but got stuck and realized I didn't really know what I was doing (here's my attempt). It's a decently large & important change so I'll probably defer to you here. I was having an issue implementing the NativeFrames interface while minimizing the API change.

suyashkumar commented 3 years ago

hi @bbeckrest! So there is another issue that may somehow be related to this (though I'm not exactly sure how or why). For files that have a weird bytesAllocated, we ended up not reading the pixel data correctly, which I'm addressing in #210. Another user of the library was experiencing around 1.7GB of memory being allocated (that wasn't being cleaned up properly by the Go gc) and their dicom was triggering this condition. It could still very well be an independent issue. Just figured I'd mention it!

Anyway, I do think I will circle back to my proposed solution above, and will take a look at your attempt as well!

ducquangkstn commented 2 years ago

though I'm not exactly sure how or why

We have a DICOM (33Mb) with bitAllocated = 1

using pydicom to read the pixel data results into an array with type=uint8 251 Mb ~ 33 Mb * 8
using this library results into about 2Gb ~ 251 Mb * 8 memory located, the output array is []int (64bit)

I guess the issue is from data type representation.

suyashkumar commented 2 years ago

Yep, in the case you describe that is the case (though in the comment above I was referring to another issue).

https://github.com/suyashkumar/dicom/issues/161#issuecomment-846706345 is where I lay out a possible interface for handling this in as clean of a way as possible. I can probably also add a getData interface{} that people can type assert to the right underlying type too if they just want the underlying data.

Though now that go has generics in 1.18 we may be able to adopt those to have a NativeFrame, but if we do, then folks will need to have a hard dep on at least go 1.18 and we would have generics exposed in a public API (which may not be that bad at this point...)

suyashkumar commented 3 months ago

I have been playing around with some loose hacking on what a multi-go-integer NativeFrame might look like here: https://github.com/suyashkumar/dicom/pull/315. It needs a lot of clean up and extra testing, but feel free to play with it if you have dicoms that were previously causing you issues.

Looking at the frame APIs again, I do think they need to be revisited in general, but this is something that should get the desired effect at least for now.

suyashkumar commented 3 months ago

So with the exploratory changes in #315, I've learned of another large contributor to this memory ballooning, and it's not the uint16->int64 expansion interestingly iiuc.

Right now, the way frames parse the data is a 2D array something like [][]uint16, with the first dimension being the pixel, and the inner dimension being the number of values per pixel (as a slice). So the way this is setup, we have 1 []uint16 slice per pixel. Well, it looks like an empty slice descriptor in Go itself is 24 bytes (!) (initial ref), so this means we're allocating 24 bytes per pixel to often times store 1-3 values per pixel (that are each 2 bytes only in the case of uint16)!

So, at a minimum, we might want to store a flat underlying representation of the values in the frame, and then offer APIs that help you navigate by pixel easily as needed. (That is, if we even want to process NativePixelData at all during processing).

Either way, even if we process it later, we'll want to be mindful of this. I haven't verified that this is the issue, but I'm guessing much of the memory is actually going to the slice headers as opposed to the uint16->int64 expansion!

suyashkumar commented 1 month ago

Some updates: in https://github.com/suyashkumar/dicom/pull/315#issuecomment-2274827280 you'll see an implementation that yields significant memory and cpu improvements for native pixeldata processing. I still need to iterate and clean up some tests / api though. Please let me know if you have any thoughts in general here though. This will be a pretty significant change to how Native PixelData is represented and interacted with.

suyashkumar / dicom

readNativeFrames Performance Optimizations #161

157 led to some significant performance improvements by reducing slice allocations.

Simple option?

More involved option