Unity Jobs and Burst - Githubissues

mlavik1 / UnityVolumeRendering

Volume rendering, implemented in Unity3D. Want to support the project? Donate a small sum to Save The Children(https://www.savethechildren.net/) or another charity, and send me a message, and I will be greatly motivated!

https://fosstodon.org/@unityvolrend

MIT License

458 stars 129 forks source link

Unity Jobs and Burst #204

Open SitronX opened 1 year ago

SitronX commented 1 year ago

Hello @mlavik1 , I have been recently researching Unity Jobs and Burst compiler. And i tried to incorporate it here with loading, processing data, making textures, etc.... By how your library processes the data, it is really nicely setup for parallel workflow. I have to say results are really nice. My huge dataset with 730 slices previously loaded and processed data around 2:30 minutes. Now it lasts around 17-20 seconds with Jobs+Burst and CPU is not even maxxed out thanks to SIMD etc... Normal datasets are blazing fast in therms of seconds.

I also have other processing and loading of segmentation in my project, so here loading and processing would be even faster.

Default loading - 2:30 minutes	Multithreaded Jobs - 55 seconds	Jobs+Burst - 17 seconds

Due to the vectorization and other stuff, it is kinda hard to keep exact track of loading state and it would slow the process by a ton, so i removed the percent reporting and only have stages reporting now, but i think it is kinda pointless to track exact percent when with Burst everything is matter of seconds.

The downside might be that Burst is supported from Unity 2018+ i think, so older Unity versions would not work.

I know you made bunch of progression trackers, even for Gradient texture, but i think having blazing fast load is better than watching the loading indicator :D

What do you think about this?

Edit: Also the question. Is there a reason why Gradient is saved in Color, instead of Color32? Gradient texture takes extremely lot of space and with the use of Color32 the memory usage would be 1/4. To me, the Color32 looks the same and i dont see any loss of detail. Was there a particular reason why you used Color class?

mlavik1 commented 1 year ago

Hi again! Thanks for the suggestion!

The downside might be that Burst is supported from Unity 2018+ i think, so older Unity versions would not work.

Well, the biggest issue with Burst would be that it's not supported on all platforms (WebGL), but since the Jobs System can be used without Burst then that's not a big issue :) And looks like the Jobs system is available in 2018+? I don't think I'll be supporting anything older than that in future releases anyway, so that should be fine!

I'm curious about your numbers and implementation though. Which dataset importers are you testing with? I would expect that most of the work that the SimpleITK importer does can't be made any faster using the jobs system (if it's even possible to use it form a job?), since it deals with managed data and the underlying implementations is already in native code. So in your case, is it the post-processing that takes the most time? More specifically, gradient calculation? Do you know which parts takes the most time?

Things like gradient and TF calculation is definitely something that could be improved using jobs! Though, I'm not sure if we even need to store this data on the CPU, so I've also thought of the possibility of using compute shaders for that, but again WebGL is annoying with its lack of compute shader support :sweat_smile:

So yes, I think this is a great suggestion! Though, I'm curious about exactly what code you moved to the jobs system (not the actual SimpleITK-related stuff I guess?). I'll also do some investigation on my side :)

Also the question. Is there a reason why Gradient is saved in Color, instead of Color32?

That has been there since day 1 (when this was just a simple weekend project I was not planning to touch again :joy: ) so I can't say I remember for sure, but I believe it was because of precision. I haven't put any work into investigating if switching to Color32 has a visual impact. But in theory there can potentially be a big difference, between 32 bit floats and 8 bit unsigned integers. Though, whether or not you will notice a difference would likely depend on both your data and how you use the gradients. Since the 2D TF editor is still very basic, it's not that easy to make a good test case :sweat_smile: And it might be more noticable if we change how gradients are calculated, as suggested in #198. That would be even slower though, so again - using the Jobs System or compute shaders would likely be needed :)

SitronX commented 1 year ago

so I've also thought of the possibility of using compute shaders for that, but again WebGL is annoying with its lack of compute shader support

Ye i suppose burst can be just disabled if the platform doesnt support it. Compute shaders sounds nice, but also like overkill to me :) The majority time of waiting in the app now is in between bursts, when waiting for the next burst job to be initialized so i also think GPU would not help much more here.

I'm curious about your numbers and implementation though. Which dataset importers are you testing with? I would expect that most of the work that the SimpleITK importer does can't be made any faster using the jobs system (if it's even possible to use it form a job?)...

Yes, there is like no point of doing anything with simpleITK. SimpleITK is extremely quick even for very large datasets. I only added burst to all demanding things in postprocessing in VolumeDataset.cs . So now downscaleData, finding min/max, creating default texture, creating gradient texture have burst variants (in my project i also have bunch of other postprocessing regarding segmentation and separate layers where burst helped by a ton) . Creating gradient took always the longest time.

I haven't put any work into investigating if switching to Color32 has a visual impact. But in theory there can potentially be a big difference, between 32 bit floats and 8 bit unsigned integers.

Like in theory yes, but practically to me it seems worth it to switch to Color32 with the amount of memory saved. The one thing is how precisely it is stored, the other thing if a human eye can even see such a precision. It also means extremely large datasets will probably not crash with OutOfMemory exception anymore, cause the gradient previously took extremely lot of space. But it might be because i am mainly working with 1D TF, so it might be different with 2D.

mlavik1 commented 1 year ago

Yes, there is like no point of doing anything with simpleITK. SimpleITK is extremely quick even for very large datasets. I only added burst to all demanding things in postprocessing in VolumeDataset.cs ...

Ok, thanks for the info! Which dataset did you test with btw? One of the ircadb ones? I guess it must have been a rather big one. But yes, this is definitely a great use case for the Jobs system! I suppose we could easily combine it with the current async code as well. We probably don't want to block the main thread while the importers are doing their work (especially for the OpenDICOM importer, which I know some people use on WebGL and other platforms where the SimpleITK integration isn't available), so we could probably trigger the jobs after the importer has done its work and then have it await the job? Not sure how you've done it, but I'll take a look at your repo :)

Like in theory yes, but practically to me it seems worth it to switch to Color32 with the amount of memory saved...

Yes, it's not unlikely that Color32 is good enough for 2D transfer functions as well! But I suppose we could support both, and add a setting for this? More code to maintain though, so I'll do some testing and consider just switching to Color32 completely as you suggested.

Also, we might also want to consider changing the texture format on the GPU as well. Currently it uses half precision floats when supported. If we switched to Color32 it might be better to change the texture format to something that matches, thus reducing memory usage by quite a lot - and probably performance as well (less data => less overhead for texture fetch). Did you try that already maybe? (for example TextureFormat.RGBA32 )

SitronX commented 1 year ago

Which dataset did you test with btw? One of the ircadb ones?

That 2:30 minutes is for internal hospital specific dataset, which is huge with 730 slices. But even for this dataset SimpleITK is very fast in therms of single seconds. Ircads are piece of cake, the whole loading and processing for them is usually under 10 seconds :D

I suppose we could easily combine it with the current async code as well.

Yes i have it combined with async methods, it works great. You can take a look here if you want. This class will be probably quite different from your version, cause there are a lot of modifications. But you could see how it works on CreateTextureInternalAsync method quite easily, other methods work with very similar logic.

But I suppose we could support both, and add a setting for this?

Ye, that is great idea if it turns out there might be slight visual difference.

Also, we might also want to consider changing the texture format on the GPU as well.

Yep, i am already using RGBA32 texture format with Gradients in my project. Havent done performance tests yet, but there should be some improvement.

mlavik1 commented 1 year ago

That 2:30 minutes is for internal hospital specific dataset, which is huge with 730 slices. But even for this dataset SimpleITK is very fast in therms of single seconds. Ircads are piece of cake, the whole loading and processing for them is usually under 10 seconds :D

Yes, SimpleITK is lightning fast! That's a really big improvement in import time then :grin: I would expect it to be faster with the job system and Burst, but this is even more than I had imagined. Thanks a lot for sharing!

Yes i have it combined with async methods, it works great. You can take a look here if you want...

Thank you very much! I'll have a look at it sometime soon. I think this fits nicely together with some other things I wanted to do, such as improving the gradient calculations (#198) and maybe adding some filters (gaussian blur, etc.).

Yep, i am already using RGBA32 texture format with Gradients in my project. Havent done performance tests yet, but there should be some improvement.

Ok, good! I suspect it's a bit faster, because I saw a big improvement when switching to half precision floating point format.

And again, thanks a lot for all you feedback, suggestion and code contributions so far! I've heard a lot of positive feedback regarding the async loading and sphere cutout tool from other people using this plugin :D

SitronX commented 1 year ago

Yes, SimpleITK is lightning fast! That's a really big improvement in import time then 😁 I would expect it to be faster with the job system and Burst, but this is even more than I had imagined. Thanks a lot for sharing!

Oh no, im sorry. My formulation of that sentence was confusing. What i ment is, that SimpleITK is so fast, that there is not really point of doing anything with it, because it only takes few seconds to import that huge dataset, and the rest of the time was spent in postprocessing. So that 2:30 ->17 second was achieved by simply adding jobs&burst to postprocessing. I havent messed with SimpleITK.

I think this fits nicely together with some other things I wanted to do, such as improving the gradient calculations (https://github.com/mlavik1/UnityVolumeRendering/issues/198) and maybe adding some filters (gaussian blur, etc.).

Filters sounds really nice. Jobbing&bursting the postprocessing should be fairly straightforward. Do you want me to make a PR or do you want to do it yourself, with adding bunch of stuff around it and disclaimers of Unity 2018+ ?

And again, thanks a lot for all you feedback, suggestion and code contributions so far! I've heard a lot of positive feedback regarding the async loading and sphere cutout tool from other people using this plugin :D

Haha that is nice to hear :D Take these contributions as a big thanks for the amazing library of yours, that made it possible to make an app in reasonable time-window, that surgeons plan to use in real clinical environments :)

mlavik1 commented 1 year ago

Oh no, im sorry. My formulation of that sentence was confusing.

Oh, I got what you meant :) Maybe my reply was a bit confusing haha. What I meant is: 2:30 ->17 is a great improvement, even more than I would have expected from adding jobs&burst to postprocessing. So that's great news! Especially since I'm planning to experiment with slightly more computionally expensive gradient estimates. I'm also wondering if it would be useful to add support for dynamically updating the gradients based on clipping planes and transfer functions?

Do you want me to make a PR or do you want to do it yourself

I was thinking of starting on this maybe next week, but if you already have something that's almost ready for PR, then feel free to make one! :+1: I might end up modifying it a bit, since I'd like to do some clean-up work (splitting up the VolumeDataset , etc.. would probably be nice if that class only contains the data, and all the calculations are moved to separate classes).

I'm working on something else related to gradients and lighting now, that I'm going to merge in first: Adding gradient thresholds for lighting, to avoid shading parts of the volume where the gradients are near-zero, which creates a lot of noise. It will be an adjustable setting:

And a gradient threshold for the isosurface rendering mode, to allow you to filter out low-gradient parts of the volume:

Would this be relevant for your project? If so I'll add you as a reviewer, if you don't mind :)

Haha that is nice to hear :D Take these contributions as a big thanks for the amazing library of yours, that made it possible to make an app in reasonable time-window, that surgeons plan to use in real clinical environments

Thanks a lot! And wow, that's really exciting :D Don't hesitate to let me know if you run into any issues or limitations.

SitronX commented 1 year ago

I'm also wondering if it would be useful to add support for dynamically updating the gradients based on clipping planes and transfer functions?

Sure, more user settings is better :D I wonder how different will it look.

was thinking of starting on this maybe next week, but if you already have something that's almost ready for PR, then feel free to make one! 👍

Oh, if you are already getting to it i would probably leave it to you :D I havent started with adding my changes to the fork for PR yet.

Would this be relevant for your project? If so I'll add you as a reviewer, if you don't mind :)

That is cool. Yes please :)

mlavik1 commented 1 year ago

I again @SitronX ! I've been looking into the job system task a little bit recently. I ran into one issue I'm not yet sure how to solve: Jobs need to work with native containers (NativeArray, etc.). So we need to first copy the data over to a NativeArray, which is what you do here. Normally that should be fine. However, I know there are some people using this plugin with relatively large datasets on memory-constrained platforms, where doing another such large allocation may cause problems...

Maybe it would be better to replace the original data array, with a NativeArray and just pass that directly to the jobs (and use a lock to make sure it's not modified elsewhere before the job finishes)? However, Unity's default serialisation doesn't seem to work with native containers, and AFAIK there is no way to do custom serialisation, except using OnBeforeSerialize to copy the data to a normal array - which requires another large memory allocation... So switching to NativeArray would break serialisation. Do you have any suggestions?

Of course, the original data array being this large is another problem. Even if you have a lot of free memory, you may not be able to do one massive linear memory allocation, so dividing the data up into chunks could help. But I'd also like to avoid doing the extra memory allocation if possible.

SitronX commented 1 year ago

Hi @mlavik1

Normally that should be fine. However, I know there are some people using this plugin with relatively large datasets on memory-constrained platforms, where doing another such large allocation may cause problems...

Well one way is, that it could be just for users that have enough of memory. Users with low memory, would just use the old method?

Maybe it would be better to replace the original data array, with a NativeArray and just pass that directly to the jobs (and use a lock to make sure it's not modified elsewhere before the job finishes)?

Yep, this is a good way to avoid that issue from first point, to have everything in NativeArray. I dont know how you want to approach this, but in my solution i didnt launch all jobs at once, i still have it in stages. When one stage finishes, the jobs from next stage are only then started, so the same data is not affected by multiple jobs at once and thus the lock is not really needed. Launching all jobs at once with locks would also make progress reporting very hard.

However, Unity's default serialisation doesn't seem to work with native containers, and AFAIK there is no way to do custom serialisation, except using OnBeforeSerialize to copy the data to a normal array - which requires another large memory allocation... So switching to NativeArray would break serialisation. Do you have any suggestions?

Well, this serialisation you are talking about, is it regarding saving the data of the volumeObjects? In that case, i havent actually used it in my project, so i didnt look much into it. But i think Nativearrays and Nativecontainers should be serialisable, because there is a thing as NetCode for ECS which should heavily rely on serialisation of native containers to send data. I also found this where there are some examples of serialisation of NativeArrays even in NetCode for gameObjects. It is not the default way to serialise it, but can this be used in your case?

Of course, the original data array being this large is another problem. Even if you have a lot of free memory, you may not be able to do one massive linear memory allocation, so dividing the data up into chunks could help. But I'd also like to avoid doing the extra memory allocation if possible.

Ye, it is probably good idea to split the original massive array into multiple ones. Tbh i never had any problems with allocation myself, even when i used some massive datasets, but it doesnt hurt to split it just to be safe.

mlavik1 commented 1 year ago

Well one way is, that it could be just for users that have enough of memory. Users with low memory, would just use the old method?

Yes, I'm starting to think this is the best solution. There's going to be a lot more code to maintain though.. But it's probably worth it :)

Launching all jobs at once with locks would also make progress reporting very hard.

Oh yes, that's right. What I meant is that I might need a lock to make sure it doesn't break if the data array is modified from outside the jobs (it's public, and users are free to modify it as they want). But that shouldn't be a big issue anyway.

Well, this serialisation you are talking about, is it regarding saving the data of the volumeObjects?

I should probably have made that more clear, since there are different types of serialisation in Unity now :sweat_smile: What I was referring to is the serialisation of GameObjects and ScriptableObjects. In other words, when the datasets are saved in an asset or in the scene. The issue is not the actual serialisation of the NativeArray, but getting it to be stored with the asset/scene. It seems like there's still no nice way to customise this in Unity. The only solution appears to be to implement OnBeforeSerialize and copy the data over to a managed array that Unity knows how to serialise - thus duplicating the data.. NetCode seems to be a whole different world though - there you can customise the serialisation (for RPCs, etc.) a lot more.

I've thought a lot about it, but for now maybe the best solution is to:

Keep the old managed data array.
Use a temporary NativeArray for the jobs, where we copy over the data on demand.
Keep the old implementations (not using jobs) as a fallback if you can't allocate enough memory for the jobs.

Another thing I thought about is to write the data to a file in StreamingAssets, but it feels a bit over-kill and would be hard to maintain (delete when no longer referenced, etc..). Though, something like that would probably be needed for large time series dataset (which I've also been requested to implement haha). But for now, I suppose the above might be a good solution? I guess it's mostly the same as you have done, except that I will use the old implementation as a fallback.

SitronX commented 1 year ago

What I was referring to is the serialisation of GameObjects and ScriptableObjects. In other words, when the datasets are saved in an asset or in the scene. The issue is not the actual serialisation of the NativeArray, but getting it to be stored with the asset/scene. It seems like there's still no nice way to customise this in Unity. The only solution appears to be to implement OnBeforeSerialize and copy the data over to a managed array that Unity knows how to serialise - thus duplicating the data.. NetCode seems to be a whole different world though - there you can customise the serialisation (for RPCs, etc.) a lot more.

Aha, well sadly i dont have any experience with this, so i cant really help. Maybe some 3rd party open-source serializers exists for this purpose? Or maybe that Netcode could somehow help even if it looks intimidating? Sometimes it is not bad to just borrow a small part from ecosystems like this. Example is right here with Bursts and Jobs that are in DOTS ecosystem, but we purposely ignore the ECS part. It is just and idea though, i havent used NetCode yet so i have no idea how complex it is :D

But if there is no other way to do it, i guess duplicating the data with backup solution in case of low memory is the only way for now.

SnowWindSaveYou commented 1 year ago

Oh no, im sorry. My formulation of that sentence was confusing.

Oh, I got what you meant :) Maybe my reply was a bit confusing haha. What I meant is: 2:30 ->17 is a great improvement, even more than I would have expected from adding jobs&burst to postprocessing. So that's great news! Especially since I'm planning to experiment with slightly more computionally expensive gradient estimates. I'm also wondering if it would be useful to add support for dynamically updating the gradients based on clipping planes and transfer functions?

Do you want me to make a PR or do you want to do it yourself

I was thinking of starting on this maybe next week, but if you already have something that's almost ready for PR, then feel free to make one! 👍 I might end up modifying it a bit, since I'd like to do some clean-up work (splitting up the VolumeDataset , etc.. would probably be nice if that class only contains the data, and all the calculations are moved to separate classes).

I'm working on something else related to gradients and lighting now, that I'm going to merge in first: Adding gradient thresholds for lighting, to avoid shading parts of the volume where the gradients are near-zero, which creates a lot of noise. It will be an adjustable setting:

And a gradient threshold for the isosurface rendering mode, to allow you to filter out low-gradient parts of the volume:

Would this be relevant for your project? If so I'll add you as a reviewer, if you don't mind :)

Haha that is nice to hear :D Take these contributions as a big thanks for the amazing library of yours, that made it possible to make an app in reasonable time-window, that surgeons plan to use in real clinical environments

Thanks a lot! And wow, that's really exciting :D Don't hesitate to let me know if you run into any issues or limitations. i think is better to add one more square for this parameter
I found it difficult to adjust this number so I added an extra square in shader, after that i just realize later that editor already had a square (:з」∠)

mlavik1 commented 1 year ago

i think is better to add one more square for this parameter I found it difficult to adjust this number so I added an extra square in shader, after that i just realize later that editor already had a square (:з」∠)

Thanks for the feedback @SnowWindSaveYou - I think you're right. I'll give that a try when I'm back form holiday, thanks :)