noaa-oar-arl / monetio

The Model and ObservatioN Evaluation Tool I/O package
https://monetio.readthedocs.io
MIT License

Hysplit concentration read speedup #177

Open TAdeJong opened 4 months ago

TAdeJong commented 4 months ago

For my use case of reading relatively large hysplit concentration grids, I found readfile to be much slower than the hysplit simulation itself. Looking at the code, 93% of the time was spent on iterative calls to xr.merge. I had to debug a bit, because it seems that, at least for my output of hysplit v5.2.2, the species and levels were flipped in order. Building a list of lists and calling xr.merge and xr.concat on it yielded a speedup of roughly 15x, which is very worthwhile for me.
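To illustrate the difference between the two patterns, here is a minimal sketch. It is not the actual monetio reader: make_record, the species names, the levels, and the grid sizes are all hypothetical stand-ins for one record read from a concentration file. The slow path merges every record into a growing dataset; the fast path collects the records in nested lists, concatenates along the level dimension, and merges the species once at the end.

```python
import numpy as np
import xarray as xr


def make_record(species, level, ny=3, nx=4):
    """Hypothetical stand-in for one (species, level) grid read from file."""
    data = np.full((ny, nx), float(level))
    da = xr.DataArray(
        data,
        dims=("y", "x"),
        coords={"y": np.arange(ny), "x": np.arange(nx)},
        name=species,
    )
    # Add a length-1 level dimension so records can be combined along z.
    return da.expand_dims(z=[level])


species_names = ["SO2", "PM25"]  # assumed names, for illustration only
levels = [100, 500, 1000]        # assumed levels, for illustration only

# Slow pattern: merge each new record into a growing dataset.
# Every call re-aligns and copies everything accumulated so far.
iterative = xr.Dataset()
for sp in species_names:
    for lev in levels:
        iterative = xr.merge(
            [iterative, make_record(sp, lev)],
            compat="no_conflicts",
            join="outer",
        )

# Faster pattern: collect records first, then combine in bulk.
per_species = []
for sp in species_names:
    per_level = [make_record(sp, lev) for lev in levels]
    per_species.append(xr.concat(per_level, dim="z"))  # one concat per species
batched = xr.merge(per_species)  # one merge across species
```

Both paths should produce the same dataset; the batched version just performs far fewer combine operations, which matters more as the number of records grows.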

Tests are passing, but they do not actually cover this part of the code.

BTW: I think further speed up would be possible by lifting the conversion to a pandas dataframe out of the innermost function.

TAdeJong commented 4 months ago

OK, this is more complicated than I thought: this version does not work for some output I have. WIP.

TAdeJong commented 4 months ago

This now handles empty columns, however, there might be more cases to consider that I do not know of.

zmoon commented 4 months ago

Thanks @TAdeJong, this sounds beneficial. @amcz do you have any initial thoughts?

TAdeJong commented 4 months ago

I ran into another edge case and fixed that. By moving some of the logic out of the inner loop I got another factor of ~2 speedup. Looking at a line profile, a lot of time is still spent in xarray merging and pandas indexing. I suspect another order of magnitude could be won by pre-allocating the xr.DataArray objects and indexing the underlying arrays directly while reading records, but that would require a more major rewrite.
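A rough sketch of what the pre-allocation idea could look like is given below. It is not taken from this PR: the record layout (species index, level index, grid indices, value), the species names, and the grid sizes are all assumptions made for illustration. The point is only that the buffer is allocated once, filled by plain NumPy indexing, and wrapped in xarray a single time at the end.

```python
import numpy as np
import xarray as xr

species_names = ["SO2", "PM25"]  # assumed names, for illustration only
levels = [100, 500, 1000]        # assumed levels, for illustration only
ny, nx = 3, 4                    # assumed grid size

# Hypothetical record stream: (species idx, level idx, iy, ix, value).
records = [
    (0, 0, 1, 2, 1.5),
    (0, 2, 0, 0, 2.5),
    (1, 1, 2, 3, 4.0),
]

# Allocate the full array once; grid points without a record stay zero.
buf = np.zeros((len(species_names), len(levels), ny, nx))

# Fill by direct integer indexing: no merging or label lookups per record.
for isp, ilev, iy, ix, value in records:
    buf[isp, ilev, iy, ix] = value

# Wrap the finished buffer in xarray exactly once.
ds = xr.Dataset(
    {sp: (("z", "y", "x"), buf[i]) for i, sp in enumerate(species_names)},
    coords={"z": levels, "y": np.arange(ny), "x": np.arange(nx)},
)
```

The trade-off is that the full grid shape has to be known before reading the records, which is presumably why this would need a larger rewrite of the reader.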

amcz commented 3 months ago

Thanks! The reader could use some improvements and I appreciate this work on it. I will have time to review it and pull it into the hysplit development branch sometime in early or mid August.