pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0
1.08k stars 298 forks source link

Documentation takes too long to build #2864

Closed djhoese closed 2 months ago

djhoese commented 4 months ago

We've been noticing that documentation on readthedocs has been timing out. Off the top of my head I believe the last time I checked RTD gives free projects 15 minutes for a build (30 minutes if you pay). The area preview generation (using cartopy) is taking a little over 4 minutes by itself (that's my local machine, not a cheap CI runner that RTD might be using). On top of that the resulting HTML file (see below) is ~23MB in size because it has all the SVG images embedded. I don't know how others feel but that seems like a lot. I mention the size because my initial idea for fixing the HTML representation time was to just pre-create the HTML of the areas and save it to git and detect when it needs to be updated.

https://satpy.readthedocs.io/en/stable/resample.html#area-definitions-included-in-satpy

Here's the timing on time make html on my local machine with the cartopy shapefiles already downlaoded and cached:

Before generating reader table:  2024-07-30T14:52:17.541998
After generating reader table:  2024-07-30T14:52:18.705411
After reading area YAML:  2024-07-30T14:52:18.783992
Before iterating over areas:  2024-07-30T14:52:18.784044
After iterating over areas:  2024-07-30T14:52:18.970641
After writing areas:  2024-07-30T14:56:28.425944

real    6m57.167s
user    6m55.319s
sys 0m2.993s

So the last two prints before the time output are the HTML representation writing. So ~66% of the documentation rendering is the area repr. Another big chunk of the remaining 3 minutes is likely API documentation generation.

djhoese commented 4 months ago

As expected if I comment out the html += map_section(area) portion of the HTML formatting in pyresample the HTML list of areas is generated in less than a second. So purely cartopy map generation here.

sfinkens commented 4 months ago

Is it possible to generate those maps in parallel?

djhoese commented 4 months ago

Good idea. Probably. Right now this all happens as a side effect of sphinx loading the conf.py configuration file. To make it a little cleaner with fewer unintended consequences we should probably put this in a sphinx extension module to be loaded by sphinx specifically. I've done this for "roles" used in the .rst files, but never any of the other types of extensions. Actually, now that I say that I wonder if sphinx has a parallel build plugin available...

djhoese commented 4 months ago

Ah there is a -j option to specify sphinx number of jobs.

djhoese commented 3 months ago

Just out of curiosity I started looking at what the cartopy/matplotlib code is doing and wanted to see if it could be sped up itself. First I thought of dpi to make lower resolution images, but we're saving to SVG so that doesn't work. Then I looked at what features were added in the cartopy steps and it is OCEAN, LAND, COASTLINE, and BORDERS. Just to see, I took out the first 3. Although the images are not useful without coastlines, it made the overall documentation generation on my laptop go from 7m17s to 2m24s. I wonder which step is taking the longest. My guess would be the land and ocean steps. Let's see...

Edit: 2m43s with only coastlines and borders and images are much more usable.

Edit 2: 4m51s with coastlines, borders, and oceans (no land).

mraspaud commented 3 months ago

From our discussion with @BENR0 when he implemented this, the Ocean is problematic for at least the GEOS projection. So I would suggest removing ocean for now.

djhoese commented 3 months ago

@mraspaud Ok. I thought he fixed that in one of the recent pyresample PRs though?

Regardless, I did some experiments this weekend with geoviews and bokeh (the backend that geoviews can use). My solution isn't complete yet and is very hacky, but it takes ~3 minutes to render the whole documentation and produces javascript maps instead of SVGs. The maps could be interactive, but I don't think it makes sense for them to be. I might push my work to a branch and leave it alone for a while, but I'd also like to post on the geoviews or bokeh discussion forums and ask for advice on how best to implement what we want. Specifically we have an HTML template for each area that we want to slip into a restructuredtext document (section headers, etc) in our sphinx documentation and where and how files are included gets complicated.

djhoese commented 3 months ago

@BENR0 @mraspaud I'm not sure if I should open a PR in pyresample to discuss this, but since I don't have a PR I thought I'd bring it up here. So as mentioned above I have a hacky bokeh version of the maps working, but it doesn't include all the attribute and HTML work done in the current functionality in pyresample. Additionally, adding the bokeh functionality to pyresample would require restructuring how things work. As an alternative, Martin had suggested just turning off OCEAN features in the maps to speed things up, but this would require a new pyresample release anyway to either disable it or add the necessary kwargs to allow for controlling what features are plotted (not currently available). So I'm not sure what effort I should put my limited amount of time toward.

For further details on the bokeh refactoring needed: The best option I've found for bokeh is to basically do this:

for aname, params in area_dict.items():
    params = area_dict[aname]
    area_def = _create_area_def_from_dict(aname, params)
    if not hasattr(area_def, "to_cartopy_crs"):
        continue
    crs = area_def.to_cartopy_crs()

    features = gv.Overlay([gf.ocean, gf.land, gf.borders, gf.coastline])
    f = gv.render(features.opts(toolbar=None, default_tools=[], projection=crs, xlim=crs.bounds[:2], ylim=crs.bounds[2:]), backend="bokeh")
    areas_bokeh_models[aname] = f

script, divs_dict = components(areas_bokeh_models)

Where the script in the last line is a single large javascript object with all the plot information and should be placed in the HTMl. The divs_dict is a dictionary of area name -> HTML "

" strings for each map. So the main difference to the existing pyresample interface is that I think it performs best to process all areas at once rather than individual (area_repr in pyresample/_formatting_html.py).

This all ignores the fact that really pyresample should probably be including a sphinx extension to generate these things as a new restructuredtext role/directive. For example:

.. pyresample-geometry::
   :area_file: areas.yaml
   :area_name: msg_something_something
   :backend: bokeh
simonrp84 commented 3 months ago

I see the code is doing things like: ax.add_feature(cartopy.feature.OCEAN)

Would it help to reduce the resolution? ax.add_feature(cartopy.feature.OCEAN.with_scale("110m")) I'm not sure what resolution is used by cartopy as the default

djhoese commented 3 months ago

As far as I could tell that is the default. I had the same hope.

djhoese commented 3 months ago

I've made some major changes to my #2875 with bokeh plotting. It requires the pyresample PR to be released though (https://github.com/pytroll/pyresample/pull/615).