pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0

Print filenames generated by writers to stdout #419

Open djhoese opened 6 years ago

djhoese commented 6 years ago

Is your feature request related to a problem? Please describe. This is something that @rayg-ssec has been emphasizing the last couple months (years?) for best practice and reusability of software/scripts. If I want to run my script and have the created files used by another script I probably have to resort to glob'ing a directory (assuming I created a new directory just for this output).

Describe the solution you'd like Somewhere in the writer system I think we could have writers call print(filename) for each file they create. This way another script (bash, etc) can parse this output and operate on the files generated. Some questions:

  1. Full paths or just the filename? I suppose full paths would be best and most usable (@rayg-ssec?)
  2. Print when the file is fully generated and complete or when the filename is first determined? Probably fully generated although this may be harder to code with dask, especially if users delay computations.
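A minimal sketch of the "full path, printed once the file is complete" variant of the two questions above. The helper name `write_and_announce` and its `stream` parameter are hypothetical and not part of satpy; the sketch also assumes an eager write, so it does not address the dask case where computation is delayed.

```python
import os
import sys
import tempfile


def write_and_announce(data, filename, stream=None):
    """Write ``data`` to ``filename`` and print its absolute path once done.

    Hypothetical helper for illustration only; not a satpy API.
    """
    stream = sys.stdout if stream is None else stream
    path = os.path.abspath(filename)
    with open(path, "wb") as fobj:
        fobj.write(data)
    # Printed only after the file has been closed, so a downstream reader
    # never sees the name of a partially written file.
    print(path, file=stream, flush=True)
    return path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    write_and_announce(b"fake image bytes", os.path.join(out_dir, "example.tif"))
```

Printing one path per line keeps the output easy to consume from a shell loop or a tool such as xargs.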

Describe any changes to existing user workflow Overall this shouldn't require changes to normal user workflow. This may be strange

Additional python or other dependencies None.

Describe any changes required to the build process None.

Describe alternatives you've considered We used to handle this type of thing by having writers return a list of filenames that were created, but now that we are using dask we have to return the Delayed objects or source/target objects. I'm not sure it would be possible to return anything else and keep the interface "simple". There is also the option to make this printout optional (default to True?).

pnuu commented 6 years ago

Traditionally PyTroll packages have communicated this kind of information via posttroll messages. I see that this can be "too heavyweight" for non-operational stuff, and I guess printing to stdout won't create anything hazardous for message-based communication.

Another choice that came to my mind was that the writers would return the filenames so that the user could print them if wanted, but that doesn't work with lazy operations.

djhoese commented 6 years ago

How do the posttroll messages get configured and then sent? I believe @rayg-ssec has something similar that I think follows the general *nix style of simple single-purpose command line tools where you just pipe information into a separate script. As a completely made up example:

python my_satpy_script.py | geotiff_update --dest 192.168.1.2 --platform NPP --sensor VIIRS

> Another choice that came to my mind was that the writers would return the filenames so that the user could print them if wanted, but that doesn't work with lazy operations.

Yes exactly. I'm not sure how to modify the dask usage in the writers and get the same behaviors. We could specify a callback option but I'm not a huge fan of that either. It seems un-bash-like to deal with callbacks, but it seems un-pythonic to print stuff to stdout that the user can't access from within python.
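A rough sketch of what the callback option could look like. A plain zero-argument function stands in for a dask Delayed object here, and `make_lazy_write` and `on_written` are hypothetical names, not satpy or dask APIs. The same hook can serve both styles: pass `print` to get filenames on stdout, or a list's `append` to collect them from within Python.

```python
import os
import tempfile


def make_lazy_write(data, path, on_written=None):
    """Return a zero-argument callable that writes ``data`` when called.

    Stand-in for a dask Delayed object (an assumption for illustration):
    nothing is written until the returned function is invoked.
    """
    def _compute():
        with open(path, "wb") as fobj:
            fobj.write(data)
        # Notify the caller only after the file is complete.
        if on_written is not None:
            on_written(path)
        return path
    return _compute


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    created = []
    lazy = make_lazy_write(b"abc", os.path.join(out_dir, "a.bin"),
                           on_written=created.append)
    assert created == []  # nothing has been written yet
    lazy()                # the write happens here
    print(created[0])
```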

pnuu commented 6 years ago

Typically the messaging is configured in the production system, in our case trollflow with trollflow-sat plugins, or earlier in trollduction (only mpop as data interface). Yeah, no real documentation, just example configs...

Our whole workflow is based on messaging: messages are generated for each new file (pyinotify, polling or directly from direct-readout) and possibly collected together to cover the target area (granulated and/or segmented data) by pytroll-collectors, then passed to trollflow, which again can send messages for further processing steps.

gerritholl commented 3 years ago

Alternatives to writing to stdout:

I think the filenames are or can be calculated without computing the data, even when the actual writing is postponed, correct?

And for operational use through trollflow2, there apparently already is a solution, because trollflow2 can broadcast the filenames with the FilePublisher plugin?

djhoese commented 3 years ago

To me, especially since you can write multiple times with multiple calls to save_dataset/s, the first option of storing it in the Scene should be avoided. It puts some weird statefulness in the Scene.

I think returning the filenames is a good idea and maybe something that needs to be added. It adds multiple possible return values and cases to save_dataset/s and will likely require changes to the writers, but it is the most straightforward out of the possible options and is more pythonic even if multiple return values have a weird code smell. Maybe we create a WriterResult class or namedtuple type thing that handles all of the logic for this. It could contain the filename, a source dask array, a target array-like object for da.store, or a dask delayed object to be computed. A problem might arise though with each writer having control over what a delayed object does or even where one target array-like object is actually writing each chunk to a separate file. In these cases I guess a WriterResult object would have to allow for multiple filenames. I suppose there would also have to be handling or at least an expectation from the user that filenames are not guaranteed to be included.
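One possible shape for such a container, as a sketch only. No WriterResult class is confirmed to exist in satpy; the field names and the `needs_compute` property below are assumptions made for illustration. The `filenames` field is a tuple so that it can hold zero, one, or several names, matching the caveats above.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class WriterResult:
    """Hypothetical return value for a writer; not an existing satpy class."""

    # May be empty: filenames are not guaranteed to be known.
    filenames: Tuple[str, ...] = ()
    # Source dask array and target array-like object, e.g. for da.store.
    source: Optional[Any] = None
    target: Optional[Any] = None
    # A delayed object still to be computed, if the writer produced one.
    delayed: Optional[Any] = None

    @property
    def needs_compute(self) -> bool:
        """True if work is still pending before the files are complete."""
        return self.delayed is not None or self.source is not None


if __name__ == "__main__":
    result = WriterResult(filenames=("/tmp/a.tif", "/tmp/b.tif"))
    for name in result.filenames:
        print(name)
```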

Regarding calculating the filenames, yes it can be done without the data being computed. However, it is writer dependent what these filenames look like so the writers still need to be the ones generating them. And it isn't even as simple as calling each writer's get_filename, as some writers (e.g. awips_tiled) will generate the filename pattern and format values after doing some computation, and that work shouldn't be duplicated.
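For the simple case, filling a filename pattern needs only metadata, not the computed data. The sketch below uses plain `str.format`; the pattern, the attribute names, and the helper `filename_from_metadata` are made up for illustration and do not represent any particular writer's get_filename. It does not cover writers like awips_tiled that derive format values from computation.

```python
from datetime import datetime


def filename_from_metadata(pattern, **attrs):
    """Fill a filename pattern from metadata alone (hypothetical helper)."""
    return pattern.format(**attrs)


if __name__ == "__main__":
    # Hypothetical pattern and attributes; no image data is touched.
    name = filename_from_metadata(
        "{name}_{start_time:%Y%m%d_%H%M%S}.tif",
        name="true_color",
        start_time=datetime(2018, 9, 1, 12, 30, 0),
    )
    print(name)  # true_color_20180901_123000.tif
```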

For trollflow2, my guess based on past conversations with @pnuu is that it is using inotify to "detect" when new files are created for writing.