scipp / essreflectometry

Reflectometry data reduction for the European Spallation Source
https://scipp.github.io/essreflectometry/
BSD 3-Clause "New" or "Revised" License

Workflow guidelines in amor reduction #48

Closed nvaytet closed 3 weeks ago

nvaytet commented 1 month ago

Making a list of things to do before I forget:

  1. Use an AmorWorkflow with default parameters and providers, instead of talking about providers in the notebook.

  2. Naming (in accordance with SANS and diffraction workflows):

    • Run -> RunType
    • Sample -> SampleRun
    • What to do about FilePath vs dedicated functions to provide file names
    • RawDetector -> LoadedNeXusDetector ?
    • RawEvents -> RawDetectorData
    • ChopperCorrectedTofEvents -> ReducibleDetectorData ?
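For reference, the renames in point 2 amount to defining new domain-type aliases. A hypothetical sketch (the actual definitions live in the ESS packages and may be generic over run types):

```python
from typing import NewType

# Proposed names, mirroring the SANS and diffraction workflows
# (hypothetical sketch -- real definitions may be generic over RunType)
SampleRun = NewType('SampleRun', int)
RawDetectorData = NewType('RawDetectorData', object)
ReducibleDetectorData = NewType('ReducibleDetectorData', object)

# A NewType gives a distinct name for type-based dispatch in the
# workflow graph, while remaining a plain value at runtime.
run = SampleRun(608)
```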
SimonHeybrock commented 1 month ago

What to do about FilePath vs dedicated functions to provide file names

FilePath should be removed, see latest version of guidelines (not in published docs yet).

jokasimr commented 4 weeks ago

I think the guidelines are a bit unclear on the FilePath point. They don't mention file paths much, but they do say Filename can be used for either names or paths, which I find confusing.

Ultimately, any file identifier (be it a repository + name, a hash, a Scicat ID, or whatever) will be resolved to a path in the file system; to my mind it is natural to call that path a FilePath and not a Filename.

It's also easier if we make FilePath mean only one thing: a path on the file system. It shouldn't mean "either a local file or a file identifier or something else".
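The "one meaning" idea can be sketched like this (names hypothetical): each identifier type gets its own resolver, and FilePath is always the end product, a concrete path on the local file system.

```python
import os
from typing import NewType

# FilePath means exactly one thing: a path on the local file system
FilePath = NewType('FilePath', str)

def resolve_local(name: str) -> FilePath:
    # a local name resolves trivially to an absolute path;
    # other identifiers (hash, SciCat ID, ...) would need real resolvers
    return FilePath(os.path.abspath(name))
```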

jokasimr commented 4 weeks ago

ChopperCorrectedTofEvents -> ReducibleDetectorData

I'm not fond of ReducibleDetectorData. It seems more complicated than the old name. I agree the old name is bad, but how about TofDetectorData?

Generally I don't think it's good to name anything ...Data because it just doesn't communicate much (what isn't data?). In this context I'd prefer ...Events or ...EventData to make it clear we are talking about a specific kind of data.

jokasimr commented 4 weeks ago

Alternatively, we just get rid of ChopperCorrectedTofEvents altogether and make the frame unwrapping part of the add_coordinates function.

The EventData contains all data from before the frame unwrapping anyway, right? Or should we drop the event_time_offsets after computing tof?

nvaytet commented 4 weeks ago

@jokasimr

The ChopperCorrectedTof* felt quite Amor-specific to me. Also, I'm not sure "Chopper" needs to be in the name. ReducibleDetectorData actually implies more than just containing correct tof values: in other workflows it also implies that all the coordinates needed to proceed with the reduction (sample position, source position, etc.) are present.

It's just what it's called in other workflows, so I would like to standardize it all instead of having each package invent its own naming.

About the Events part: I don't know if it's good to be explicit that at this point we have events (but then everything should be called *Events), or to remain general for cases where we may have histogrammed data early on (we don't have examples of this for now, but maybe we will at some point?). We may also have workflows that are required to work with both event and histogrammed data (I think it is still unclear whether all monitors will record events, so it would be difficult to name something IncidentMonitorEvents).

In addition, as an example, say I have events that are binned into pixels, and then I apply masks on the pixels. Currently such data is named MaskedData; changing this to MaskedEvents may be a little misleading, as it would suggest the masks are on the events.

I do get the point about

what isn't data?

nvaytet commented 4 weeks ago

FilePath should be removed, see latest version of guidelines (not in published docs yet).

@SimonHeybrock this was after a discussion I had with @jokasimr where we discussed how to handle different sources for files. Previously, we had settled on just having a Filename and then if we want something for a tutorial data, we called a special function which would give us the (pooch) path to the file.

His alternative solution was to have something like

* `LocalFilename` for local files, with a provider that converts `LocalFilename` to a `FilePath`

* `TutorialFilename` for tutorial (pooch) files, with a provider that would also yield `FilePath`

* `ScicatFilename` for Scicat files, with a provider `ScicatFilename` -> `FilePath`

So then, depending on which one you set on your workflow/params, it would pick the appropriate provider. It's quite a nice solution, but it means you have to remember which one you need to set, instead of having just one name to remember. It does remove the need for creating a function per file for tutorial files, but there was some argument (in another discussion) in favor of having these functions, as they avoid hard-coding file names or run numbers in tests and docs.

I guess the question is, what was our plan for handling Scicat files with our current solution?

jokasimr commented 4 weeks ago
  • LocalFilename for local files, with a provider that converts LocalFilename to a FilePath

This seems unnecessary since the "conversion" would be just the identity function?

SimonHeybrock commented 4 weeks ago

FilePath should be removed, see latest version of guidelines (not in published docs yet).

@SimonHeybrock this was after a discussion I had with @jokasimr where we discussed how to handle different sources for files. Previously, we had settled on just having a Filename and then if we want something for a tutorial data, we called a special function which would give us the (pooch) path to the file.

His alternative solution was to have something like

* `LocalFilename` for local files, with a provider that converts `LocalFilename` to a `FilePath`

* `TutorialFilename` for tutorial (pooch) files with a provider that would also yield `FilePath`

* `ScicatFilename` for scicat files, with provider `ScicatFilename` -> `FilePath`

So then, depending on which one you set on your workflow/params, it would pick the appropriate provider. It's quite a nice solution, but it means you have to remember which one you need to set, instead of having just one name to remember. It does remove the need for creating a function per file for tutorial files, but there was some argument (in another discussion) in favor of having these functions, as they avoided hard-coding file names or run numbers in tests and docs.
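A minimal sketch of that proposal (provider bodies are hypothetical stand-ins): three filename types, each with its own provider, all yielding the same FilePath.

```python
from typing import NewType

FilePath = NewType('FilePath', str)
LocalFilename = NewType('LocalFilename', str)
TutorialFilename = NewType('TutorialFilename', str)
ScicatFilename = NewType('ScicatFilename', str)

def local_file(name: LocalFilename) -> FilePath:
    # the "conversion" for local files is essentially the identity
    return FilePath(name)

def tutorial_file(name: TutorialFilename) -> FilePath:
    # stand-in for fetching via pooch and returning the cached path
    return FilePath(f'/cache/pooch/{name}')

def scicat_file(name: ScicatFilename) -> FilePath:
    # stand-in for downloading from SciCat and returning the local path
    return FilePath(f'/cache/scicat/{name}')
```

Which provider fires would then depend on which parameter type the user sets on the workflow.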

I guess the question is, what was our plan for handling Scicat files with our current solution?

What is wrong with the solution in ESSsans and the Guidelines (on main)?

Scicat files could be handled like tutorial files, e.g., as here: https://scipp.github.io/esssans/user-guide/isis/zoom.html#Configuring-data-to-load. But no ultimate decision has been made.

The solution with having multiple filename types seems very invasive and non-obvious? It sounds like it is going back to the old approach, which was confusing.

SimonHeybrock commented 4 weeks ago

@jokasimr Here is the conclusion from the previous discussion on this topic, based on which the decision for this approach was made: https://github.com/scipp/esssans/pull/136#discussion_r1587527172.

jokasimr commented 4 weeks ago

What is wrong with the solution in ESSsans and the Guidelines (on main)?

With the approach @nvaytet suggested, we get parallel file downloads for free with dask as part of the pipeline; that's quite nice. Another benefit is that we're working within the declarative framework of sciline instead of stepping outside of it. It also seems quite verbose to create a function for each file you're using in the tutorial workflows, but that's maybe less important. I still think Filename is a bad name for what is actually a file path.
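The parallelism claim can be illustrated outside of sciline/dask with a plain thread pool (a sketch; the download function is a hypothetical stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def download(name: str) -> str:
    # stand-in for a provider that fetches a remote file and
    # returns the local cache path
    return f'/cache/{name}'

# the file-path providers are independent graph nodes, so a parallel
# scheduler (e.g. dask under sciline) can run them concurrently
with ThreadPoolExecutor() as pool:
    paths = list(pool.map(download, ['run_608.nxs', 'run_609.nxs']))
```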

The solution with having multiple filename types seems very invasive and non-obvious? It sounds like it is going back to the old approach, which was confusing.

Wasn't the old approach confusing because it tried to hide the difference between local files and remote files? This approach doesn't do that.

If it has been decided we do it in the other way (the one currently in the guidelines) then I'm fine with that.

SimonHeybrock commented 4 weeks ago

What is wrong with the solution in ESSsans and the Guidelines (on main)?

With the approach @nvaytet suggested we get parallel file downloads for free

Not clear this is positive? It also means side effects, potential race conditions, ...

Another benefit is that we're working in the declarative framework of sciline instead of stepping outside of it. It also seems quite verbose to create a function for each file you're using in the tutorial workflows, but that's maybe less important.

... but you need to write multiple providers, and modify the workflow (by inserting different providers) depending on the file sources?

The solution with having multiple filename types seems very invasive and non-obvious? It sounds like it is going back to the old approach, which was confusing.

Wasn't the old approach confusing because it tried to hide the difference between local files and remote files? This approach doesn't do that.

Maybe I do not understand what you suggest. If you have multiple filename types, I suppose you need to make dedicated providers for each, and then let the user (or the workflow wrapper?) insert the correct ones?

If it has been decided we do it in the other way (the one currently in the guidelines) then I'm fine with that.

Great 👍

jokasimr commented 4 weeks ago

Not clear this is positive? It also means side-effects, potential race-conditions, ...

Not sure that would be a problem in practice if we design the file download providers sanely. And the benefits of parallel file downloads can be quite large.

... but you need to write multiple providers,

Yes, one provider per file source. But not one per file.

Maybe I do not understand what you suggest. If you have multiple filename types, I suppose you need to make dedicated providers for each, and then let the user (or the workflow wrapper?) insert the correct ones?

I'd suggest inserting all providers from the start. Then the user doesn't have to insert the correct one; the correct one will be used based on the parameters the user has provided.

jokasimr commented 4 weeks ago

@nvaytet pointed out the misunderstanding here. I was thinking a pipeline could have multiple providers producing the same type of output, but that's of course not the case and it never has been. Sorry about the confusion. The user would have to insert the right one manually.
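The constraint can be illustrated with a plain dict keyed by return type (a sketch of the idea, not sciline's actual implementation): since the table has one slot per output type, registering a second FilePath provider displaces the first, so the user must pick one explicitly.

```python
from typing import NewType, get_type_hints

FilePath = NewType('FilePath', str)

def from_local(name: str) -> FilePath:
    return FilePath(name)

def from_scicat(pid: str) -> FilePath:
    return FilePath(f'/cache/scicat/{pid}')

providers: dict = {}

def insert(table: dict, func) -> None:
    # one provider per output type: inserting a second FilePath
    # provider overwrites the first
    table[get_type_hints(func)['return']] = func

insert(providers, from_local)
insert(providers, from_scicat)
# only one FilePath provider survives; from_local was displaced
```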