scipp / essreflectometry

Reflectometry data reduction for the European Spallation Source
https://scipp.github.io/essreflectometry/
BSD 3-Clause "New" or "Revised" License

Re-implement Orso file writing #6

Closed: jokasimr closed this issue 4 months ago

jokasimr commented 8 months ago

The Sciline version of the workflow does not do the Orso file writing that was done previously.

An Orso-file provider should be added to allow users to request an Orso file describing the steps of the workflow.
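
For illustration only, here is a minimal sketch of what "requesting an Orso file" via a provider could look like (the names OrsoHeader and build_orso_header are made up, not part of the package):

from typing import NewType

from orsopy import fileio

# Hypothetical domain type representing the assembled ORSO header.
OrsoHeader = NewType('OrsoHeader', fileio.orso.Orso)

def build_orso_header() -> OrsoHeader:
    # A real provider would fill this in from workflow metadata.
    return OrsoHeader(fileio.orso.Orso.empty())

# pipeline.get(OrsoHeader) would then return the header alongside the data.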

SimonHeybrock commented 5 months ago

Next step:

jl-wynen commented 5 months ago

Enumerating all information in detail here makes little sense because there is a lot. See https://orsopy.readthedocs.io/en/latest/modules.html

But here is a high-level overview. (As far as ORSO is concerned, everything is optional.)

I think the only technical challenge is the very last part.

jl-wynen commented 5 months ago

To clarify what I mean by tagging providers, here is an example:


from typing import NewType

import sciline

RawData = NewType('RawData', int)
CorrectedData = NewType('CorrectedData', int)
ReducedData = NewType('ReducedData', int)

# Attach arbitrary metadata tags to a provider function.
def tag(*tags):
    def impl(func):
        func.__tags__ = tags
        return func
    return impl

def load() -> RawData:
    return RawData(3)

@tag({"correction": "correct something"})
def correct(raw: RawData) -> CorrectedData:
    return CorrectedData(raw // 2)

def reduce(corrected: CorrectedData) -> ReducedData:
    return ReducedData(corrected + 3)

pl = sciline.Pipeline((load, correct, reduce))
tg = pl.get(ReducedData)
g = tg._graph
for p, *_ in g.values():
    print(p, getattr(p, '__tags__', ()))

I.e., the correct function is tagged as a 'correction', which can be picked up by the code that builds the ORSO object.

This is pretty generic and doesn't interfere with Sciline. But we may need a better name than __tags__.

Without the tag, we couldn't tell from the pipeline alone what to put into orso.reduction.corrections. Alternatively, we could hard code which functions count as corrections and search for them in the graph. But that seems too inflexible.
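
For illustration, a rough sketch (reusing the example above, including its access to the internal _graph attribute) of how the Orso-building code could collect the tagged corrections:

def collect_corrections(task_graph) -> list:
    # Walk the providers in the task graph and pick up any 'correction' tags.
    corrections = []
    for provider, *_ in task_graph._graph.values():
        for t in getattr(provider, '__tags__', ()):
            if 'correction' in t:
                corrections.append(t['correction'])
    return corrections

# E.g. orso.reduction.corrections = collect_corrections(pl.get(ReducedData))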

SimonHeybrock commented 5 months ago

To clarify: I meant to enumerate everything that was implemented in the ess package.

jl-wynen commented 5 months ago

To clarify: I meant to enumerate everything that was implemented in the ess package.

In the notebook:

owner = fileio.base.Person(
    'Jochen Stahn', 'Paul Scherrer Institut', 'jochen.stahn@psi.ch'
)
sample = fileio.data_source.Sample(
    'Ni/Ti Multilayer', 'gas/solid', 'air | (Ni | Ti) * 5 | Si'
)
creator = fileio.base.Person(
    'Andrew R. McCluskey', 'European Spallation Source', 'andrew.mccluskey@ess.eu'
)

orso = make_orso(
    owner=owner,
    sample=sample,
    creator=creator,
    reduction_script='https://github.com/scipp/ess/blob/main/docs/instruments/amor/amor_reduction.ipynb',
)

In the Amor module:

orso = fileio.orso.Orso.empty()
orso.data_source.experiment.probe = 'neutrons'
orso.data_source.experiment.facility = 'Paul Scherrer Institut'
orso.data_source.measurement.scheme = 'angle- and energy-dispersive'
orso.reduction.software = fileio.reduction.Software(
    'scipp-ess', __version__, platform.platform()
)
orso.reduction.timestamp = datetime.now()
orso.reduction.corrections = []
orso.reduction.computer = platform.node()
orso.columns = [
    fileio.base.Column('Qz', '1/angstrom', 'wavevector transfer'),
    fileio.base.Column('R', None, 'reflectivity'),
    fileio.base.Column('sR', None, 'standard deviation of reflectivity'),
    fileio.base.Column(
        'sQz', '1/angstrom', 'standard deviation of wavevector transfer resolution'
    ),
]
orso.data_source.owner = owner
orso.data_source.sample = sample
orso.reduction.creator = creator
orso.reduction.script = reduction_script

And it automatically sets corrections and wavelength and scattering angle ranges during the workflow.

SimonHeybrock commented 5 months ago

And it automatically sets corrections and wavelength and scattering angle ranges during the workflow.

'It' being the workflow, 'setting' things in the Orso? Which corrections?

jl-wynen commented 5 months ago

The various functions used in the workflow set ORSO fields via the attrs of the data arrays. So the workflow itself, as far as the user is concerned, doesn't have to handle metadata; the metadata is tracked automatically.

The Reduction field of an ORSO header contains a list of 'corrections'. I don't know exactly what would be included in that. But the current workflow lists things like supermirror normalisation or monitor ToF correction.
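
For example, the old mechanism looked roughly like this (a sketch only; the attribute name 'orso' and the body of the correction are assumptions):

import scipp as sc

def footprint_correction(data: sc.DataArray) -> sc.DataArray:
    out = data.copy()  # stand-in for the actual footprint correction
    # The ORSO header travels with the data in its attrs, so each step can
    # record itself without the user handling any metadata.
    out.attrs['orso'].value.reduction.corrections += ['footprint correction']
    return out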

SimonHeybrock commented 5 months ago

So the workflow itself, as far as the user is concerned, doesn't have to handle metadata; the metadata is tracked automatically.

In the old implementation, yes. I think the plan here was to reconsider that. That is why we would like a list of things that are written by the old workflow.

jl-wynen commented 5 months ago

So the workflow itself, as far as the user is concerned, doesn't have to handle metadata; the metadata is tracked automatically.

In the old implementation, yes. I think the plan here was to reconsider that. That is why we would like a list of things that are written by the old workflow.

:point_down:

To clarify: I meant to enumerate everything that was implemented in the ess package.

In the notebook:

owner = fileio.base.Person(
    'Jochen Stahn', 'Paul Scherrer Institut', 'jochen.stahn@psi.ch'
)
sample = fileio.data_source.Sample(
    'Ni/Ti Multilayer', 'gas/solid', 'air | (Ni | Ti) * 5 | Si'
)
creator = fileio.base.Person(
    'Andrew R. McCluskey', 'European Spallation Source', 'andrew.mccluskey@ess.eu'
)

orso = make_orso(
    owner=owner,
    sample=sample,
    creator=creator,
    reduction_script='https://github.com/scipp/ess/blob/main/docs/instruments/amor/amor_reduction.ipynb',
)

In the Amor module:

orso = fileio.orso.Orso.empty()
orso.data_source.experiment.probe = 'neutrons'
orso.data_source.experiment.facility = 'Paul Scherrer Institut'
orso.data_source.measurement.scheme = 'angle- and energy-dispersive'
orso.reduction.software = fileio.reduction.Software(
    'scipp-ess', __version__, platform.platform()
)
orso.reduction.timestamp = datetime.now()
orso.reduction.corrections = []
orso.reduction.computer = platform.node()
orso.columns = [
    fileio.base.Column('Qz', '1/angstrom', 'wavevector transfer'),
    fileio.base.Column('R', None, 'reflectivity'),
    fileio.base.Column('sR', None, 'standard deviation of reflectivity'),
    fileio.base.Column(
        'sQz', '1/angstrom', 'standard deviation of wavevector transfer resolution'
    ),
]
orso.data_source.owner = owner
orso.data_source.sample = sample
orso.reduction.creator = creator
orso.reduction.script = reduction_script

Plus these in the various functions in reflectometry and amor:

reduction.corrections += ['supermirror calibration']
reduction.corrections += ['chopper ToF correction']
reduction.corrections += ['gravity correction']
reduction.corrections += ['footprint correction']
reduction.corrections += ['total counts']
instrument_settings.wavelength = fileio.base.ValueRange(
    float(data_array_wav.coords['wavelength'].min().value),
    float(data_array_wav.coords['wavelength'].max().value),
    unit,
)
instrument_settings.incident_angle = fileio.base.ValueRange(
    float(data_array_theta.coords['theta'].min().value),
    float(data_array_theta.coords['theta'].max().value),
    data_array_theta.bins.coords['theta'].min().unit,
)

Plus these in the amor notebook:

data_source.measurement.comment = 'Pixel positions corrected'

Plus these in the offspec reduction notebook:

reduction.corrections += ['region of interest defined as spectrum 389:415']
reduction.corrections += ['monitor background subtraction, background above 15 Å']
reduction.corrections += ['normalisation by summed monitor']
reduction.corrections += ['normalisation by direct beam']
data_source.measurement.instrument_settings.wavelength = fileio.base.ValueRange(
    min=float(wavelength.min), max=float(wavelength.max), unit=wavelength.unit
)
data_source.measurement.instrument_settings.incident_angle = fileio.base.Value(
    magnitude=float(incident_angle.magnitude), unit=incident_angle.unit
)

And, of course, the actual data.

SimonHeybrock commented 5 months ago

For the corrections, how about adding a provider with optional dependencies, which "detects" the presence of certain input parameters or intermediates:

def detect_correction(footprint: Optional[FootprintCorrectedData[Sample]]) -> OrsoCorrections:
    if footprint is not None:
        ...

Now, technically the presence of FootprintCorrectedData[Sample] in the pipeline does not guarantee that it is in the path that leads to the reduced data (for that you would actually need to analyze the task graph). But is that actually a problem in practice, if we set up the pipeline adequately?
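
For concreteness, a slightly fuller sketch of that idea (assuming OrsoCorrections is just a list of strings, that FootprintCorrectedData[Sample] is the type discussed above, and that Sciline passes None when the optional input cannot be computed):

from typing import NewType, Optional

OrsoCorrections = NewType('OrsoCorrections', list)

def detect_corrections(
    footprint: Optional[FootprintCorrectedData[Sample]],
) -> OrsoCorrections:
    corrections = []
    if footprint is not None:
        corrections.append('footprint correction')
    return OrsoCorrections(corrections)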

jl-wynen commented 5 months ago

The most important argument against that is that this approach checks values, not providers. So, let's say, the user disables the footprint correction. They would still have the FootprintCorrectedData[Sample] in the graph because that is a dependency of other providers. But they would use a provider for it that doesn't do the correction. Your approach would not be able to detect that.

Less importantly, I wanted to avoid having a central place that defines what counts as a correction. If we are willing to have such a place, we don't need a special provider. We can just walk the graph and check for each provider whether it is in the list of corrections. This would be simpler than having to check for the special detect_correction provider.
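
For comparison, a sketch of that graph-walking alternative (the central registry and the use of the internal _graph attribute are assumptions):

# Hypothetical central registry mapping provider names to correction labels.
KNOWN_CORRECTIONS = {
    'footprint_correction': 'footprint correction',
    'supermirror_calibration': 'supermirror calibration',
}

def corrections_from_graph(task_graph) -> list:
    # Check every provider in the graph against the hard-coded registry.
    return [
        KNOWN_CORRECTIONS[p.__name__]
        for p, *_ in task_graph._graph.values()
        if getattr(p, '__name__', '') in KNOWN_CORRECTIONS
    ]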

SimonHeybrock commented 5 months ago

The most important argument against that is that this approach checks values, not providers. So, let's say, the user disables the footprint correction. They would still have the FootprintCorrectedData[Sample] in the graph because that is a dependency of other providers. But they would use a provider for it that doesn't do the correction. Your approach would not be able to detect that.

I don't see what you are saying. If you have a provider that claims to do a footprint correction but does not, there is nothing you can do about that, even if you look at the task graph. In the old approach, if you pass the Orso along with the data and the provider that does not do a footprint correction still writes that to the Orso, you have the same problem. We should avoid providers claiming to return something that is a lie.

Less importantly, I wanted to avoid having a central place that defines what counts as a correction. If we are willing to have such a place, we don't need a special provider. We can just walk the graph and check for each provider whether it is in the list of corrections. This would be simpler than having to check for the special detect_correction provider.

Agreed, we should analyze the task graph, and combine that with the remaining Orso bits that were made using Sciline.

jl-wynen commented 5 months ago

I don't see what you are saying. If you have a provider that claims to do a footprint correction but does not, there is nothing you can do about that, even if you look at the task graph. In the old approach, if you pass the Orso along with the data and the provider that does not do a footprint correction still writes that to the Orso, you have the same problem. We should avoid providers claiming to return something that is a lie.

Unfortunately, in the current system, what a provider claims to do (via its return type) is dictated by the surrounding pipeline. Let's say, again, that we want to disable the footprint correction. We have two options here. 1) replace the footprint correction provider to do nothing but still return a FootprintCorrectedData, or 2) replace all providers that consume FootprintCorrectedData to request something else. Option 1 would potentially be much simpler to do if there are multiple consumers of the corrected data. But in this case, there is a mismatch between the 'claim' of the fake footprint correction provider and what it actually does. We can discourage this all we like, but it will still happen.

An analysis of the keys in the graph cannot detect this case. Only an analysis of the providers can. That is, the provider name, id, or other metadata, but not its return annotation. Hence my suggestion of tagging providers.

In the old approach, if you pass the Orso along with the data and the provider that does not do a footprint correction still writes that to the Orso, you have the same problem.

Correct. But then, the provider author explicitly specified that the provider counts as a correction. And this is not tied to the environment the provider is used in (the pipeline keys). With your proposal, it is not the author of the provider but the author of the package or pipeline that specifies this. I much prefer an explicit solution that is tied directly to the provider.

SimonHeybrock commented 5 months ago

I don't see what you are saying. If you have a provider that claims to do a footprint correction but does not, there is nothing you can do about that, even if you look at the task graph. In the old approach, if you pass the Orso along with the data and the provider that does not do a footprint correction still writes that to the Orso, you have the same problem. We should avoid providers claiming to return something that is a lie.

Unfortunately, in the current system, what a provider claims to do (via its return type) is dictated by the surrounding pipeline. Let's say, again, that we want to disable the footprint correction. We have two options here. 1) replace the footprint correction provider to do nothing but still return a FootprintCorrectedData,

My claim is that this is a bad idea.

or 2) replace all providers that consume FootprintCorrectedData to request something else. Option 1 would potentially be much simpler to do if there are multiple consumers of the corrected data. But in this case, there is a mismatch between the 'claim' of the fake footprint correction provider and what it actually does. We can discourage this all we like, but it will still happen.

That sounds a bit hypothetical. At least currently it seems that our team is writing 90+% of the workflows and providers.

An analysis of the keys in the graph cannot detect this case. Only an analysis of the providers can. That is, the provider name, id, or other metadata, but not its return annotation. Hence my suggestion of tagging providers.

I don't see how that doesn't have the same (or a very similar) problem. If you look for a provider in a certain module to "tell" whether a correction has been applied, someone can still come and modify the provider.

jl-wynen commented 5 months ago

My claim is that this is a bad idea.

That sounds a bit hypothetical. At least currently it seems that our team is writing 90+% of the workflows and providers.

This won't scale though. We cannot control all workflows that will be used. In particular, I'm concerned with modifications to workflows, not the base workflows provided by ESS. People will change things with no regard for recommendations or established best practices. And if the metadata doesn't reflect those changes, it is useless.

I don't see how that doesn't have the same (or a very similar) problem. If you look for a provider in a certain module to "tell" whether a correction has been applied, someone can still come and modify the provider.

Let's say we have

@provider(metadata={'correction': 'footprint correction'})
def footprint_correction(data: Data) -> FootprintCorrectedData:
    # do work
    return ...

If someone comes in and modifies this function to not do a footprint correction but leaves the metadata as it is, this is entirely on them. Everything they need to know and modify is right there. But if the provider decorator wasn't there, they would have to know to remove footprint_correction or FootprintCorrectedData from the list of corrections in a completely different place in the code.

Again, this is about being explicit, right at the place where it matters, that this provider is counted as a correction.

YooSunYoung commented 5 months ago

Since this might be a general issue after all, I have a question about this specific NMX case regarding logging steps / collecting metadata.

NMX reduction has some hard-coded arbitrary manipulations. They are done in the loading provider, so they were not directly accessible from the workflow pipeline. It was also not easy to get at the documentation of each step unless you opened up the code.

So the current solution was to:

It might be clearer to see load_mcstas_nexus in mcstas_loader. And I still couldn't find the best way of storing/documenting the actual values of the arbitrary scale factors.

In summary I would like to:

However, these are mostly about documenting potentially surprising manipulations, not about storing them in a nice way like Orso. Regarding reproducibility, it's probably enough for NMX to export the graph along with the versions or the git commit id...

jl-wynen commented 5 months ago

The remainder of this issue is not about storage but only about accessing the relevant information. So your comment fits well.

But why can't you split the providers into smaller providers where some of them are the steps you want to track? And magic numbers can be turned into parameters. Possibly default parameters implemented in ess.nmx.
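
For instance, a magic scale factor could be exposed along these lines (a sketch; McStasWeightScale and the default value are made up, not the actual ess.nmx API):

from typing import NewType

# Hypothetical parameter type for a McStas-only scale factor.
McStasWeightScale = NewType('McStasWeightScale', float)

def default_weight_scale() -> McStasWeightScale:
    # Default provider shipped with the package, so regular users never
    # need to set the value themselves.
    return McStasWeightScale(1.0)

# The loader would then take McStasWeightScale as a regular dependency, making
# the value visible in the task graph and in any metadata tracking.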

YooSunYoung commented 5 months ago

But why can't you split the providers into smaller providers where some of them are the steps you want to track? And magic numbers can be turned into parameters. Possibly default parameters implemented in ess.nmx.

Because most of those numbers (scale factors) are only for McStas, and we didn't want to expose them as part of the workflow. Also, some of the steps have to be done while the file is open. So if we split them into multiple providers, the input file would need to be opened/read/closed for each provider, which didn't make any sense for NMX...

YooSunYoung commented 5 months ago

I came up with a solution for the internally-called smaller-steps case...! I was ignorant... :D ... I think we have already talked about this before... Here is the solution I came up with: https://github.com/scipp/essnmx/pull/21

If Sciline supports freezing workflows on top of this, I think NMX reduction doesn't need a fancier logging or tracking mechanism for now.

jl-wynen commented 4 months ago

Blocked until the next release of Sciline.