sdss / astra

Analysis framework for SDSS-V/Milky Way Mapper
BSD 3-Clause "New" or "Revised" License

How tasks decide which spectra they will analyse #23

Open andycasey opened 1 month ago

andycasey commented 1 month ago

Some tasks run on all spectra, and some should only run on some kinds of spectra. In the narrowest case, CORV should only run on BOSS spectra assigned to the white dwarf carton, where SnowWhite has already classified that spectrum as a DA-type. In the broadest case, BOSSNet runs on any BOSS spectrum.

In the past I would specify this by setting defaults for the spectra argument in a task. For example:

spectra: Optional[Iterable[BossVisitSpectrum]] = (
        BossVisitSpectrum
        .select()
        .join(SnowWhite, on=(BossVisitSpectrum.spectrum_pk == SnowWhite.spectrum_pk))
        .switch(BossVisitSpectrum)
        .join(
            Corv,
            JOIN.LEFT_OUTER, 
            on=(
                (BossVisitSpectrum.spectrum_pk == Corv.spectrum_pk)
            &   (Corv.v_astra == __version__)
            )
        )
        .where(
            Corv.spectrum_pk.is_null()
        &   (SnowWhite.classification == "DA")
        )
    ),

This is problematic for a few reasons:

  1. It specifies that the default should be the BossVisitSpectrum type, but in reality the task could take co-added spectra or visit spectra. Specifying the spectrum type in the default argument means that if we want to use a non-default type, we need to supply the classification constraints ourselves.
  2. When we call a task from the astra command line tool, we can easily scale a task across many nodes and processes. If we know which task we want to run, and which spectrum type to run it on (e.g., BossVisitSpectrum), then a zeroth-order balancing is easy: we count the number of spectra that need to be analysed, then paginate the SQL query across each node or processor. But that means the astra CLI needs to know whatever constraints the task has about which spectra it will, or will not, accept; otherwise it can't construct the query and paginate it efficiently across nodes and processes.
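The pagination idea in point 2 can be sketched in plain Python. The `page_bounds` helper below is purely illustrative (it is not part of astra): it splits a known row count into per-worker `(limit, offset)` pages that could be applied to the SQL query.

```python
# Hypothetical sketch of zeroth-order load balancing: paginate a spectrum
# query across workers (nodes * processes). `page_bounds` is an illustrative
# name, not an astra function.

def page_bounds(n_spectra: int, n_workers: int) -> list[tuple[int, int]]:
    """Split n_spectra rows into (limit, offset) pages, one per worker."""
    base, remainder = divmod(n_spectra, n_workers)
    pages, offset = [], 0
    for worker in range(n_workers):
        # Spread any remainder over the first few workers.
        limit = base + (1 if worker < remainder else 0)
        pages.append((limit, offset))
        offset += limit
    return pages

# Each worker would then run something like:
#   query.limit(limit).offset(offset)
```

This only works if the CLI can construct the full query itself, which is exactly why the task's spectrum constraints must be visible outside the task body.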

The requirements are:

The new task interface might look something like:

@task(
  spectrum_models=(
    BossVisitSpectrum, 
    BossRestFrameVisitSpectrum, 
    BossCombinedSpectrum
  ),
  inner_join=MappingProxyType({"snow_white": SnowWhite})
)
def corv(spectra: Iterable[Spectrum]):

  for spectrum in spectra:
    # We could check for spectrum.source.assigned_to_program("mwm_wd") but SnowWhite would have done that
    if (spectrum.snow_white.classification == "DA"):
      ...
    else:
      yield Corv.from_spectrum(spectrum, flag_not_processed=True)
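One way such a decorator could work (a sketch under assumptions, not the real implementation) is to simply attach the declared spectrum models and joins as attributes on the task function, so the CLI can introspect them and build the query itself. Model classes are replaced with string placeholders here to keep the example self-contained:

```python
# Hypothetical sketch of the proposed @task decorator: record the accepted
# spectrum models and required inner joins as function attributes for later
# introspection by the CLI.
from types import MappingProxyType

def task(spectrum_models=(), inner_join=MappingProxyType({})):
    def decorator(fn):
        fn.spectrum_models = tuple(spectrum_models)
        fn.inner_join = inner_join
        return fn
    return decorator

@task(
    spectrum_models=("BossVisitSpectrum",),  # placeholder for the model class
    inner_join=MappingProxyType({"snow_white": "SnowWhite"}),
)
def corv(spectra):
    for spectrum in spectra:
        yield spectrum

# The CLI can now read corv.spectrum_models and corv.inner_join without
# ever calling the task.
```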

The benefits are:

  1. No limit or page handling within the task.
  2. No ModelSelect logic within the task.
  3. We keep a 1-to-1 record of spectrum rows and pipeline rows; even though it adds table bloat, it prevents duplicates.
  4. If we supply no spectrum type to the astra CLI, it can figure out which spectra it should run, and distribute across all nodes/procs.

The downsides are:

  1. If you just want to run things interactively, you had better make sure you give the spectra with the appropriate attributes (e.g., snow_white). This is already a problem in the current set-up. We could mitigate it by having something like:
    generate_task_runs(corv, overwrite=False, limit=10)

    which would look at the task definition, see which spectra it accepts, create the queries with the necessary joins, only analyse new spectra (overwrite=False) or analyse everything (overwrite=True), and yield a query you can provide to corv.
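A minimal sketch of what `generate_task_runs` could do, assuming the task metadata is introspectable (this helper does not exist in astra; peewee query construction is replaced by plain-Python filtering to keep it self-contained):

```python
# Hypothetical generate_task_runs: skip spectra that already have a pipeline
# row unless overwrite=True, and respect the limit. In the real case this
# would build a peewee query with the task's declared joins instead.
from itertools import islice

def generate_task_runs(task_fn, spectra, processed_pks, overwrite=False, limit=None):
    """Yield the spectra the task still needs to analyse."""
    candidates = (
        s for s in spectra
        if overwrite or s["spectrum_pk"] not in processed_pks
    )
    yield from islice(candidates, limit)
```

Interactively, one could then do `corv(generate_task_runs(corv, ...))` without writing any join logic by hand.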

andycasey commented 1 month ago

Any details about how the task should be executed should live in the local Astra config file, and not be hard-coded in the task. This includes things like: