irfu / irfu-matlab

Matlab routines to work with space data, particularly with MMS and Cluster/CAA data. Also some general plasma routines.

BICAS: Update datasets to archiving standards #146

Open ErikPGJ opened 2 weeks ago

ErikPGJ commented 2 weeks ago

ROC & SOC want to update the datasets generated by BICAS w.r.t. metadata standards.

Exactly what this entails is a bit uncertain to me at the time of writing, but I assume it should at least include below:

Note: Xavier Bonnin/ROC offers a validation script which one can use: "check_rpw_cdf.py". See e-mail, Xavier Bonnin, 2024-06-12: "[roc.rcs] New release of check_rpw_cdf.py script"

A new release of the check_rpw_cdf.py script has just been committed on the "roc" branch of DataPool (see commit 6ea9e0bc06e262c65da18a8bea9611eb5becdafa).

ErikPGJ commented 2 weeks ago

Possibly the CDF update should have been a separate issue, but it is related.

@ilona-irf Note that requirements can come from both SOC and ROC. They can be different but must be consistent and I assume that SOC overrides ROC by default.

thomas-nilsson-irfu commented 2 weeks ago

Just adding a short comment here: The CDF lib included in irfu-matlab (latest devel/MMSdevel branches) is the latest CDF patch for Matlab (v3.9.0) released by NASA SPDF; the master branch of irfu-matlab still uses v3.8.1. (I know that NASA currently has a v3.9.1 in the works, but it is not yet officially released, so I have not included it in irfu-matlab.)

ErikPGJ commented 2 weeks ago

Note that there is a GitLab account for BICAS specifically at ROC too, in addition to irfu-matlab. It contains issues which should be related to this one (possibly subsets of this one):
https://gitlab.obspm.fr/ROC/RCS/BICAS/-/issues/47
https://gitlab.obspm.fr/ROC/RCS/BICAS/-/issues/84
https://gitlab.obspm.fr/ROC/RCS/BICAS/-/issues/85

Issue 47 is from Feb 2021 but is still open. Unclear if it can be closed.

ErikPGJ commented 2 weeks ago

@ilona-irf Also note that the BICAS metadata in descriptor.json contains a reference to the RCS ICD version which BICAS officially supports. Not sure if the current value is correct as it is, or how much it matters, but

then this variable should be updated too. I doubt that there is anything to do on the actual BICAS interface itself (2) but one never knows (I have not heard anything for a long time, and ROC has not complained while running recent BICAS versions).

The RCS ICD version is set in bicas.const,

      MAP('SWD.identification.icd_version') = '1.4';

which is automatically passed on to descriptor.json by generating it using bicas.main('--swdescriptor').

Note: The RCS ICD covers both (1) the official BICAS interface (the "BICAS API") and (2) some dataset metadata (consortium-specific conventions?).
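A quick sanity check of the ICD version that ends up in the descriptor could look like the Python sketch below. It assumes the output of bicas.main('--swdescriptor') has been saved as JSON and that the key layout mirrors the 'SWD.identification.icd_version' setting name, i.e. an `identification` object holding `icd_version`. Neither that layout nor the helper name `read_icd_version` is confirmed in this thread.

```python
import json


def read_icd_version(descriptor_text):
    """Return identification.icd_version from a descriptor JSON string.

    Assumed layout (not confirmed in this thread):
        {"identification": {"icd_version": "..."}}
    """
    descriptor = json.loads(descriptor_text)
    return descriptor["identification"]["icd_version"]


# Hypothetical, minimal descriptor content, for illustration only.
example = '{"identification": {"icd_version": "1.4"}}'
print(read_icd_version(example))
```

In practice one would read the text from the generated descriptor.json file and compare the result with the ICD issue that the skeletons claim compliance with.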

ErikPGJ commented 2 weeks ago

@ilona-irf Also, after updating the dataset skeletons proper, you might need to update the MODS global attribute with information on what was updated for the datasets. MODS is not set in the skeletons but in BICAS, via the data structure built in bicas.const.init_GA_MODS_DB(). It builds a data structure using objects. There is a system in place for how to use functions and pre-defined constants to avoid hardcoded duplication, even if the data structure contains duplicated data (same partial update for multiple datasets).

MODS cannot be set in the skeletons since MODS also contains (mostly) information on updates to the processing, which can be updated independently of the skeletons. Not sure how much pure dumb skeleton information should be mentioned there, but it might be, at least for "big" changes like removing/adding/renaming zVariables.
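The idea of registering one update for several datasets without hardcoding it more than once can be illustrated with a small Python sketch. This is not the BICAS (MATLAB) implementation in bicas.const.init_GA_MODS_DB(); the class `ModsDb`, its methods, the entry layout, the example text and the second dataset ID are all hypothetical.

```python
class ModsDb:
    """Hypothetical MODS database: dataset ID -> list of (date, comment)."""

    def __init__(self):
        self._entries = {}

    def add(self, dataset_ids, date, comment):
        # One call stores the same entry for every listed dataset, so the
        # text is written once even though the stored data is duplicated.
        for dataset_id in dataset_ids:
            self._entries.setdefault(dataset_id, []).append((date, comment))

    def get(self, dataset_id):
        # Return a copy so callers cannot modify the database by accident.
        return list(self._entries.get(dataset_id, []))


# Pre-defined constant reused for every update that applies to both datasets.
LFR_SURV_IDS = ("SOLO_L2_RPW-LFR-SURV-SWF-E", "SOLO_L2_RPW-LFR-SURV-CWF-E")

db = ModsDb()
db.add(LFR_SURV_IDS, "2024-07-04", "Illustrative entry shared by two datasets.")
db.add(LFR_SURV_IDS[:1], "2024-07-05", "Illustrative entry for one dataset.")
print(db.get("SOLO_L2_RPW-LFR-SURV-SWF-E"))
```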

ErikPGJ commented 3 days ago

ROC now wants us to use CDF compression (i.e. compression as part of the CDF format itself). Xavier Bonnin mentions this in the two LESIA GitLab issues mentioned above.

I have not found an explicit mention of CDF compression being allowed or disallowed in the documents where I would expect it (have only looked quickly though):

This should be implemented by adding/setting the relevant flag for the CDF-writing library when it is called by BICAS. Note that CDF compression can be enabled/disabled separately for every zVariable, as well as (I think) for the CDF as a whole.

ErikPGJ commented 2 days ago

Skeleton files (.skt) say things like below,

  ! VAR_COMPRESSION: None
  ! (Valid compression: None, GZIP.1-9, RLE.0, HUFF.0, AHUFF.0)

but it seems that the valid values should be interpreted as GZIP.1 to GZIP.9 etc., depending on degree of desired compression. It seems "GZIP.1-9" is not a valid value, though skt2cdf.sh will not give an error for it.

Note that Xavier Bonnin specifically mentions "GZIP.6".
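Since skt2cdf.sh reportedly does not complain about "GZIP.1-9", a small check on the skeleton text can catch it before conversion. The Python sketch below applies the strict reading described above (GZIP.1 to GZIP.9, plus None, RLE.0, HUFF.0 and AHUFF.0). The function names are made up for this example, and the check is case-sensitive, which may be stricter than the CDF tools themselves.

```python
import re

# Values taken from the "Valid compression" comment in the skeleton files.
_FIXED_VALUES = {"None", "RLE.0", "HUFF.0", "AHUFF.0"}
_GZIP_LEVEL = re.compile(r"GZIP\.[1-9]")
_VAR_COMPRESSION_LINE = re.compile(r"\s*!\s*VAR_COMPRESSION:\s*(.*)$")


def is_valid_var_compression(value):
    """True for None, RLE.0, HUFF.0, AHUFF.0 and GZIP.1 ... GZIP.9."""
    value = value.strip()
    return value in _FIXED_VALUES or _GZIP_LEVEL.fullmatch(value) is not None


def find_invalid_var_compression(skeleton_text):
    """Return (line_number, value) for each invalid VAR_COMPRESSION line."""
    invalid = []
    for number, line in enumerate(skeleton_text.splitlines(), start=1):
        match = _VAR_COMPRESSION_LINE.match(line)
        if match and not is_valid_var_compression(match.group(1)):
            invalid.append((number, match.group(1).strip()))
    return invalid


sample = (
    "  ! VAR_COMPRESSION: GZIP.1-9\n"
    "  ! (Valid compression: None, GZIP.1-9, RLE.0, HUFF.0, AHUFF.0)\n"
    "  ! VAR_COMPRESSION: GZIP.6\n"
)
print(find_invalid_var_compression(sample))
```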

ErikPGJ commented 2 days ago

FYI, I have implemented support for zVariable compression (not "entire-file" compression, the other CDF compression feature) in CDFs in BICAS. The information (compress/not compress) comes from the skeleton/master file as described above. I have tested it on one skeleton.

ErikPGJ commented 2 days ago

FYI, that update is on SOdevel. BICAS development is always on SOdevel.

32e461134 Erik P G Johansson (2024-07-02 16:13:05 +0200) (HEAD -> SOdevel, origin/SOdevel) irf.cdf.write_dataobj(): Support variable compression

ErikPGJ commented 13 hours ago

Footnote: I use lists for different categories of dataset IDs, which I can then use for automating (bash etc.) tasks relating to datasets, e.g. skeletons.

RODP (=inflight) dataset IDs: RODP_BICAS_dataset_IDs.zip

Note that this includes:

ErikPGJ commented 12 hours ago

@ilona-irf If you are interested in scripts, then interactive_replace (bash script for interactive string replacement) and so_find_* (bash functions defined in init_aliases_functions; for "globbing" using lists of SolO dataset IDs) are relevant.

I have a copy of my bash scripts (and a small number of python scripts) at brain: /home/erjo/bin/global/.
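For anyone who prefers Python over bash, the list-driven selection idea can be sketched as below. This is not a port of so_find_* or interactive_replace; the function name, the example file names and the assumption that a dataset file name starts with the dataset ID followed by "_" (compared case-insensitively) are all illustrative.

```python
def select_by_dataset_id(filenames, dataset_ids):
    """Keep file names that start with one of the dataset IDs plus '_'.

    The '<DATASET_ID>_...' naming pattern and the case-insensitive
    comparison are assumptions made for this sketch.
    """
    prefixes = tuple(dataset_id.lower() + "_" for dataset_id in dataset_ids)
    return [name for name in filenames if name.lower().startswith(prefixes)]


# Hypothetical inputs, for illustration only.
dataset_ids = ["SOLO_L2_RPW-LFR-SURV-SWF-E"]
filenames = [
    "solo_L2_rpw-lfr-surv-swf-e_20240101_V01.cdf",
    "solo_L2_rpw-lfr-surv-swf-e-cdag_20240101_V01.cdf",
    "solo_L3_rpw-bia-efield_20240101_V01.cdf",
]
print(select_by_dataset_id(filenames, dataset_ids))
```

Including the trailing "_" in the prefix keeps a dataset ID from also matching longer IDs that merely begin with the same characters.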

ErikPGJ commented 10 hours ago

7d5fedbbf Erik P G Johansson (2024-07-04 13:23:45 +0200) (HEAD -> SOdevel, origin/SOdevel) Compliance-fix: GAs TIME_MIN, TIME_MAX: Change Julian date-->"ISO"

fixes the TIME_MIN/TIME_MAX issue.
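For reference, the kind of conversion named in that commit message can be sketched in Python as follows. This is not the BICAS (MATLAB) code. The exact string layout required for TIME_MIN/TIME_MAX is not stated in this thread, so the "YYYY-MM-DDThh:mm:ssZ" form used here is an assumption, and the sketch ignores leap seconds and sub-second precision.

```python
from datetime import datetime, timedelta, timezone

# Julian date of the Unix epoch, 1970-01-01T00:00:00 UTC.
JD_UNIX_EPOCH = 2440587.5


def julian_date_to_iso(julian_date):
    """Convert a Julian date (days) to an ISO 8601-style UTC string.

    The output layout is an assumption; check the applicable metadata
    standard for the exact format required.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    moment = epoch + timedelta(days=julian_date - JD_UNIX_EPOCH)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(julian_date_to_iso(2451545.0))  # J2000.0 -> 2000-01-01T12:00:00Z
```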

ErikPGJ commented 7 hours ago

Change "Spaceraft" --> "Spacecraft" ...

ilona-irf commented 6 hours ago

Thanks a lot.

[SOdevel 49ff754fd] Update Software_name GA to use identificatior.identifier field from the descriptor (ICD 1.7)

Will add one entry to MODS which should refer to all updates made which are related to this issue, similar to the entry I added in SKELETON_MODS (not committed yet):
13: CDF_CHAR { "V15: Jul 2024 : Update to make compliant with SOL-SGS-TN-0009 i2.6 and ROC-PRO-PIP-ICD-00037-LES i1.7, removal of some unused optional attributes. - I.Benko (IRF)" } .

I've started working on Skeleton version 15 in DataPool (for all our datasets on branch bia_tmp). Finished fixing all GAs for dataset SOLO_L2_RPW-LFR-SURV-SWF-E https://gitlab.obspm.fr/ROC/RCS/BICAS/-/issues/84

Will convert it to CDF, validate, test processing, and then commit once I've finished fixing all the zVars, then repeat for the remaining 5 L2 datasets. Let me know if it's needed for any L3 ones.

(Not noting all the minor changes here; will ping once the commit of the final skeleton/master CDF is in ROC's DataPool. Trying to minimize the number of commits in DataPool.)


Bigger things about GA directly related to BICAS:


(There were many other updates made in GAs, but) things I'm considering pointing out to Xavier in the original GitLab issue are mistakes which were not detected by the validator used to generate the report from SOAR (attached by Xavier in the original GitLab issue):

In the past couple of days, other people have made submissions compatible with CDF 3.9, so it seems like we should too(?). The note in the original GitLab issue also says "CDF 3.9.0 will be used to generate RPW science data files", so it seems they are (kind of) asking for it formally.