umr-lops / xsar

Synthetic Aperture Radar (SAR) Level-1 GRD python mapper for efficient xarray/dask based processing
https://cyclobs.ifremer.fr/static/sarwing_datarmor/xsar/
MIT License

add doc to variables #50

Closed oarcher closed 1 year ago

oarcher commented 2 years ago

#35 is a good way to add the xpath that was used to the 'history' attribute, when the variable is an xarray variable.

But there is a limitation when the variable is a scalar packed in a simple structure, such as a dict.

For example, here is the 'attrs' dict for a dataset:

{'ipf': 2.84,
 'platform': 'SENTINEL-1A',
 'swath': 'IW',
 'product': 'SLC',
 'pols': 'VV VH',
 'name': 'SENTINEL1_DS:/home/oarcher/SAFE/S1A_IW_SLC__1SDV_20170907T102951_20170907T103021_018268_01EB76_AE7B.SAFE:IW1',
 'start_date': Timestamp('2017-09-07 10:29:51.883706'),
 'stop_date': Timestamp('2017-09-07 10:30:19.761161'),
 'footprint': <shapely.geometry.polygon.Polygon at 0x7f813bc1f640>,
 'coverage': '188km * 86km (atrack * xtrack )',
 'pixel_atrack_m': 12.646091171537792,
 'pixel_xtrack_m': 4.18874809917281,
 'orbit_pass': 'Descending',
 'platform_heading': -167.7176164347211}

Each value in the dict is a plain Python type, with no information about the XML xpath it came from.

We could replace those simple Python types with a class XmlVar, which should have the following attributes:

The doc attribute can be extracted from the 'support/*.xsd' files of the SAFE directory. There seems to be no easy way to translate an xpath in the XML file into the matching location in the XSD file.

Here is a first approach that could be used in a PR:

from lxml import etree
import os
import re
import yaml
from anytree import Node
from anytree.exporter import DictExporter

class XmlVar:
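    def __init__(self, xml_etree, xpath, convert_func=None, xsd_tree=None):
        self.xml_etree = xml_etree
        self.xpath = xpath
        self.convert_func = convert_func  # kept for later use, not applied yet
        self.elts = self.xml_etree.xpath(self.xpath)
        # avoid a mutable default argument shared between instances
        self.xsd_tree = xsd_tree if xsd_tree is not None else []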

    def get_doc_node(self, name, parent=None, schema=None):
        if schema is None:
            schema = self.xsd_tree[0]

        # from name, find relevant entry in schema
        sub_schema = schema.xpath(
            "//xsd:*[@name = $n]",
            namespaces={"xsd": "http://www.w3.org/2001/XMLSchema"},
            n=name
        )[0]

        # extract documentation
        documentation = str(sub_schema.xpath(
            'xsd:annotation/xsd:documentation/text()',
            namespaces={"xsd": "http://www.w3.org/2001/XMLSchema"}
        )[0])

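        # if sub_schema has a type, jump to the definition of that type
        xml_type = sub_schema.get('type', None)
        if xml_type is not None:
            try:
                # search the schema tree held by the instance, not the
                # module-level `schema_etree` that only exists in __main__
                sub_schema = self.xsd_tree[0].xpath(
                    "//xsd:*[@name = $n]",
                    namespaces={"xsd": "http://www.w3.org/2001/XMLSchema"},
                    n=xml_type)[0]
            except IndexError:
                # no definition found for this type: keep the current entry
                pass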

        return Node(name=name, documentation=documentation, schema=sub_schema, parent=parent)

    @property
    def get_doc(self):
        # get xpath without list numbering
        generic_xpaths = list({re.sub(r'\[\d+\]', '', self.xml_etree.getpath(elt)) for elt in self.elts})
        if len(generic_xpaths) > 1:
            raise NotImplementedError('xpath matches elements under several distinct generic xpaths')
        generic_xpath = generic_xpaths[0]
        parent = None
        schema = self.xsd_tree[0]
        for name in filter(None, generic_xpath.split('/')):
            doc_node = self.get_doc_node(name, parent=parent, schema=schema)
            parent = doc_node
            schema = doc_node.schema

        dct = DictExporter().export(doc_node.root)
        return self._render_doc(dct)

    def _render_doc(self, dct, depth=0):
        doc = ('  ' * depth).join(('\n' + dct['name'].lstrip()).splitlines(True)) + ':\n'

        doc += ('  ' * (depth + 1)) + 'doc: %s' % dct['documentation']
        if 'children' in dct:
            doc = doc + self._render_doc(dct['children'][0], depth=depth + 1)
        return doc

if __name__ == "__main__":
    safe_path = '/home/oarcher/SAFE/S1A_IW_GRDH_1SDV_20170907T103020_20170907T103045_018268_01EB76_992F.SAFE'

    schema_path = os.path.join(safe_path, 'support/s1-level-1-product.xsd')

    xml_path = os.path.join(safe_path, 'annotation/s1a-iw-grd-vv-20170907t103020-20170907t103045-018268-01eb76-001.xml')

    xml_etree = etree.parse(xml_path)
    schema_etree = etree.parse(schema_path)

    slant_range_time = XmlVar(
        xml_etree,
        '/product/dopplerCentroid/dcEstimateList/dcEstimate/fineDceList/fineDce/slantRangeTime',
        xsd_tree=[schema_etree]
    )

    doc = slant_range_time.get_doc

    print(yaml.safe_dump(yaml.safe_load(doc), sort_keys=False))

The output is the documentation string for slantRangeTime and for all of its ancestor elements:

product:
  doc: L1 Product root element.
  dopplerCentroid:
    doc: Doppler centroid data set record. This DSR contains information about the
      Doppler centroid values estimated and used during image processing.
    dcEstimateList:
      doc: List of Doppler centroid estimates that have been calculated by the IPF
        during image processing. The list contains an entry for each Doppler centroid
        estimate made along azimuth.
      dcEstimate:
        doc: Doppler centroid estimate record which contains the Doppler centroid
          calculated from geometry and estimated from the data, associated signal-to-noise
          ratio values and indicates which DCE method was used by the IPF during image
          processing. With a minimum Doppler centroid update rate of 1s (for IW and
          EW where the Doppler is recalculated for every burst cycle) and a maximum
          product length of 25 minutes, the maximum size of this list is 1500 elements.
        fineDceList:
          doc: List of the fine Doppler centroid estimates for this block. This element
            is a list of fineDce records which contain the fine Doppler centroid frequencies
            that were used for fitting the data polynomial for this block.
          fineDce:
            doc: Fine Doppler centroid estimate. Each estimate represents the Doppler
              frequency at the given slant range time within the current block. Approximately
              20 estimates are performed per swath so for 5 swaths, the maximum number
              of estimates in this list is 100.
            slantRangeTime:
              doc: Two-way slant range time array for this antenna pattern [s]. This
                array contains the count attribute number of double floating point
                values (i.e. one value per point in the antenna pattern), separated
                by spaces.
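
The first step of get_doc, removing list numbering from the xpaths, can be looked at in isolation. The sketch below is only an illustration: the helper name generic_xpath and the two indexed paths are made up for the example (they mimic what getpath returns for repeated elements), and only the regular expression is taken from the code above.

```python
import re


def generic_xpath(indexed_xpath):
    """Drop positional predicates such as [3] from an xpath string."""
    return re.sub(r'\[\d+\]', '', indexed_xpath)


# Hypothetical indexed paths, in the style returned for repeated elements.
paths = [
    '/product/dopplerCentroid/dcEstimateList/dcEstimate[1]/fineDceList/fineDce[1]/slantRangeTime',
    '/product/dopplerCentroid/dcEstimateList/dcEstimate[2]/fineDceList/fineDce[7]/slantRangeTime',
]

# Both paths collapse to the same generic xpath, so a single
# documentation chain is enough for all matched elements.
unique = sorted({generic_xpath(p) for p in paths})
print(unique)
```

When the set holds more than one generic xpath, the matched elements live in different places of the schema, which is the case get_doc refuses with NotImplementedError.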

lanougue commented 2 years ago

From my point of view, the best way is to store these variables as xarray variables of shape () in the dataset. This is not a problem, and it keeps all variables and all their attributes/names/... homogeneous.
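
A minimal sketch of that idea, assuming only plain xarray: the scalar is wrapped in a 0-dimensional DataArray, so it can carry the same kind of 'history' attribute as any other variable. The variable name and value come from the 'attrs' dict shown earlier; the xpath string is a placeholder, not the real path used by xsar.

```python
import xarray as xr

# A scalar stored as a 0-dimensional variable, with its own attributes.
heading = xr.DataArray(
    -167.7176164347211,
    attrs={
        # placeholder: the real xpath would be filled in by the reader
        'history': 'hypothetical/xpath/to/platformHeading',
    },
)

ds = xr.Dataset({'platform_heading': heading})

print(ds['platform_heading'].shape)             # empty tuple: no dimensions
print(ds['platform_heading'].attrs['history'])  # metadata travels with the value
value = ds['platform_heading'].item()           # back to a plain Python float
```

The cost is that users call .item() (or float()) to get the raw value, but every quantity in the dataset is then documented in the same way.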

lanougue commented 2 years ago

Homogeneity makes further manipulation easier

agrouaze commented 1 year ago

The definition of each variable has been added using the xsd files (#118). There are no definitions for the ancestor nodes, but it seems to do the job. @oarcher, let me know whether you prefer to go with your first idea or whether I should close this issue.

agrouaze commented 1 year ago

I am closing this issue, since the current implementation seems to provide enough information. Feel free to re-open.