koordinates / kart

Distributed version-control for geospatial and tabular data
https://kartproject.org

Importing a dataset from GPKG with multiple XML attachments fails #547

Open olsen232 opened 2 years ago

olsen232 commented 2 years ago

The error message is currently extremely unhelpful:

Traceback (most recent call last):
  File "kart_cli.py", line 4, in <module>
  File "kart\cli.py", line 334, in entrypoint
  File "lib\site-packages\click\core.py", line 829, in __call__
  File "lib\site-packages\click\core.py", line 782, in main
  File "kart\cli.py", line 157, in invoke
  File "lib\site-packages\click\core.py", line 1259, in invoke
  File "lib\site-packages\click\core.py", line 1066, in invoke
  File "lib\site-packages\click\core.py", line 610, in invoke
  File "lib\site-packages\click\decorators.py", line 21, in new_func
  File "kart\init.py", line 355, in import_
  File "kart\fast_import.py", line 349, in fast_import_tables
  File "kart\fast_import.py", line 532, in _import_single_source
  File "kart\fast_import.py", line 549, in write_blobs_to_stream
  File "kart\fast_import.py", line 543, in write_blob_to_stream
TypeError: a bytes-like object is required, not 'list'

We only support one piece of attached XML metadata, whereas the GPKG spec allows arbitrarily many. Trying to edit and commit a second XML attachment behaves slightly better. Firstly, it has a better error message: "Sorry, committing more than one XML metadata file is not supported". Secondly, it's less likely to happen: a user is much more likely to import an existing GPKG from some other system that happens to have multiple XML attachments than to edit the one in their working copy in this way. If they do, they are more likely to be able to undo what they have done (if all else fails, by running kart reset or similar).
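A minimal sketch of failing early with a readable message, rather than letting a list reach the code that expects bytes. The names UnsupportedMetadataError and single_xml_attachment are hypothetical and not part of Kart; the message text is modelled on the existing commit-time error, and the input is assumed to be a list of strings read from the GPKG.

```python
class UnsupportedMetadataError(ValueError):
    """Hypothetical error type for illustration; not Kart's actual API."""


def single_xml_attachment(xml_attachments):
    """Return the single XML attachment as bytes, or None if there is none.

    Assumes `xml_attachments` is a list of str. Raises a descriptive error
    when more than one attachment is present, instead of passing the list
    on to code that requires a bytes-like object.
    """
    if len(xml_attachments) > 1:
        raise UnsupportedMetadataError(
            "Sorry, importing a dataset with more than one XML metadata file "
            f"is not supported (found {len(xml_attachments)})"
        )
    if not xml_attachments:
        return None
    return xml_attachments[0].encode("utf-8")
```

A check like this would run before any blobs are written, so the user sees the cause of the failure instead of a TypeError from deep inside the import.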

In the GPKGs I have seen, one of the XML attachments is often junk anyway. For instance, in the following example:

First XML attachment:

<!DOCTYPE qgis PUBLIC 'http://mrcc.com/qgis.dtd' 'SYSTEM'>
<qgis version="3.20.3-Odense">
  <identifier></identifier>
  <parentidentifier></parentidentifier>
  <language></language>
  <type></type>
  <title></title>
  <abstract></abstract>
  <contact>
    <name></name>
    <organization></organization>
    <position></position>
    <voice></voice>
    <fax></fax>
    <email></email>
    <role></role>
  </contact>
  <links/>
  <fees></fees>
  <encoding></encoding>
  <crs>
    <spatialrefsys>
      <wkt></wkt>
      <proj4></proj4>
      <srsid>0</srsid>
      <srid>0</srid>
      <authid></authid>
      <description></description>
      <projectionacronym></projectionacronym>
      <ellipsoidacronym></ellipsoidacronym>
      <geographicflag>false</geographicflag>
    </spatialrefsys>
  </crs>
  <extent>
    <spatial minx="0" miny="0" dimensions="2" maxz="0" crs="" maxy="0" minz="0" maxx="0"/>
    <temporal>
      <period>
        <start></start>
        <end></end>
      </period>
    </temporal>
  </extent>
</qgis>

Second XML attachment:

<GDALMultiDomainMetadata>
  <Metadata>
    <MDI key="GPKG_METADATA_ITEM_1">&lt;!DOCTYPE qgis PUBLIC 'http://mrcc.com/qgis.dtd' 'SYSTEM'&gt;
&lt;qgis version="3.20.3-Odense"&gt;
  &lt;identifier&gt;&lt;/identifier&gt;
  &lt;parentidentifier&gt;&lt;/parentidentifier&gt;
  &lt;language&gt;&lt;/language&gt;
  &lt;type&gt;&lt;/type&gt;
  &lt;title&gt;&lt;/title&gt;
  &lt;abstract&gt;&lt;/abstract&gt;
  &lt;contact&gt;
    &lt;name&gt;&lt;/name&gt;
    &lt;organization&gt;&lt;/organization&gt;
    &lt;position&gt;&lt;/position&gt;
    &lt;voice&gt;&lt;/voice&gt;
    &lt;fax&gt;&lt;/fax&gt;
    &lt;email&gt;&lt;/email&gt;
    &lt;role&gt;&lt;/role&gt;
  &lt;/contact&gt;
  &lt;links/&gt;
  &lt;fees&gt;&lt;/fees&gt;
  &lt;encoding&gt;&lt;/encoding&gt;
  &lt;crs&gt;
    &lt;spatialrefsys&gt;
      &lt;wkt&gt;&lt;/wkt&gt;
      &lt;proj4&gt;&lt;/proj4&gt;
      &lt;srsid&gt;0&lt;/srsid&gt;
      &lt;srid&gt;0&lt;/srid&gt;
      &lt;authid&gt;&lt;/authid&gt;
      &lt;description&gt;&lt;/description&gt;
      &lt;projectionacronym&gt;&lt;/projectionacronym&gt;
      &lt;ellipsoidacronym&gt;&lt;/ellipsoidacronym&gt;
      &lt;geographicflag&gt;false&lt;/geographicflag&gt;
    &lt;/spatialrefsys&gt;
  &lt;/crs&gt;
  &lt;extent&gt;
    &lt;spatial minx="0" miny="0" dimensions="2" maxz="0" crs="" maxy="0" minz="0" maxx="0"/&gt;
    &lt;temporal&gt;
      &lt;period&gt;
        &lt;start&gt;&lt;/start&gt;
        &lt;end&gt;&lt;/end&gt;
      &lt;/period&gt;
    &lt;/temporal&gt;
  &lt;/extent&gt;
&lt;/qgis&gt;
</MDI>
  </Metadata>
</GDALMultiDomainMetadata>

In this example, the first XML file happens not to contain any useful information, and the second XML file is just a wrapped-and-escaped version of the first that needs extra parsing. The junkier XML file is slightly longer in this case, so we can't assume that "longer" means "more informative" if we develop a heuristic to decide which XML gets to stay. We can probably detect the case where >90% of one XML file is just the same as the other XML file and the remainder is just boilerplate.
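One possible shape for such a heuristic, as a sketch only and not necessarily what Kart does. It takes a slightly narrower route than a general ">90% the same" comparison: it unwraps the GDALMultiDomainMetadata items first, then compares each payload to the other attachment with difflib. The function names and the whitespace normalisation are assumptions; the 0.9 threshold is taken from the figure suggested above.

```python
import difflib
import xml.etree.ElementTree as ET


def unwrap_gdal_metadata(xml_text):
    """Return the text of every <MDI> item if xml_text is a
    GDALMultiDomainMetadata document, otherwise an empty list."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    if root.tag != "GDALMultiDomainMetadata":
        return []
    # ElementTree has already unescaped &lt; and &gt; in the item text.
    return [mdi.text or "" for mdi in root.iter("MDI")]


def normalise(text):
    """Collapse all runs of whitespace so indentation does not affect the match."""
    return " ".join(text.split())


def is_wrapped_copy(candidate, other, threshold=0.9):
    """True if candidate is a GDAL wrapper around (nearly) the same XML as other."""
    target = normalise(other)
    for payload in unwrap_gdal_metadata(candidate):
        matcher = difflib.SequenceMatcher(
            None, normalise(payload), target, autojunk=False
        )
        if matcher.ratio() >= threshold:
            return True
    return False


def drop_redundant(attachments):
    """Keep only the attachments that are not wrapped copies of another one."""
    return [
        a
        for i, a in enumerate(attachments)
        if not any(
            is_wrapped_copy(a, b) for j, b in enumerate(attachments) if i != j
        )
    ]
```

Applied to the two attachments above, this would keep the plain qgis document and drop the wrapper, regardless of which one is longer. SequenceMatcher is quadratic in the worst case, so a size cap may be sensible for very large metadata files.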

olsen232 commented 2 years ago

Error message is improved, and the particular example shown above now drops the second (junk) XML file. https://github.com/koordinates/kart/pull/548