keeps / dbptk-developer

DBPTK Developer - library and command-line tool for execution of database preservation actions
http://www.database-preservation.com
GNU Lesser General Public License v3.0
43 stars 19 forks source link

Oracle: Convert SDO_geometry column content to GML #343

Closed luis100 closed 5 years ago

luis100 commented 5 years ago

Agencies have geodata within columns of tables. Usually we have Oracle databases with SDO_GEOMETRY data type and we need to take this information as well but we would prefer to take it as GML.

Laurira commented 5 years ago

Phillip point out that it is reasonable to put the gml into separate files (and add links to siard) not directly into the siard.

Pros:

  1. user finds the file and is able to import the separate gml file to QKIS for example.
  2. siard is not so messy

Cons:

  1. you can not search the coordinate via DBVTK (but I am not sure how much is this search needed anyway).

What are your thoughts? Should we put all the gml-s into one folder. Then it is easier to transfer the files to QKIS for example?

Laurira commented 5 years ago

I talked with our danish collegues and made some experiments with our own maps.

DBPTK should produce two versions of gml-data:

  1. Small gml file for each table row. This is for user who wants to upload the gml to QGis for example and see his/her land on the map.
  2. Big gml file for entire table. Data analyst need to make statistics so he/she should be able to import gml to one layer. (If he has 10 000 gml-s then it is impossible). Danish have set a rule that gml files should be limited to 1 GB. So actually this version of export may produce more than one gml but it is still usable for data analysts (For example QGis can handle 100-200 layers at once). The limit 1 GB sounds reasonable at the moment...

Our institutions use mostly estonian specific coordinate system (EPSG:3301 - Estonian Coordinate System of 1997), but it is useless for GPS devices or google maps. During the conversion it should convert the geometry to WGS84 (that is most used coordinate system).

So it seems that before the conversion user should specify two things in some dialog: current coordinate system and the limit for the big GML(s).

chalkos commented 5 years ago

From my experience, it is best if DBPTK aims to save information in the most simple and documented way. This has a lot of advantages from a software development point of view.

I agree with Phillip:

Phillip point out that it is reasonable to put the gml into separate files (and add links to siard) not directly into the siard.

And the only "con":

  1. you can not search the coordinate via DBVTK (but I am not sure how much is this search needed anyway).

Is something that can be handled specifically for DBVTK and GML files if and when it is necessary.

The GMLs should be saved like other lobs, with the appropriate file extension and some documentation on the SIARD about it being a GML file.


I propose saving the small gml files for each table row. If a "source coordinate system" parameter is provided, then a coordinate system conversion will be done to (for example) WGS84.

An optional parameter that limits the GML file size can also be implemented. If DBPTK is about to create a GML file bigger than that, everything fails and the whole migration is aborted. This is ok since this is an optional parameter and by default DBPTK would save GML of any size.

To create the big gml file for the entire table (which seems to be a requirement, let me know if it is not) it would be better to do one of these:

  1. Have the data analyst use a different software that merges GML files together (if this is a common use case I'm sure a good software must exist to do this already); OR
  2. Develop a new DBPTK export module that produces a big GML file for each table.
Laurira commented 5 years ago
  1. this separate GML export module is not a bad idea. As gml can contain additional fields as well along with geo-data then user should be able to select out the fields that will go to the GML. Then this data will be accessible with some GIS as well after the GML import.
tomasmferreira commented 5 years ago

Exported SDO_GEOMETRY cells are converted to a xml geometry that can be saved in the SIARD file as text. Multiple xml geometries can be combined into a .gml file.

When exporting with -egml <directory> (SIARD2 parameter), a .gml file will be created for each table containing SDO_GEOMETRY cells. This file will include all the geometries present in said table, creating a "feature" for each row which includes the geometry, and uses the rest of the columns as metadata.


dev notes: using -egml injects a FilterModule that handles saving data in the gml file and forwards the cell values (without changes) to the siard export module.

Laurira commented 5 years ago

I tested this with one of our institutions database which has SDO_GEOMETRY present.

I got following result in metadata.xml

<column>
    <name>GEOMEETRIA</name>
    <lobFolder>lob7</lobFolder>
    <type>CHARACTER LARGE OBJECT</type>
    <typeOriginal>SDO_GEOMETRY</typeOriginal>
    <nullable>true</nullable>
</column>

But where is this lob folder lob7? It is not under content. And in Content -> table1.xml I have

<row><c1>*****</c1><c2>2017</c2><c3>******</c3><c4>0.394</c4><c5>Põllusaar</c5><c6>2016.11.02 08:46:32</c6><c7>&lt;gml:Polygon&gt;
\u0020\u0020&lt;gml:outerBoundaryIs&gt;
\u0020\u0020\u0020\u0020&lt;gml:LinearRing&gt;
\u0020\u0020\u0020\u0020\u0020\u0020&lt;gml:coordinates&gt;
\u0020\u0020\u0020\u0020\u0020\u0020&lt;/gml:coordinates&gt;
\u0020\u0020\u0020\u0020&lt;/gml:LinearRing&gt;
\u0020\u0020&lt;/gml:outerBoundaryIs&gt;
&lt;/gml:Polygon&gt;
</c7></row><row><c1>*****</c1><c2>2017</c2><c3>*****</c3><c4>0.019</c4><c5>Põllusaar</c5><c6>2016.11.02 08:46:32</c6><c7>&lt;gml:Polygon&gt;
\u0020\u0020&lt;gml:outerBoundaryIs&gt;
\u0020\u0020\u0020\u0020&lt;gml:LinearRing&gt;
\u0020\u0020\u0020\u0020\u0020\u0020&lt;gml:coordinates&gt;
\u0020\u0020\u0020\u0020\u0020\u0020&lt;/gml:coordinates&gt;
\u0020\u0020\u0020\u0020&lt;/gml:LinearRing&gt;
\u0020\u0020&lt;/gml:outerBoundaryIs&gt;
&lt;/gml:Polygon&gt;
</c7></row>

Where should be the coordinates?

PhillipTommerholt commented 5 years ago

Normally and according to the SIARD2 standard LOBs can be stored three places :

I think the standard of the DBPTK is to store the LOBS inside the SIARD table folder structure if nothing else has been declared. So you should be able to find the LOBs in the table folder for the specific table?

image

Laurira commented 5 years ago

Thank you Phillip for these explanations. Problem is that the lob folder "lob7" is not present.

lob7FolderMissing

PhillipTommerholt commented 5 years ago

It seems like a bug then. Have you tried to use the parameters in DBPTK to store lobs externally outside the .siard-file?

(edit: it also looks like there is no schema0? It might be related to the schema handling?)

chalkos commented 5 years ago

@Laurira can you attach/send me the log?

Laurira commented 5 years ago

I sent you the logs and siard to the email.

chalkos commented 5 years ago

@PhillipTommerholt by default, DBPTK stores small lobs in the XML, bigger lobs are stored in their own file (inside the specified folder). The folder is specified in the metadata because if a lob is saved in a file, that's where it will be. In this case all the LOBs were small, so they were all inlined in XML.

About the schema0, in the data files: rows' counting starts at 1, columns count also starts at 1, and the specification does not state that schemas/tables must start at 0, so to avoid confusion DBPTK always "starts counting" at 1.

The schemaX or tableX folder is specified in the <folder> element in the metadata (for schemas as well as tables).

@Laurira I have created a new issue for the bug, see #371