arnaud-morvan commented 7 years ago

QGIS Enhancement: Features writing in existing layers

Date 2017/02/13

Author Arnaud Morvan (@arnaud-morvan)

Contact arnaud dot morvan at camptocamp dot com

maintainer @arnaud-morvan

Version QGIS 3.2

Summary

The idea here is to provide a convenient way to insert / update features in existing layers using fields mapping and primary keys definition.

For the time of writing, I would highlight these two uses cases:

copy / paste of features
use of processing to insert into / update exiting layers

Advanced copy / paste of features

For now, when pasting features to an existing layer, there is an automatic fields mapping made, based on fields names. This is really simple to use, but in case the fields does not have exactly the same names on the source and destination layers, this result in a data loss. So in such case, user need, before to make the copy / paste operation, to manually refactor the source fields (using processing for example), into a temporary layer.

It would be really more user friendly to popup, on the paste command, a dialog that allow the user to define the fields mapping. This fields mapping could consist of an expression that should be evaluated on source features for each destination field.

We can also think about adding a primary key selection to allow updating existing features in the destination layer.

Processing output to existing layers

For now, the processing plugin is limited to writing new layers as output. This is a huge limitation as users oftenly want to insert or update features in existing layers like other ETL do.

We can think to use the same core and UI parts as proposed for advanced copy / paste of features.

Optional use of vector layer edit buffer

I can see two different methods to insert new features or update existing features in an existing layer.

When pasting features to an already loaded layer (which should already be in edit mode), it seems normal to use the layer edit buffer to see the result in canvas before to save the changes in data provider.

When using processing with a huge amont of features, we would prefer writing directly to vector data provider, with a defined transaction size to correctly manage the memory used.

Note that for now QgsProcessingUtils::createFeatureSink make use of QgsProxyFeatureSink, QgsVectorFileWriter or QgsVectorLayerExporter depending on the type of underlying destination. QgsVectorLayerExporter is also used by "ImportIntoPostGIS" and "ImportIntoSpatialite" algorithms.

Database providers (db2, mssql, geopackage, oracle, postgresql and spatialite) also make use of QgsVectorLayerExporterTask to handle "drag and drop" in the QgsBrowserDockWidget.

Proposed Solution

Core part

I would propose to define a core QgsVectorLayerUpdater class similar to QgsVectorLayerExporter but with some more options concerning the fields map and primary key definition.

class QgsVectorLayerUpdater : public QgsFeatureSink
{
public:
  QgsVectorLayerUpdater( const QgsVectorLayer &destinationLayer,
                         const QMap<QString, QgsExpression> &fieldsMap,
                         const QgsExpressionContext &context,
                         const QList<String> &primaryKeys = QList(),
                         bool bypassEditBuffer = false,
                         );
  bool addFeature( QgsFeature &feature, QgsFeatureSink::Flags flags = 0 ) override;
  bool addFeatures( QgsFeatureIterator &features, QgsFeatureSink::Flags flags = 0 ) override;
  bool addFeatures( QgsFeatureList &features, QgsFeatureSink::Flags flags = 0 ) override;
};

For each feature added, the updater will:

evaluate each expression in the fields map to get values for output attributes
search for a feature in destination layer for the corresponding primary keys
update the corresponding feature or insert a new one

If primaryKeys is an empty list, updater will threat all features by inserting them as new ones.

When bypassEditBuffer is set to true, the updater will insert/update the features directly in layer underlying provider instead of using the layer edit buffer.

Note that in case of features coming from QgsClipboard, we do not have a proper layer expression context, but we can use the project context instead.

UI part

Copy/Paste dialog

Concerning the copy / paste operation I would propose a dialog that takes as parameters:

Source QgsFields coming from QgsClipboad
Destination layer QgsFields.

This dialog would present a QTableView to the user with one row for each destination field and three columns:

Source expression (QgsExpression, QgsExpressionFieldWidget)
Destination field (fixed string)
Primary key (Boolean, QCheckbox)

class GUI_EXPORT QgsFieldsMappingModel : public QAbstractTableModel, public QgsExpressionContextGenerator
{
  public:
    QgsFieldsMappingModel( const QgsFields& srcfields, const QgsFields& dstfields, QObject *parent = 0 );
    QMap<QString, QgsExpression> fieldsMap() const;
    QList<QString> primaryKeys() const;
    ...
    QVariant data( const QModelIndex& index, int role = Qt::DisplayRole ) const override;
    ...
}

Processing algorithm

On Processing side, UI will be made through a new algorithm "UpdateExistingLayer" with parameters :

QgsProcessingParameterVectorLayer( "destination" )
QgsProcessingParameterFieldsMap( "fieldsMap", parentLayer="destination" )
QgsProcessingParameterField( "primaryKeys", defaultValue=QList(), allowMultiple=True)
QgsProcessingParameterBoolean( "bypassEditBuffer" )

and maybe outputs for :

number of updated features
number of inserted feature
updated features
inserted features

So we need to add a new core processing parameter class QgsProcessingParameterFieldsMap.

Note that we could refactor RefactorFields algorithm to use QgsVectorLayerUpdater to evaluate the fields map, but we cannot use the same QgsProcessingParameterFieldsMap as in that case we need to define precisely the output field's names and types to create the output layer.

Affected Files

src/app/qgisapp.cpp src/core/processing/qgsprocessingparameter.h src/core/processing/qgsprocessingparameter.cpp

new files:

src/analysis/processing/qgsalgorithmupdateexistinglayer.h src/analysis/processing/qgsalgorithmupdateexistinglayer.cpp src/core/qgsvectorlayerupdater.h src/core/qgsvectorlayerupdater.cpp

src/gui/qgsfieldsmappingpanel.h src/gui/qgsfieldsmappingpanel.cpp src/gui/qgspastefeaturesdialog.h src/gui/qgspastefeaturesdialog.cpp

Performance Implications

In copy/paste operations, we will need to evaluate an expression for each destination field and each pasted feature. But as far as this expression is only a field name, this should not have sensible implication on performance.

We should take care of performances when searching for existing features in the destination layer using primary keys.

Further Considerations/Improvements

Interaction with task manager ?

Allow saving of fields mapping / primary keys for later use.

Backwards Compatibility

Nothing relevant

Issue Tracking ID(s)

(optional)

Votes

(required)

haubourg commented 7 years ago

Thanks Arnaud, this need is quite inlined with some other needs here. We would like to extend processing to handle insert / update / delete with key mapping to achieve the beginning of ETL features in processing. Having that in Copy Paste is good, and forces to have low level methods in core, not in Processing only. ping @m-kuhn : Do you already have something being baked in that area?

nyalldawson commented 7 years ago

@haubourg - just a heads up in case you haven't seen the announcements on the Dev lists - I'm shortly starting work on porting the base of processing across to c++. It may be worthwhile holding off on your work until this is done, as it's possible certain classes will look quite different after this. I'd hate for someone to start work on anything heavily involved in processing before this is complete and result in duplicate efforts.

m-kuhn commented 7 years ago

Thanks for tackling this. I don't have any immediate plans to work on this. More some rough ideas.

I think the general approach and strategy looks very good. Below some general thoughts and questions concerning the architecture.

Looking at the core part:

class QgsFeatureWriter
{
  Q_OBJECT

public:
  QgsFeatureWriter();

  void writeFeatures(QgsVectorLayer* layer, QgsFeatureList features, QMap<QString, QString> fieldsMap, QList<QString> primaryKeys);
  void writeFeatures(QgsVectorDataProvider* provider, QgsFeatureList features, QMap<QString, QString> fieldsMap, QList<QString> primaryKeys);
};

Is this thought to be a global object? Did you consider to create a new writer per job as QgsFeatureWriter( QgsVectorLayer* layer ) instead? This approach could potentially trigger a layer lock or similar things during the lifetime of the object.
How will this play with QgsVectorFileWriter? I think there is quite a bit of overlap in functionality between the two, it would be great if they could share some code, share some API or even be unified.
Is there an alternative to exposing the QgsVectorDataProvider functionality? Maybe a bool autocommit or a bool bypassEditBuffer flag (the second one would have the same effect as the current solution but be more explicit IMO)? I have always considered it bad practice to directly write to the data provider because
- no signals are emitted from the vector layer about added features
- other metadata (like extents, ...) in the layer might get stale
Should the mapping of fields take place inside the writer? No strong objection, just thinking about the fact, that in processing it's a separate piece of code "refactor fields" right now. This is somehow related to the above comment about QgsVectorFileWriter, I'd hate to have the same remapping code in those two classes.
Only make things a Q_OBJECT if it's required (e.g. makes use of signals, slots or other MOC features) ;)

nyalldawson commented 7 years ago

I'd also strongly suggest using a feature iterator in place of QgsFeatureList. If you're writing with large layers the list will require storing them all in memory, whereas an iterator will allow processing feature by feature.

haubourg commented 7 years ago

@nyalldawson no worries Nyall, no work started at all, I just raised recently expressed needs that seem to match for different users. Thanks for the reminder for the on-going refactoring.

nyalldawson commented 7 years ago

@haubourg just want to stress that I'm not discouraging your work - more investment in processing is definitely appreciated! I've got a feeling that processing in qgis 3 will be a seriously competitive etl tool...

haubourg commented 7 years ago

Related feature request: #16638

arnaud-morvan commented 6 years ago

We plan to update this QEP with an new subclass of FeatureSink for writing in an existing layer. This new FeatureSink should take as parameter a map from input fields to output fields.

arnaud-morvan commented 6 years ago

@haubourg @nyalldawson @m-kuhn : I've updated this QEP about writing features in existing layers using processing. This may be more precise now.

nyalldawson commented 6 years ago

QgsVectorLayerUpdater( const QgsVectorLayer &destinationLayer, const QMap<QString, QgsExpression> &fieldsMap, const QgsExpressionContext &context, const QList &primaryKeys = QList(), bool bypassEditBuffer = false, );

You'll need to add coordinate transform handling here too, so that features can be appended to a layer with a different CRS.

nyalldawson commented 6 years ago

I must admit I don't fully understand your plans for processing changes.

Wouldn't we want to allow any output from an algorithm to be potentially added to an existing layer, instead of having a dedicated algorithm just for this?

arnaud-morvan commented 6 years ago

Yes, but I do not have any idea on how to introduce this in UI, we need field map and primary key definitions. For example in FME writers have their own UI and parameters.

m-kuhn commented 6 years ago

Could we proceed with this (separate algorithm) approach and potentially integrate this transparently into the algorithm handling? I.e. in the background, separate algorithms will be working but the configuration interface does not require the user to go through the model builder.

arnaud-morvan commented 6 years ago

I note that with a separate update algorithm, we will be able to save outputs parameters (fields mapping and primary keys) in models. We could have models usable from scratch for regular data imports.

And the update algorithm could be used standalone after any other processing (another algorithm or model) when we already known source fields.

So this seems to be a good compromise.

nyalldawson commented 6 years ago

I actually think there's scope for both a separate algorithm for this, AND also an option to save the output from any algorithm into an existing layer.

Yes, but I do not have any idea on how to introduce this in UI, we need field map and primary key definitions. For example in FME writers have their own UI and parameters.

What I was thinking was an extra option in the drop down menu for destinations (alongside the current save to temporary file/save to file/save to postgres/etc options) for "append to layer". Choosing this would fire up the new field mapper dialog.

What I'm unsure about with both a separate algorithm and the "append" option outlined above is how you plan on determining what fields the output layer will have? E.g. if I have a model, which has a number of steps preceding the append algorithm, and these include algorithms which alter the fields (e.g. adding an autoincrementing field, refactoring fields, etc), how will you handling configuring the field mapper options in a model?

We don't currently have any way for an algorithm to indicate how it will affect the input layer's fields, so at the least we'll need this added. But then in models there's the extra complication that you don't know the input layer fields at the time of creating a model, so how will you be able to setup the field mapping in the model itself?

arnaud-morvan commented 6 years ago

Yes, it will make really more sense if we can get output fields at model design time (in case of separate algo in a model) or before running an algo (in case of new output). But this suppose something like defining a spec like algorithms should be able to give output fields and geometry type after prepareAlgorithm, and supporting this in all existing algorithms.

In the current situation, it may is possible to run the model/algo once to get a first layer with the needed output fields, and it may be possible to use the fields from that layer as input fields for the fields mapper UI. But this may not be trivial for the user.

arnaud-morvan commented 6 years ago

We could also think to execute a dry run to get an empty output layer in memory, but this may be impossible for external tools algorithms which output shapefile for example.

m-kuhn commented 6 years ago

To have the fields of the target in configuration context would be awesome. A QgsFeatureSink processing output could have a fields property (could also be optional so we can incrementally introduce it on existing algorithms), this could be calculated based on input configuration (exisiting fields + an additional one for statistics, much depending on the config for refactor fields...).

I would go for an option without dry-run first and only do that if we find a place where it's inevitable and sensible to do.

This would offer awesome new configuration possibilities.

haubourg commented 6 years ago

@arnaud-morvan For processing part, I would be need some gui mockups to fully understand qhat you have in mind. In FME setting the WHERE clause is made here : https://knowledge.safe.com/storage/attachments/4050-formatparameters-fme-db-operation.jpg

How would you do that in the algorithm?

arnaud-morvan commented 6 years ago

My first idea was limited to have a primary key selection, using checkboxes near field relation selectors that should respond to 95% of uses cases. I've never used the WHERE clause in FME, and I do not have precise idea how we can do that in QGIS, it may be more complex to realize.

3nids commented 6 years ago

@arnaud-morvan just wondering if you have a rough idea of when this would land in QGIS?

arnaud-morvan commented 6 years ago

For now, our funding is freezed on this topic due to administrative concerns. I hope we could start on this before the summer.

m-kuhn commented 6 years ago

It would be great to have that in QGIS 3.2, what do you think about making it a grant proposal?

qgis / QGIS-Enhancement-Proposals

QEP 89: Feature writing in existing layers #89

QGIS Enhancement: Features writing in existing layers

Summary

Advanced copy / paste of features

Processing output to existing layers

Optional use of vector layer edit buffer

Proposed Solution

Core part

UI part

Copy/Paste dialog

Processing algorithm

Affected Files

Performance Implications

Further Considerations/Improvements

Backwards Compatibility

Issue Tracking ID(s)

Votes