goat-community / goat

This is the home of Geo Open Accessibility Tool (GOAT)

Handling data pipeline from RawData to Production #1473

Closed metemaddar closed 1 year ago

metemaddar commented 1 year ago

We first collect the data (called raw data), and then we need to transfer it to the production database. While transferring, there are some issues we need to handle:

  1. Transferring the data can take a long time because it is large, so the server can go out of service. We need to reduce the transfer time.
  2. There are connected objects whose links get lost when we replace the data, so we need to handle this situation.

Example situation (Edge table)

To find a solution for this issue, we can work on a sample data flow for Edge data. The user produces WayModified objects that are connected to Edges.

Issues

Summary

In this issue we work out a data flow to create concrete tasks for updating data from RawData to Production.

metemaddar commented 1 year ago

An initial solution process for this data flow would look like this:

  1. Back up the Edge rows that are linked to WayModified into a new table (EdgeBackup)
  2. Duplicate way_modified.edge_id ---> way_modified.backup_edge_id
  3. Wipe the edge table
  4. Insert the new edges into the edge table
  5. Search for matches so that way_modified.edge_id can point to the newly inserted edges

In this scenario we should not reset the sequence counter of the Edge table, because over several data movements the IDs would change and we could end up with duplicated BackupEdge rows.
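
A rough SQL sketch of these steps, purely as an illustration (the table and column names basic.edge, basic.edge_backup, customer.way_modified and temporal.raw_edge follow the naming used in this thread, not necessarily the real schema):

-- 1. Back up the edges that are referenced by way_modified
CREATE TABLE basic.edge_backup AS
SELECT e.*
FROM basic.edge e
WHERE EXISTS (SELECT 1 FROM customer.way_modified w WHERE w.edge_id = e.id);

-- 2. Keep the old reference next to the current one
ALTER TABLE customer.way_modified ADD COLUMN IF NOT EXISTS backup_edge_id bigint;
UPDATE customer.way_modified
SET backup_edge_id = edge_id,
    edge_id = NULL;  -- assumes edge_id is nullable; otherwise the FK has to be relaxed first

-- 3. Wipe the edge table without resetting its id sequence (no TRUNCATE ... RESTART IDENTITY)
DELETE FROM basic.edge;

-- 4. Insert the new edges
INSERT INTO basic.edge (geom /* , other columns */)
SELECT geom /* , other columns */
FROM temporal.raw_edge;

-- 5. Re-link way_modified.edge_id to the newly inserted edges (matching step, discussed below)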

Then, if we can link way_modified to the newly inserted edges, we have three connected objects:

graph TD;
    WayModified-->Edge;
    WayModified-->BackupEdge;

We can compare BackupEdge and Edge and show the user what has changed, so they can accept the change or make the needed modifications to the WayModified.

metemaddar commented 1 year ago

Another solution: not wiping data (Thinking out of the box :package:)

An out-of-the-box solution would be to not delete the data at all. We can just modify the existing data instead of recreating it: load the specific data that we want to put into the database into an in-memory database, find the corresponding existing rows, and modify only those. New data (new edges, for example) is inserted at the end of the table.

This data comparison should go both ways: the existing data should also be checked against the new data to find broken modifications.
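
A minimal sketch of this "modify instead of recreate" idea, assuming a matching step has already paired raw rows with existing production rows in a hypothetical temporal.edge_match(raw_id, prod_id) table, and using class_id as a stand-in for the attribute columns:

-- Update rows that already exist in production instead of recreating them
UPDATE basic.edge e
SET class_id = r.class_id  -- attribute columns to carry over (class_id is just an example)
FROM temporal.raw_edge r
JOIN temporal.edge_match m ON m.raw_id = r.id
WHERE e.id = m.prod_id;

-- Append only the rows that have no match yet; they get fresh ids at the end of the table
INSERT INTO basic.edge (geom /* , other columns */)
SELECT r.geom /* , other columns */
FROM temporal.raw_edge r
WHERE NOT EXISTS (SELECT 1 FROM temporal.edge_match m WHERE m.raw_id = r.id);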

Benefits

This also has the benefit that it won't stop the server or pause the user: we might only make small modifications on the server, which may take almost no time.

Cons

The data movement itself may take more time.

Steps of the process :footprints:

There are two kinds of data:

For the data that is inside an area:
  1. we can load both datasets (present and new) into the in-memory database
  2. find matches
  3. do modifications
  4. write the modified data back to the database
For other data, we need another approach:
  1. Maybe we need to load the data in batches into the in-memory database
  2. Then do steps 2 onwards of the in-area steps :arrow_up:

At the end of the algorithm we need to tag the modified data (or take a backup of it) to inform the user about the changes.

To speed up

metemaddar commented 1 year ago

Record modified data

We can use some queries like this to record changes by time (ref):

INSERT INTO users_final (username, password, email, stamp_created)
  SELECT username, password, email, NOW()
  FROM users_temp
  WHERE id=8

Change ID column name

We can rename the ID column to parent_id, since we may have several modifications of the same record. This way we can have a fresh id column on the backup table.
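
For example, a variant of the backup step could keep the original id as parent_id and give the backup table its own id column (the table names here are illustrative):

-- Keep the original id as parent_id when creating the backup table
CREATE TABLE basic.edge_backup AS
SELECT id AS parent_id, geom /* , other columns */
FROM basic.edge;

-- Add a new, independent id column to the backup table
ALTER TABLE basic.edge_backup
    ADD COLUMN id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY;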

metemaddar commented 1 year ago

About not wiping data and just updating

As Elias says, some tables, like Edges, need to be wiped, because even a small change in one row can affect almost all the rows.

Shortening the interruption for the user

Still, we can reduce the interruption time by doing the updates per scenario. We can add a field like disable or disable_until to the scenario. When we start updating the data of a scenario, we first disable that scenario, so the client can tell the user that this scenario is under construction. With the disable_until field we can also inform the user how long the update will take, so they can organize their work. After completing one scenario, we move on to the next.
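
For example (the customer.scenario table name and the disable_until column are assumptions for this sketch):

-- Mark a scenario as disabled until its data update is expected to finish
ALTER TABLE customer.scenario
    ADD COLUMN IF NOT EXISTS disable_until timestamptz;

UPDATE customer.scenario
SET disable_until = now() + interval '30 minutes'  -- estimated duration of the update
WHERE id = 42;                                     -- hypothetical scenario id

-- Re-enable the scenario once the update is done
UPDATE customer.scenario
SET disable_until = NULL
WHERE id = 42;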

Disabling dependent endpoints

We need to disable all the endpoints connected to the disabled scenario, to prevent any failure while updating the database. These endpoints should return a suitable error while the scenario is disabled, for example a 503 with a brief description of the interruption.

:warning: We still need access to the disabled data in the Administration panel, so the administration endpoints should not be affected.

:bulb: To achieve this we can use a FastAPI middleware or a SQLAlchemy listener. We can first filter all queries for disabled scenarios; if a query comes back empty, we can query again to check whether the scenario is disabled and when it will be enabled again, and then produce a suitable error.

Concerning the ID counter

The thing that concerns me about this approach is the counter of the table IDs. The BIGINT data type in PostgreSQL can hold values up to +9,223,372,036,854,775,807. Maybe we need to calculate when we would exceed this number, but it is big enough that it should not need to be reset for a very long time.

Calculating the IDs

The biggest row count I see in the Dev database at the moment is 2,700,000 rows. Dividing the BIGINT maximum by this number gives 3,416,063,717,353, so we have that many updates available :heavy_multiplication_x: scenario_edge.
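
The same headroom calculation can be checked directly in PostgreSQL:

-- How many reloads of a 2,700,000-row table fit into the BIGINT id range
SELECT 9223372036854775807 / 2700000 AS possible_reloads;  -- 3416063717353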

metemaddar commented 1 year ago

For testing I copied the basic.edge to temporal.raw_edge. Here is the example query to find matches (ref):

SELECT t1.id, t2.id
FROM basic.edge AS t1
JOIN temporal.raw_edge AS t2
    ON ST_DWithin(t1.geom, t2.geom, 100)
    AND ST_Equals(ST_Snap(t1.geom, t2.geom, 100), t2.geom)
WHERE t1.id > 1000000 AND t1.id < 1100000
    AND t2.id > 1000000 AND t2.id < 1100000;

This query took 2m 35s to fetch.

metemaddar commented 1 year ago

After finding all matches, we will have four types of data, and for each we decide what to do with it (a classification sketch follows below):

  1. Fully matched: no changes needed
  2. Matched geom but updated other columns: add to the update process
  3. Data is in the production table but not in the raw table: mark as deprecated (maybe delete after taking a backup)
  4. Data is in the raw table but not in the production table: mark to append to the end of the production table

[Image: Raw data compare]
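
A sketch of how these four categories could be derived, assuming the matching step stores its results in a hypothetical temporal.edge_match(raw_id, prod_id) table and using class_id as a stand-in for the compared attribute columns:

-- Production rows: fully matched, attribute-updated, or deprecated
SELECT e.id AS prod_id, m.raw_id,
       CASE
           WHEN m.raw_id IS NULL                           THEN 'deprecated'         -- 3
           WHEN e.class_id IS NOT DISTINCT FROM r.class_id THEN 'fully_matched'      -- 1
           ELSE 'update_attributes'                                                  -- 2
       END AS action
FROM basic.edge e
LEFT JOIN temporal.edge_match m ON m.prod_id = e.id
LEFT JOIN temporal.raw_edge r   ON r.id = m.raw_id

UNION ALL

-- Raw rows with no match at all: new data to append to the production table
SELECT NULL, r.id, 'new'                                                              -- 4
FROM temporal.raw_edge r
WHERE NOT EXISTS (SELECT 1 FROM temporal.edge_match m WHERE m.raw_id = r.id);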

metemaddar commented 1 year ago

Progressive match

As matching the data takes a long time, it would be good to have some features:

  1. See the progress of the job
  2. Save the state if we need to stop, or if the job stops unexpectedly
  3. Resume from the previous state after a stop

To achieve this we can loop over the raw data: each time, choose a batch (for example 100 rows), run the search over it and save the result to a matching table. By querying the matching table we can see how much of the data has already been processed, and if the process is stopped we can continue from the previous state.

This lets us show the progress in the administration client, with options to pause and resume the calculation. For example, if another dataset with a higher priority comes up, we can easily pause this one, start the other, and then continue again afterwards.
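
One way to sketch the matching table and one batch step (all names here, including temporal.edge_match, are assumptions):

-- Match results double as a progress log
CREATE TABLE IF NOT EXISTS temporal.edge_match (
    raw_id     bigint PRIMARY KEY,
    prod_id    bigint,                 -- NULL when no production match was found
    checked_at timestamptz DEFAULT now()
);

-- One batch: take 100 raw edges that were not processed yet and record their matches
INSERT INTO temporal.edge_match (raw_id, prod_id)
SELECT r.id, n.id
FROM (
    SELECT id, geom
    FROM temporal.raw_edge re
    WHERE NOT EXISTS (SELECT 1 FROM temporal.edge_match m WHERE m.raw_id = re.id)
    LIMIT 100
) r
LEFT JOIN LATERAL (
    SELECT e.id
    FROM basic.edge e
    WHERE e.geom && r.geom
      AND ST_HausdorffDistance(e.geom, r.geom) < 0.00000001
    LIMIT 1
) n ON TRUE;

-- Progress: how much of the raw data has been processed so far
SELECT count(*) FILTER (WHERE prod_id IS NOT NULL) AS matched,
       count(*)                                    AS processed,
       (SELECT count(*) FROM temporal.raw_edge)    AS total
FROM temporal.edge_match;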

Optimize

Before starting the matching, we can pull the data from both tables (raw and production), or do the matching on the raw database. But this needs some optimization, following the PostGIS docs. We can also use the matching table to exclude data that has already been calculated.
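
One optimization that usually helps here, if it is not already in place, is a spatial index on the geometry columns so the bounding-box operator (&&) can use an index scan (basic.edge most likely has one already; the copied temporal.raw_edge may not; index names are illustrative):

CREATE INDEX IF NOT EXISTS idx_raw_edge_geom ON temporal.raw_edge USING gist (geom);
CREATE INDEX IF NOT EXISTS idx_edge_geom ON basic.edge USING gist (geom);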

EPajares commented 1 year ago

For testing I copied the basic.edge to temporal.raw_edge. Here is the example query to find matches (ref):

SELECT t1.id, t2.id
FROM basic.edge AS t1
JOIN temporal.raw_edge AS t2
    ON ST_DWithin(t1.geom, t2.geom, 100)
    AND ST_Equals(ST_Snap(t1.geom, t2.geom, 100), t2.geom)
WHERE t1.id > 1000000 AND t1.id < 1100000
    AND t2.id > 1000000 AND t2.id < 1100000;

This query took 2m 35s to fetch.

The query is indeed very slow. I would like to propose the following alternative.

WITH e AS 
(
    SELECT id, geom 
    FROM basic.edge 
    LIMIT 100
    OFFSET 100
)
SELECT e.id AS id1, n.id AS id2, e.geom AS geom1, n.geom AS geom2
FROM e
CROSS JOIN LATERAL 
(
    SELECT r.id, r.geom 
    FROM temporal.raw_edge r
    WHERE r.geom && e.geom
    AND ST_HAUSDORFFDISTANCE(r.geom, e.geom) < 0.00000001
        LIMIT 1
) n; 

Inside the CTE we are selecting 100 edges that are used for the loop. The number of edges can be adjusted, but it is suggested not to choose more than a couple of thousand edges at a time to avoid memory problems. We make use of the fact that an intersection of the bounding boxes of the geometries using && is sufficient to find a spatial intersection in this case. In the second step we apply ST_HausdorffDistance, which indicates the similarity between two geometries. This part could be adjusted to find almost equal edges.

For 100 edges I reach a computation time of 657 ms, so approximately 6.57 ms per edge. I still don't really consider this fast, but I think there might not be a faster way. We could also consider running this with parallel database queries in Python, as we can easily split the whole network into chunks using the LIMIT and OFFSET keywords.

metemaddar commented 1 year ago

Filling hierarchy

When we decide to fill a table, there can be situations where the table depends on another one, meaning the new data has dependencies that must first go into another table.

Example

An edge object depends on two nodes (source and target). So when we decide to copy an edge from another database, we first need to add its source/target to the Node table and then connect the newly inserted edge to the inserted nodes.

Automated solution

It is possible to determine the foreign keys of a table with SQL (see the query after this list). But some tables have foreign keys that are not real dependencies. For example, for edge we have the following foreign keys:

  1. edge_edge_id_fkey
  2. edge_scenario_id_fkey
  3. edge_source_fkey
  4. edge_target_fkey

But only the last two need to be filled, as edge_id and scenario_id were created just to keep similarity with edge_modified, for other reasons.
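
A possible query for listing a table's foreign keys and the tables they reference, using the PostgreSQL system catalogs:

-- Foreign keys of basic.edge and the tables they point to
SELECT con.conname             AS constraint_name,
       con.confrelid::regclass AS referenced_table
FROM pg_constraint con
WHERE con.contype = 'f'
  AND con.conrelid = 'basic.edge'::regclass;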

Do unneeded relationships stop us from using automated hierarchy determination?

I think not. When we try to fill unneeded relationships, the query returns zero rows and so it exits without a problem.

Implementing the hierarchy filling

As an example for Edge, we first need to fill the nodes: select the connected nodes from the source table, and after moving them to the prod table, map each node id at the source to the node_id at the target, so that after filling the Edge table we can connect them. We also need a map between raw_edge and prod_edge. Source and target are nullable, so we can fill the Edge table without a problem.

metemaddar commented 1 year ago

How to map newly inserted data IDs to the data in the raw DB?

  1. Store the last row ID of the target table
  2. Sort the raw data and then insert it into the target database
  3. Select from both the source and target tables, sorted by id, together with a counter; the target rows can be selected using the saved last row ID
  4. Now the raw data and target data can be mapped by joining the counters (see the sketch below).
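
A sketch of this mapping with row counters, assuming the raw rows were inserted in id order and the last pre-existing production id was saved beforehand (1000000 here is a placeholder):

-- Join the newly inserted production rows to the raw rows by their row counters
WITH new_prod AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM basic.edge
    WHERE id > 1000000                 -- the saved "last row id" before the insert
),
raw AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM temporal.raw_edge
)
SELECT raw.id AS raw_id, new_prod.id AS prod_id
FROM raw
JOIN new_prod USING (rn);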