Really looking forward to this feature
Would it be possible to specify which columns are used to evaluate whether the data is a duplicate or not?
Use case: you are listening to a market data feed, and one of the columns (which is also the designated timestamp) is the time the data was received. You want to prevent any data loss (for example during releases), so you run the market data listener in multiple replicas, but you don't want duplicate data in QuestDB. Because the arrival timestamp differs on each listener instance, the full rows will never be identical. The ability to specify which columns are taken into account by the de-duplication logic would therefore help a lot, and such support should probably be simple to implement.
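A minimal sketch of the kind of per-column configuration being asked for, assuming the exchange's event timestamp (rather than the arrival time) is used as the designated timestamp; the table, the columns and the DEDUP clause are illustrative only, not an existing feature:

```sql
-- Hypothetical, illustrative syntax: declare which columns form the
-- de-duplication key so the per-replica arrival timestamp is ignored.
CREATE TABLE trades (
    event_ts   TIMESTAMP,   -- exchange timestamp, part of the dedup key
    arrival_ts TIMESTAMP,   -- differs per listener replica, NOT part of the key
    symbol     SYMBOL,
    price      DOUBLE,
    size       DOUBLE
) TIMESTAMP(event_ts) PARTITION BY DAY
DEDUP UPSERT KEYS(event_ts, symbol, price, size);
```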
Let's break deduplication up into scenarios:
Scenario 1: the exact same data arrives into the database via different routes. One example is the one from @postol: the exact same trade arrives into the database via different feed handlers. Another example is position broadcast: a moving object broadcasts its position via both a satellite and a terrestrial receiver, and the same data arrives into the database from both sources at different times.
In this scenario the database has to realise that this data is already stored. The data row, however, can contain values from both the original source of the data and the intermediaries, such as the satellite, the terrestrial antenna or a feed handler. The user should be able to specify which columns contain values from the original source, and also a criterion for which data "route" they prefer.
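On the read side, the "which route do I prefer" part could be expressed with a window function. This is a generic SQL sketch under assumed names (a positions table with a route column and a terrestrial-over-satellite preference), not an existing QuestDB feature:

```sql
-- Generic SQL sketch: keep one row per original event (vehicle_id, event_ts),
-- preferring the terrestrial route over the satellite route when both arrived.
SELECT vehicle_id, event_ts, lat, lon, route
FROM (
    SELECT
        p.*,
        ROW_NUMBER() OVER (
            PARTITION BY vehicle_id, event_ts   -- columns from the original source
            ORDER BY CASE route WHEN 'terrestrial' THEN 1 ELSE 2 END
        ) AS rn
    FROM positions p
) ranked
WHERE rn = 1;
```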
Scenario 2: sensor data. A sensor ticks at the same frequency, but the reading does not always change. In this case storing consecutive identical values adds no information. For example, these rows are duplicates and can be stored as one row:
sensor | timestamp | value |
---|---|---|
A | 00:00:00.001000 | 120 |
A | 00:00:00.002000 | 120 |
A | 00:00:00.003000 | 120 |
A | 00:00:00.004000 | 120 |
Stored as:
sensor | timestamp | value |
---|---|---|
A | 00:00:00.001000 | 120 |
Meaning that since `00:00:00.001000` until now, the value of sensor `A` is `120`.
At the same time, if the sensor "blips" and goes back to normal, like so:
sensor | timestamp | value |
---|---|---|
A | 00:00:00.001000 | 120 |
A | 00:00:00.002000 | 120 |
A | 00:00:00.003000 | 130 |
A | 00:00:00.004000 | 120 |
De-duplicated data would look like:
sensor | timestamp | value |
---|---|---|
A | 00:00:00.001000 | 120 |
A | 00:00:00.003000 | 130 |
A | 00:00:00.004000 | 120 |
To give the same information: as of `00:00:00.004000`, the value of sensor `A` is `120`.
The logic that can be employed in Scenario 1 cannot be employed here.
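A read-side sketch of collapsing such consecutive repeats, written in generic SQL with the lag() window function; the readings table and its columns are made up, and window-function support in QuestDB may differ:

```sql
-- Generic SQL sketch: keep a row only when the value differs from the sensor's
-- previous reading, so runs of identical readings collapse to their first row.
SELECT sensor, timestamp, value
FROM (
    SELECT
        sensor,
        timestamp,
        value,
        LAG(value) OVER (PARTITION BY sensor ORDER BY timestamp) AS prev_value
    FROM readings
) t
WHERE prev_value IS NULL OR value <> prev_value;
```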
Scenario 3: database data refill. Let's imagine we published some data into the database in batches and then found out that some data is missing. So we re-source one of the batches we published, this time with the new data included. We could simply republish the entire batch, but we would like the database not to duplicate the data that's already there and only add what wasn't.
In this scenario the de-duplication logic is based on the entire row and can leverage the timestamp field, which is "indexed". It is somewhat similar to Scenario 1, but does not require user-provided configuration.
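Without built-in de-duplication, such a refill can be expressed as an anti-join against the existing data. This is a generic SQL sketch with assumed names (batch_staging holding the republished batch), not QuestDB-specific syntax:

```sql
-- Generic SQL sketch: insert only the rows from the republished batch that are
-- not already present, comparing the full row.
INSERT INTO readings (sensor, timestamp, value)
SELECT s.sensor, s.timestamp, s.value
FROM batch_staging s
WHERE NOT EXISTS (
    SELECT 1
    FROM readings r
    WHERE r.sensor    = s.sensor
      AND r.timestamp = s.timestamp
      AND r.value     = s.value
);
```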
I read what is written here two or three times (btw very much looking forward to this feature).
My logic is that for many users, the way they consume data will not actually make it possible to take any advantage of this feature. For some users it will be possible, though, and that's the target audience.
Here is a real example: a user consumes trade-by-trade data from an exchange. The exchange labels each trade with an ID. Often trades from the exchange will be truly identical and differ only in the ID, so the ID is the only way to know whether a row is a duplicate; if there is no ID, there is no real way to know and we can only guess. In the case with a trade ID we can take full advantage of the feature (provided the comment by @postol is taken into account). If there is no ID, we can't really do much.
P.S. There should be Prometheus endpoints counting the number of duplicate rows found.
Mainly the timeseries column.
We have an ingestion job which fetches a log file from S3 and inserts every row into QuestDB. We use the file name as the job ID. Suppose a file ABC.log with 10k records was fetched from S3. Our Java application triggered a job with ID ABC.log and started inserting these rows into QuestDB. Around midway, the job failed due to an unhandled exception, so about 5k records were inserted and 5k were not. We have a retry mechanism, so all failed jobs are eventually resubmitted. The job with ID ABC.log was re-triggered and this time all 10k records were inserted. But since 5k records were inserted on the first try and 10k on the second, I now have 15k rows in my table, 5k of which are duplicates. If there was a way to skip duplicate records based on some columns, it would be very helpful.
How are we handling this currently? For now we have a workaround. We have a column isValid in the table, and a row is valid only when this column is set to true. When a job is resubmitted we fire an update query to set isValid = false where jobId = 'ABC.log', so out of those 15k records 5k are no longer valid and only 10k are valid. This works fine for now, but it doesn't seem like a very efficient solution.
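For reference, the workaround described above boils down to something like this; the jobId and isValid columns come from the comment, while the table name logs is assumed:

```sql
-- Invalidate all rows from the previous, partially failed attempt of the job,
-- then re-insert the full batch; readers filter on isValid = true.
UPDATE logs
SET isValid = false
WHERE jobId = 'ABC.log';
```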
Especially when recording sensor values in realtime, short outages may occur, resulting in small gaps. Later, you typically receive a complete dataset (e.g. recorded on-board), which must then be imported to fill these gaps. Obviously you only want to ingest the missing data and don't want to duplicate the already existing rows.
Released in 7.3, see https://github.com/questdb/questdb/pull/3566
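For anyone finding this later: in the released feature, deduplication is configured per WAL table via UPSERT KEYS. The snippet below is a sketch of what that looks like as of 7.3, with example table and column names; the linked PR and the documentation are the authoritative reference:

```sql
-- Create a WAL table with de-duplication on (ts, symbol): rows matching on
-- these keys upsert the existing row instead of being stored twice.
CREATE TABLE trades (
    ts     TIMESTAMP,
    symbol SYMBOL,
    price  DOUBLE,
    amount DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY WAL
DEDUP UPSERT KEYS(ts, symbol);

-- Or enable it on an existing WAL table:
ALTER TABLE trades DEDUP ENABLE UPSERT KEYS(ts, symbol);
```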
Summary
Fully identical data rows will be removed at `commit()`.
Intended outcome
Duplicate data arriving from multiple sources will be silently discarded before hitting the disk. Such data should not fuel storage costs or increase query times.