neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Feature: Incremental database update #10

davidmezzetti closed this issue 4 years ago

davidmezzetti commented 4 years ago

Currently, the ETL processes assume each run is a full database reload. This works well for smaller datasets but is inefficient for larger ones.

Add the ability to set the path to an existing database and copy unmodified records from the existing source. This way only new/updated records are processed each run.

SQLite needs a system for reading and inserting articles/sections from another database.
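A minimal sketch of what that SQLite copy step could look like, using only the standard `sqlite3` module. The two-table schema, the `copy_unmodified` function name and the use of an `entry` date to decide whether a record is unmodified are all assumptions made for illustration, not the actual paperetl schema or API.

```python
import sqlite3

# Hypothetical, simplified schema (assumption, not the real paperetl tables)
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (id TEXT PRIMARY KEY, title TEXT, entry TEXT);
CREATE TABLE IF NOT EXISTS sections (id INTEGER PRIMARY KEY, article TEXT, text TEXT);
"""

def copy_unmodified(existing, target, incoming):
    """
    Copies articles and their sections from an existing database into the
    target database when the stored entry date matches the incoming one.

    existing, target: open sqlite3 connections
    incoming: dict of article id -> entry date for the current run

    Returns the list of article ids that still need full processing.
    """
    target.executescript(SCHEMA)

    pending = []
    for uid, entry in incoming.items():
        # Assumption: a matching id and entry date means the record is unmodified
        row = existing.execute(
            "SELECT id, title, entry FROM articles WHERE id = ? AND entry = ?",
            (uid, entry),
        ).fetchone()

        if row is None:
            # New or updated article, leave it for the regular ETL process
            pending.append(uid)
            continue

        # Copy the article row, then its sections in their original order
        target.execute("INSERT INTO articles (id, title, entry) VALUES (?, ?, ?)", row)
        sections = existing.execute(
            "SELECT article, text FROM sections WHERE article = ? ORDER BY id",
            (uid,),
        ).fetchall()
        target.executemany("INSERT INTO sections (article, text) VALUES (?, ?)", sections)

    target.commit()
    return pending
```

With something along these lines, a run would open the existing database with `sqlite3.connect(path)`, call `copy_unmodified`, and only parse the articles whose ids come back in the returned list.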

Elasticsearch already handles most of this. It just needs a small change to only create the articles index if it doesn't already exist. Merges will be handled by Elasticsearch based on the article id.
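The index check could be sketched roughly as follows. This assumes `client` is an `Elasticsearch` instance from the elasticsearch-py package (or any object exposing the same `indices.exists` and `indices.create` calls); the `ensure_index` helper name is hypothetical and no index settings or mappings are shown.

```python
def ensure_index(client, name="articles"):
    """
    Creates the index only if it does not already exist.

    client: assumed to be an elasticsearch-py Elasticsearch instance
    name: index name, "articles" per the description above

    Returns True if the index was created, False if it was already present.
    """
    if client.indices.exists(index=name):
        # Keep the existing index so previously loaded articles are preserved
        return False

    client.indices.create(index=name)
    return True
```

Because the existing index is left in place, documents indexed afterwards with an id that is already present replace the stored copy, while articles that are not part of the run remain untouched.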