informagi / GeeseDB

Graph Engine for Exploration and Search
MIT License

Fix posting sort order and memory issues in CIFF exporter #30

Closed: gijshendriksen closed this 1 year ago

gijshendriksen commented 1 year ago

This PR solves two issues with the CIFF exporter:

  1. The postings were sometimes sorted incorrectly, resulting in negative gaps in the CIFF posting lists.
  2. Loading the entire DuckDB result set for all posting lists at once could cause memory issues for large collections. Loading (and writing) them in batches seems to solve this issue.
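
To illustrate the two points above, here is a minimal sketch; it is not the code from this PR. The table name `postings` and the columns `term_id`, `doc_id` and `tf` are hypothetical, and the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra dependencies (DuckDB's Python API offers a `fetchmany` call as well). Ordering by term and then by document ID keeps every gap non-negative, and `fetchmany` holds only one batch of rows in memory at a time.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter


def iter_rows(cursor, batch_size=2):
    """Yield rows batch by batch instead of calling fetchall()."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


def posting_lists(cursor, batch_size=2):
    """Yield (term_id, [(gap, tf), ...]) with delta-encoded document IDs."""
    rows = iter_rows(cursor, batch_size)
    for term_id, group in groupby(rows, key=itemgetter(0)):
        previous = 0
        postings = []
        for _, doc_id, tf in group:
            gap = doc_id - previous
            # A negative gap means the rows were not sorted by doc_id.
            assert gap >= 0, "rows must be ordered by doc_id within a term"
            postings.append((gap, tf))
            previous = doc_id
        yield term_id, postings


# Hypothetical schema and data, inserted in deliberately unsorted order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (term_id INTEGER, doc_id INTEGER, tf INTEGER)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [(1, 7, 2), (0, 5, 1), (0, 2, 3), (1, 3, 1), (0, 9, 1)],
)
# The ORDER BY clause is what guarantees non-negative gaps.
cursor = conn.execute(
    "SELECT term_id, doc_id, tf FROM postings ORDER BY term_id, doc_id"
)
result = list(posting_lists(cursor))
```

In this sketch only one posting list plus one batch of rows is held in memory at any time, so each list can be written out before the next one is read.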

It also adds a progress bar to the exporter (toggled by the --verbose flag), which shows how far the export has progressed and helps with estimating the remaining time.
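
A progress bar toggled by a flag could look roughly like the sketch below. It assumes tqdm's `disable` parameter is used for the toggle; the `export` function and its body are hypothetical placeholders, not the exporter's real code. The import fallback is only there so the sketch also runs where tqdm is not installed.

```python
import argparse

try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch runs without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable


def export(items, verbose=False):
    """Process items, showing a progress bar only when verbose is set."""
    written = 0
    for _ in tqdm(items, total=len(items), desc="Exporting", disable=not verbose):
        written += 1  # placeholder for writing one posting list
    return written


parser = argparse.ArgumentParser()
parser.add_argument("--verbose", action="store_true", help="show a progress bar")
# An empty argument list is passed here; a real CLI would call parse_args().
args = parser.parse_args([])
count = export(range(1000), verbose=args.verbose)
```

Passing `total` lets tqdm show a time estimate even when the iterable is a generator without a known length.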

gijshendriksen commented 1 year ago

Ah, for some reason tqdm was working for me locally, even though it is not in the requirements.txt. @chriskamphuis would it be fine to add that as a dependency? Or should I remove the progress bar so we don't need it at all?

chriskamphuis commented 1 year ago

Adding tqdm in requirements and setup is fine!

gijshendriksen commented 1 year ago

Alright, thanks! Added it to the dependencies :+1: