globalgov / manydata

The portal for global governance data
https://manydata.ch
GNU Affero General Public License v3.0

Make `coalesce_compatible()` faster #228

Closed henriquesposito closed 1 year ago

henriquesposito commented 2 years ago

Currently, some consolidations in manyenviron take over 30 minutes, which is related to how we identify and coalesce duplicate rows.

Using `duplicated()` (with `fromLast = TRUE`) is rather slow (see https://stackoverflow.com/questions/37148567/fastest-way-to-remove-all-duplicates-in-r), so perhaps we can speed things up using `rle()` and `sort()` (see https://stackoverflow.com/questions/1923273/counting-the-number-of-elements-with-the-values-of-x-in-a-vector).
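As a rough illustration of the two approaches, here is a minimal base R sketch. The `keys` vector is a hypothetical stand-in for an ID column and is not taken from manyenviron. Whether the `rle()`/`sort()` route is actually faster on the real data would need benchmarking.

```r
# Hypothetical key vector standing in for an ID column (assumption).
keys <- c("A", "B", "A", "C", "B", "A", "D")

# Approach 1: flag every value that occurs more than once by scanning
# with duplicated() in both directions.
dup_flag <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
dups_duplicated <- sort(unique(keys[dup_flag]))

# Approach 2: sort once, then run-length encode to count occurrences.
runs <- rle(sort(keys))
dups_rle <- runs$values[runs$lengths > 1]

# Both approaches should identify the same set of duplicated keys.
stopifnot(identical(dups_duplicated, dups_rle))
```

Note that `rle(sort(x))` returns the duplicated values rather than row positions, so it suits counting or listing duplicates more than flagging individual rows.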

henriquesposito commented 1 year ago

The latest upgrades to `consolidate()` bypassed `coalesce_compatible()` to fix issues with matrix size (memory usage), which already made the function faster. However, there is still room for improvement, as consolidating large databases can still take up to 1 hour...

henriquesposito commented 1 year ago

The main issue slowing down `consolidate()`, now that it works, was the huge number of NAs introduced in the key variable during the full joins. This was solved by dropping the extra NA rows before "filling" any missing values. Doing so does not generally affect the outcome or the logic of the function, as the first non-missing value is used for rows when they are resolved by key. We can also favour certain datasets when consolidating if we want to.
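To make the idea concrete, here is a minimal base R sketch of dropping rows with a missing key before taking the first non-missing value per key. It is not the actual `consolidate()` code. The `joined` data frame, its column names, and the `first_non_na()` helper are hypothetical.

```r
# Hypothetical result of a full join (assumption): 'key' identifies an
# observation, and some rows have no key at all.
joined <- data.frame(
  key   = c("A", "A", NA, "B", NA, "B"),
  title = c(NA, "Treaty A", "Orphan", "Treaty B", NA, NA),
  year  = c(1990, NA, 2001, NA, NA, 1985),
  stringsAsFactors = FALSE
)

# Step 1: drop rows whose key is missing, so they are never processed.
keyed <- joined[!is.na(joined$key), ]

# Hypothetical helper: first non-missing value of a vector
# (returns a typed NA if every value is missing).
first_non_na <- function(x) x[!is.na(x)][1]

# Step 2: within each key, keep the first non-missing value per column.
pieces <- lapply(split(keyed, keyed$key), function(d) {
  as.data.frame(lapply(d, first_non_na), stringsAsFactors = FALSE)
})
resolved <- do.call(rbind, pieces)
rownames(resolved) <- NULL
```

Because the first non-missing value wins, the row order going into step 2 determines which dataset is favoured when sources disagree.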

Consolidating a large database such as `manyenviron::agreements`, taking all rows and columns, now takes about 1 minute on my machine (it previously took almost 1 hour). However, consolidating very large databases such as `manyenviron::memberships` still takes a while... I am looking into this issue.

I have also added info messages for the steps the function is working through.

jhollway commented 1 year ago

Would it help accelerate the `manyenviron::memberships` consolidation to outsource the identification of distinct agreements to a first consolidation of `manyenviron::agreements`?