oduwsdl / MemGator

A Memento Aggregator CLI and Server in Go
https://memgator.cs.odu.edu/api.html
MIT License
56 stars 11 forks source link

Potential to improve aggregation algorithm? #115

Closed machawk1 closed 5 years ago

machawk1 commented 5 years ago

https://github.com/oduwsdl/MemGator/blob/542d96df52fe32e384ef8185fe8e3cbb8ea942a0/main.go#L452-L466

...seems to indicate a form of "online" insertion sort, which seems like it would be temporally expensive. Is there the potential to optimize this algorithm for more efficient sorting of URI-Ms (and other metadata) from a set of TimeMaps to a combined one?

machawk1 commented 5 years ago

@ibnesayeed and I spoke in-person and I was very wrong. @ibnesayeed If you could provide some explanation here, I think it would do you implementation justice.

ibnesayeed commented 5 years ago

The algorithm used here is pretty efficient and optimized for the kind of data we deal with (where it is mostly linear in practice). We are exploiting the fact that most of the archives often return TimeMaps already sorted in chronological order. In that case it is simply a matter of traversing the linked list of a TimeMap from an upstream archive and the accumulated linked list in the same direction and merging nodes from the new one into the accumulated one in minimum number of traversals and comparisons. In rare cases when an archive provides data in random or reverse order, the cost will go a little up, but not at the time of merging, instead when that individual upstream TimeMap is being parsed and converted into a linked list. The reverse order can further be optimized by specifying the sort order in the archive description (in the JSON file), but I did not know any archives that return data in reverse chronological order, so never attempted to add more logic for that. There are some other optimizations in place when lists are being merged, for example assigning new linked list to the base (or accumulated liked list) if the base is empty or swapping the base with the new one if the new one is longer so that a short list is merged into a long one and not the other way, hence fewer number of comparisons.

That said, if you find any other potential optimization opportunities without increasing the memory footprint then please feel free to discus and implement them. TimeMaps can be very long, so memory will remain a concern when selecting a sorting algorithm.

ibnesayeed commented 5 years ago

@machawk1 I am closing this now, but please feel free to reopen if you find some issues in it.