Decouple default archives list from online source

machawk1 commented 8 years ago

MemGator currently looks to http://git.io/archives on startup by default. If git.io goes down, MemGator has no default list of archives. Coupling a local service's functionality to a remote online resource is bad. MemGator ought to work with smart defaults without relying on this resource.

INFO: 2016/09/08 22:02:01.250427 Initializing MemGator:1.0-rc5...
INFO: 2016/09/08 22:02:01.250524 Loading archives from http://git.io/archives
FATAL: main.go:831: Error reading list of archives (http://git.io/archives): Get http://git.io/archives: dial tcp: lookup git.io: no such host

ibnesayeed commented 8 years ago

I thought about it earlier when MemGator was born and concluded that the current approach is the most practical, though perhaps not ideal. The other approaches I considered are the following:

A default archive list in the code would remove the reliance on an external service or a config file, but would require updates to the binary each time something changes in the curated list of default archives.
A fallback archive list in the code, used only when reading the default registry fails, would behave magically: users might not notice which archives are actually being aggregated unless they pay careful attention to the startup message. Additionally, it would suffer from the same problem as the previous approach.
A local default archive list file would add extra steps to installing and running the tool. This would make the tool less portable, and the out-of-the-box experience would suffer. Additionally, curated changes to the list of archives would not reach users' machines, as not everyone keeps up with changes in the web archiving sphere.

For advanced users, it is almost always better to use a custom or local archives file and not rely on an external curated list of archives that might go down. Luckily, this service is hit only once, at startup; the tool then caches the list of archives in memory for the entire session. Additionally, a failure to read the curated list file results in a fatal error with a precise message explaining what went wrong.

That said, if you have any other mechanism that might work better in this case, please feel free to propose it.

machawk1 commented 8 years ago

Not the best solution, but to mitigate the effect that git.io has on MemGator instances, would it be possible to consult a secondary or even tertiary source that mirrors the information at git.io?

ibnesayeed commented 8 years ago

That is doable, but it would add sync overhead: we would have to find a few distinct hosts where we can keep copies of the curated list of archives and update the content without changing the URI (e.g., Gist won't work here). The current source is part of the repository, so it's easy for anyone on our team to update it; with another hosting service, it might not be as easy for all of us to have write access. The other thing to consider is that other sources should not be tried if the --archives flag is explicitly set by the user, even if the custom value is the same as the default.
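
The last point, telling an explicitly set --archives apart from its default value, could be sketched roughly as below. This assumes the standard library flag package (MemGator's actual flag handling may differ), and the mirror URLs and the archiveSources helper are hypothetical names invented for illustration.

```go
package main

import "flag"

// defaultArchives is the default registry mentioned in this thread.
const defaultArchives = "http://git.io/archives"

// mirrors is a hypothetical list of extra hosts; no such mirrors exist today.
var mirrors = []string{
	"https://mirror-one.example.org/archives.json",
	"https://mirror-two.example.org/archives.json",
}

// archiveSources (hypothetical helper) parses args and returns the sources
// to try, in order. Mirrors are appended only when --archives was not given
// on the command line.
func archiveSources(args []string) ([]string, error) {
	fs := flag.NewFlagSet("memgator", flag.ContinueOnError)
	archives := fs.String("archives", defaultArchives, "path or URL of the archives list")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	// Visit only walks flags that were actually set, so explicit stays
	// false when the default value was merely filled in.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "archives" {
			explicit = true
		}
	})

	if explicit {
		// The user chose a source, even if it equals the default: no fallbacks.
		return []string{*archives}, nil
	}
	return append([]string{*archives}, mirrors...), nil
}
```

With no arguments this would return the default followed by the mirrors; with --archives http://git.io/archives it would return only that one source, even though the value matches the default.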

machawk1 commented 7 years ago

https://github.com/jteeuwen/go-bindata might help with this, see http://rachbelaid.com/embedding-assets-in-go-project/ . I think having the JSON data built into the binary at compile time is a good, safe default instead of having many people's binaries relying on an online file that you can manipulate.

ibnesayeed commented 7 years ago

I have considered embedding the default list inside the code (we don't even need binary data embedding for that), but shipping default data has its own implications, which I described above.
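
For illustration, embedding without any binary-data tooling could be as small as a string constant parsed at startup. This is a minimal sketch under assumptions: the Archive struct, its two fields, and the example.org entries are placeholders, not the real schema or contents of the curated list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Archive holds only the fields this sketch needs. The field names are
// assumptions; the real archives list may look different.
type Archive struct {
	ID      string `json:"id"`
	Timemap string `json:"timemap"`
}

// builtinArchives is placeholder data compiled into the binary.
const builtinArchives = `[
  {"id": "example-one", "timemap": "https://archive-one.example.org/timemap/link/"},
  {"id": "example-two", "timemap": "https://archive-two.example.org/timemap/link/"}
]`

// loadBuiltinArchives (hypothetical helper) parses the compiled-in list.
func loadBuiltinArchives() ([]Archive, error) {
	var list []Archive
	if err := json.Unmarshal([]byte(builtinArchives), &list); err != nil {
		return nil, fmt.Errorf("parsing built-in archives: %w", err)
	}
	return list, nil
}
```

The trade-off described above still applies: any change to the curated list would require a new binary.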

ibnesayeed commented 2 weeks ago

TIL https://pkg.go.dev/embed

It does not really solve the concerns I have about shipping the default list with the binary, but it handles the embedding rather neatly. So if we do plan to bundle the list with the binary, we can go with this standard library package instead.

machawk1 commented 1 week ago

Somewhat related is the upcoming sunsetting of the .io TLD, re: https://www.bbc.com/news/articles/c98ynejg4l5o

Would it be more persistent to host this at the same TLD as the source code, e.g., on GitHub.com as a raw file from the repo? Are there content-type concerns with doing this? A fallback URL would also be good. I have some fallback logic in Mink, which relies on a MemGator endpoint, to fall back if/when the ODUCS instance goes down. It seems like the same problem at a different scale.

Is there currently an action item to this GitHub issue? It's over 8 years old.

oduwsdl / MemGator

Decouple default archives list from online source #88