I'm unclear what the goal is: replace the DirectoryApi object in https://github.com/mediacloud/api-client/blob/main/mediacloud/api.py with another library?
But on the subject of "directory issues":
Something related that I've been concerned about is whether the current data model (Source === Domain) can work long-term with the current (and expanding) set of users maintaining the directory.
The basis of my concern is that the values in the Source.name and url_search_string fields have been shown to have (to be kind) an exciting level of variation, which, in combination with limited validation, has led to duplicate sources being created (including ones with http:// in the "name" field, which will never return search results), with the new sources then being added to collections.
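To make the failure mode concrete, here's a minimal sketch of the kind of field validation that would catch these entries at creation time. The function name and the specific rules are hypothetical, not the directory's actual schema; the real rule set would need agreement from the people maintaining it:

```python
import re

def validate_source_fields(name: str, url_search_string: str) -> list[str]:
    """Return a list of problems; an empty list means the fields look sane.

    Hypothetical rules illustrating the two failure modes described above.
    """
    problems = []

    # A scheme in the name field is the known bad case: sources created with
    # "http://..." as the name will never return search results.
    if re.match(r"^https?://", name.strip(), re.IGNORECASE):
        problems.append("name looks like a URL; expected a domain or human-readable name")
    if not name.strip():
        problems.append("name is empty")

    # Variation in url_search_string is what produces duplicates; at minimum,
    # reject schemes and whitespace so "nytimes.com" and "http://nytimes.com "
    # can't both be created as distinct sources.
    if re.match(r"^https?://", url_search_string, re.IGNORECASE):
        problems.append("url_search_string should not include a scheme")
    if re.search(r"\s", url_search_string):
        problems.append("url_search_string contains whitespace")

    return problems
```

For example, `validate_source_fields("http://nytimes.com", "nytimes.com")` would flag the name field before the duplicate ever lands in a collection.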
On the Source/Domain front: when I was looking at how we might use sitemaps, I looked at a bunch of important sources we might not be getting data for, and I came up with three that have, over time, changed their primary domain (London Times, BBN, and Huffington Post). The only way our current data model can handle this is to create another source, add it to all the same collections as the original, and hope and pray that any time anyone adds an afflicted source to another collection, they add both sources. Now, it might just be that I found the only three examples of such a thing that will ever happen, but considering that we have 1.5 million sources, I can't imagine there aren't undiscovered cases, and ones that will occur in the future. We may be pulling stories right now from domains that don't appear in the directory at all, which is what I've been calling "dark matter".

And yes, I'm sure we could work around this by adding a source field that references the "primary source" entry, but that seems like a jury-rigged solution. The sources table needs to be split into sources and domains, where each domain belongs to exactly one source, and the source name field is (ideally) a human-readable string.
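For a concrete sense of what that split could look like, here's a rough sketch in SQLAlchemy terms. The table and column names are purely illustrative, and the real change would need a migration plan for the 1.5 million existing rows:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Source(Base):
    """One publisher, identified by a human-readable name rather than a domain."""
    __tablename__ = "sources"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)  # e.g. "The Huffington Post"

    # Every domain this source has published under, past and present.
    domains = relationship("Domain", back_populates="source")

class Domain(Base):
    """A domain belongs to exactly one source; a source can have many domains."""
    __tablename__ = "domains"

    id = Column(Integer, primary_key=True)
    domain = Column(String, unique=True, nullable=False)  # e.g. "huffpost.com"
    source_id = Column(Integer, ForeignKey("sources.id"), nullable=False)

    source = relationship("Source", back_populates="domains")
```

Under a model like this, a publisher changing its primary domain is just a second row in domains, and collection membership stays attached to the one source, so nobody has to remember to add both entries everywhere.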
It's funny to me: I always used to hear about people who got hot and bothered about data modeling minutiae, but I guess I've become one, at least for data that I spent significant time trying to rationalize. The truth is that the only way I could keep my sanity in the preparation of the sources tables for the new system was to repeat to myself "you just need to leave it better than you found it", because it was just too large, with too many dark corners to ever make fully rational!
And for validation, the question is where it should be done: in the Web UI backend, in CSV ingest of sources, in API calls? The most encompassing solution might be "everyone goes through an API endpoint", and to put the validation there...
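If that's the route taken, the shape might be something like this Flask sketch (hypothetical endpoint, payload, and checks; the point is only that the UI backend and the CSV ingester would both POST here rather than writing to the table directly):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def validate_source(payload: dict) -> list[str]:
    """The single validation choke point, shared by every write path."""
    problems = []
    if payload.get("name", "").lower().startswith(("http://", "https://")):
        problems.append("name looks like a URL")
    if not payload.get("url_search_string", "").strip():
        problems.append("url_search_string is required")
    return problems

@app.route("/api/sources", methods=["POST"])
def create_source():
    problems = validate_source(request.get_json(force=True))
    if problems:
        # Reject bad rows at the door instead of cleaning them up later.
        return jsonify({"errors": problems}), 400
    # ... persist the validated source here ...
    return jsonify({"status": "created"}), 201
```

A CSV ingest would then just loop over rows and POST each one, getting the same rejections a UI user would see.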
And an itty-bitty issue for me: the UI sources/collections search doesn't seem to be "free text". If you enter multiple words, they need to appear in the same order they do in the name of the source/collection. One that still confounds me is a UK National and Local collection that I can never remember how to enter!
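To spell out the difference, here's a toy illustration (not the actual search implementation, and the collection name is made up) of order-sensitive matching versus the any-order matching I'd expect:

```python
def ordered_match(query: str, name: str) -> bool:
    """Roughly what the UI seems to do: query words must appear in name order."""
    words = iter(name.lower().split())  # shared iterator enforces ordering
    return all(any(q in w for w in words) for q in query.lower().split())

def free_text_match(query: str, name: str) -> bool:
    """What I'd expect: every query word appears somewhere, in any order."""
    words = name.lower().split()
    return all(any(q in w for w in words) for q in query.lower().split())

name = "United Kingdom - National and Local"   # made-up collection name
print(ordered_match("local national", name))    # False: words out of order
print(free_text_match("local national", name))  # True: both words present
```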
Making an issue to put this on folks' radar. The point of this repository is to centralize logic for tracking and acting on issues in the directory as a library package, since we don't have all of that logic in one place. Right now that looks like some beefed-up class representations of Collections and Sources, and a framework for adding new "issue" logic. The API is demoed in the various notebooks I've included.
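For a stripped-down sketch of what such a framework could look like (the class and function names here are hypothetical; the actual API is what's demoed in the notebooks): a registry of checks, each one a small function, so new issue logic is just a new decorated function:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Source:
    """Stand-in for the beefed-up Source representation."""
    name: str
    url_search_string: str

# Registry of issue checks: each takes a Source and returns a description
# of the problem, or None if the source looks fine.
ISSUE_CHECKS: list[Callable[[Source], str | None]] = []

def issue_check(fn: Callable[[Source], str | None]) -> Callable[[Source], str | None]:
    """Decorator that registers a new piece of issue logic."""
    ISSUE_CHECKS.append(fn)
    return fn

@issue_check
def url_in_name(source: Source) -> str | None:
    if source.name.lower().startswith(("http://", "https://")):
        return "name field contains a URL"
    return None

def find_issues(sources: list[Source]) -> dict[str, list[str]]:
    """Run every registered check over every source and report the hits."""
    report = {}
    for s in sources:
        problems = [p for check in ISSUE_CHECKS if (p := check(s))]
        if problems:
            report[s.name] = problems
    return report
```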
Big question: it occurs to me that other kinds of directory interactions might go here. Is "directory-tools" a better name? I'd like to get a quick review of the architecture decisions I made before going too much further.