mediacloud / news-search-api

Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).
https://mediacloud.org
GNU Affero General Public License v3.0
1 stars 3 forks source link

PROXY_BASE_URL #84

Closed pgulley closed 2 weeks ago

pgulley commented 3 weeks ago

Working on cleanup tasks- I've noticed a thread weaving though many requests is a "PROXY_BASE_URL"- but it looks like it was abandoned both on the configuration and implimentation end. I'm not sure what this originally was for or if it's meant to be doing something we expect, but right now it's just jetsam. Any feelings about excision?

pgulley commented 3 weeks ago

Defined here and dangling here - it looks like it was probably related to IA redirecting to their content mirror?

philbudne commented 3 weeks ago

This?

def proxy_base_url(req: Request): return f'{str(os.getenv("PROXY_BASE", req.base_url)).rstrip("/")}/{req.scope.get("root_path").lstrip("/")}' # type: ignore [union-attr]

I'm wondering this is because at IA, the Search API service is reached via a reverse proxy (ie; a load balancer and/or nginx), and they want to make sure all returned URLs are valid externally (or at least that all URLs also go thru the proxy).

While this isn't needed in our current configuration, it seems (to me) an inexpensive layer of indirection, and given that the project is only beginning to look at performance, one that might be a useful tool to have built in.

Looking at the implementation:

I suppose it might be slightly cheaper to do:

PROXY_BASE = os.getenv("PROXY_BASE") or ""

At top level, and then use it via

(PROXY_BASE or req.base_url)

I imagine the str() call is there to quiet mypy complaints, and might be avoided in the above formulation.

After that, if the "# type: ignore [union-attr]" is still needed, I'd prefer to see it applied to the simplest/smallest bit of code!

In my mind anyway, the "implementation" issues immediately above (ie; the results of quick and dirty suppression of mypy complains?) is a riper orchard for improvement than removing the PROXY_BASE variable.

Given our preference for specifying configuration via the environment (rather than baking a config.yml file into the image), the config file stuff seems like excess baggage that could be removed, but again, one person's trash is another person's treasure!

pgulley commented 3 weeks ago

Okay fair! I guess I could imagine a use-case in the context of a reverse proxy!

Just investigating the original context, in the blame history I see that this line 225: "article_url": f"{base}/{collection}/article/{hit['_id']}" is the only place that it ever seemed to get used- and that specific "article_url" feature doesn't exist anymore.

pgulley commented 2 weeks ago

Thinking long term, at this point my preference is to just take it out. Since the value is computed but never put to use right now (it's passed to format_match but never actually used) I'm not sure what we lose by removing it- and we gain some better separation of concerns between the two new components of the api.