leichter / cwrx

(Old) Cinema6 API

Update host search process for sites #296

Closed leichter closed 9 years ago

leichter commented 9 years ago

Our way of handling sites' host properties should be more robust - we should allow sites to be created with host properties containing subdomains (e.g. different sites for games.usatoday.com and movies.usatoday.com), and we shouldn't make too many assumptions about the structure of the urls (such as assuming the top-level domains will be 2-4 characters).

Worst comes to worst, when searching for the site using the request origin, we can just retrieve all sites from mongo and do multiple searches through the list as necessary. First, we should see if we can cleverly restructure the host property to allow for the kind of flexibility we need.

leichter commented 9 years ago

Hey @sqmunson, @howardengelhart, @minznerjosh, I think I figured out a better way of dealing with the hosts that should allow us the flexibility we need.

When creating a site, proshop (and the site service) should allow any valid hostname (basically, anything that matches /^([\w-]+\.)+[\w-]+$/). This will get saved to the database as the host property.
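As a rough illustration of that check, here is a minimal sketch. The regex is the one quoted above; the helper name `isValidHost` is hypothetical and not taken from proshop or the site service.

```javascript
// Pattern quoted above: one or more dot-terminated labels, then a final label.
const HOST_PATTERN = /^([\w-]+\.)+[\w-]+$/;

// Hypothetical helper: true when `host` could be saved as a site's host property.
function isValidHost(host) {
    return typeof host === 'string' && HOST_PATTERN.test(host);
}

console.log(isValidHost('games.usatoday.com'));   // true
console.log(isValidHost('usatoday'));             // false: no dot
console.log(isValidHost('http://usatoday.com'));  // false: scheme is not part of a hostname
```

Note that `\w` also admits underscores, so the pattern is slightly more permissive than strict DNS hostname rules.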

When attempting to look up the site to fill in the branding and placementId on an experience, the content service will take the hostname of the current origin, split it by '.', and construct a query to mongo covering that hostname and each of its parent domains. So for example, for the origin http://foo.bar.baz.com/, we would search mongo for any sites that match {host: 'foo.bar.baz.com'}, {host: 'bar.baz.com'}, or {host: 'baz.com'}. This query should be efficient as long as we're indexing on the host field. If multiple results are found, the service would then do a quick post-processing step to choose the site with the longest host property (which is the host that most closely matches the current origin).
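A minimal sketch of how those candidate hosts and the query could be built. The function names `candidateHosts` and `buildSiteQuery` are hypothetical, and using `$in` rather than an explicit `$or` of three clauses is an assumption; the two forms match the same documents.

```javascript
// Hypothetical helper: list the origin's hostname and each parent domain,
// stopping before the bare top-level label (so 'com' alone is never queried).
function candidateHosts(origin) {
    const hostname = new URL(origin).hostname;
    const labels = hostname.split('.');
    const hosts = [];
    for (let i = 0; i < labels.length - 1; i++) {
        hosts.push(labels.slice(i).join('.'));
    }
    return hosts;
}

// Hypothetical helper: a Mongo filter matching any site whose host is a candidate.
function buildSiteQuery(origin) {
    return { host: { $in: candidateHosts(origin) } };
}

console.log(candidateHosts('http://foo.bar.baz.com/'));
// [ 'foo.bar.baz.com', 'bar.baz.com', 'baz.com' ]
```

A single-label hostname such as `localhost` yields an empty candidate list, so the query matches no sites.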

This should allow us to handle any valid domain name, even if it uses funky top-level domains and country codes. It will also allow us to have sites with specific subdomains and fallbacks: so we could have site configs for sports.usatoday.com and movies.usatoday.com, and then a fallback usatoday.com for anything else rooted under their site.

Let me know if you think this will work, or if you can think of any counterexamples that would break this.

minznerjosh commented 9 years ago

:+1:

My only concern would be if there was a compelling reason to have all the subdomain entries stored in the same document for organizational/analytics purposes.

howardengelhart commented 9 years ago

How does the ordering of the query / storage of the data affect things, if at all? For instance, in your example, if the db has a site with aaa.bar.baz.com and a site with baz.com, would the indexing ensure that baz.com is not returned before aaa.bar.baz.com? Or would the query return both, and the content service work out the logic to pick the right one?

Also, if you are going to run queries on a site-by-site basis, then I'd like to be sure that there is some intelligent caching capability to ensure we're not needlessly taxing the db (because there are lots of different MRs / widgets popping up within the same site).

leichter commented 9 years ago

The ordering of the query and storage shouldn't affect things. Mongo would return all sites with hosts that match any of the host options from the query, potentially in a different order each time; so, I'm going to include logic in the content service to pick the correct site from the subset of results returned (the logic I mentioned above, where we pick the site with the longest host, since this indicates the most complete match).
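That selection step could look roughly like the sketch below. The name `chooseSite` is hypothetical, and the only assumption about the result documents is that each carries a string `host` property.

```javascript
// Hypothetical helper: from the sites Mongo returned (in any order),
// keep the one with the longest host, i.e. the most specific match.
function chooseSite(sites) {
    return sites.reduce(
        (best, site) => (best === null || site.host.length > best.host.length ? site : best),
        null
    );
}

const results = [{ host: 'baz.com' }, { host: 'aaa.bar.baz.com' }];
console.log(chooseSite(results).host); // aaa.bar.baz.com
console.log(chooseSite([]));           // null when nothing matched
```

Because every returned host is a suffix of the same origin hostname, no two distinct hosts in the result set can have the same length, so the longest one is unambiguous.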

When querying for sites, I'm running it through the QueryCache, which will cache results (within the service) based on the unique query sent to mongo; so the origins foo.bar.com and bar.com would each result in a separate cache entry (in other words, we would cache based on unique hostname). I think that's about as intelligent as we can get while guaranteeing correct results. Of course, there's also CloudFront caching.
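To illustrate the idea of caching per unique query, here is a small in-memory sketch. It is not the actual QueryCache from cwrx; the class name `SimpleQueryCache`, its constructor arguments, and the TTL behaviour are all assumptions made for the example.

```javascript
// Hypothetical cache: results are stored under the serialized query,
// so two different hostnames produce two separate entries.
class SimpleQueryCache {
    constructor(fetch, ttlMs) {
        this.fetch = fetch;       // function(query) -> result (or a promise of one)
        this.ttlMs = ttlMs;       // how long an entry stays fresh, in milliseconds
        this.entries = new Map();
    }

    get(query) {
        // Key order matters to JSON.stringify, so queries must be built consistently.
        const key = JSON.stringify(query);
        const hit = this.entries.get(key);
        if (hit && hit.expires > Date.now()) {
            return hit.value;
        }
        // A rejected promise would stay cached until it expires; real code
        // would likely evict on failure.
        const value = this.fetch(query);
        this.entries.set(key, { value, expires: Date.now() + this.ttlMs });
        return value;
    }
}

let dbCalls = 0;
const cache = new SimpleQueryCache(() => { dbCalls++; return Promise.resolve([]); }, 60000);
cache.get({ host: { $in: ['foo.bar.com', 'bar.com'] } });
cache.get({ host: { $in: ['foo.bar.com', 'bar.com'] } }); // served from the cache
cache.get({ host: { $in: ['bar.com'] } });                // different query, new entry
console.log(dbCalls); // 2
```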