sayanroyc / Spartan_Server


Search (Users, Items..) #29

Closed sayanroyc closed 8 years ago

sayanroyc commented 8 years ago

When the user and item databases become very large, queries will need to be made much more efficient. This will be a problem some time in the future. Find an efficient method of returning query results. MapReduce?

nickgarfield commented 8 years ago

I think this is something pretty far off. For example, even if we got half of the US population (150,000,000 people) listing 5 items each, all on one data store, that's about 750 million items. Most of the time the client actually has the id of the entity it needs, which I believe is a constant-time lookup using the get_by_id method (I haven't confirmed that, but it seems like it should be the case). When it doesn't, even assuming a worst-case search of O(n) to, say, find all of a user's items, it shouldn't take an unreasonable amount of time. I could be wrong, but asking Datastore to search through ~750 million entities might take at most a couple seconds, and this only occurs at login time, which is a reasonable amount of time to wait for logging in.
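
Rough sketch of the two access paths I mean (the Item model fields here are just placeholders, not our actual schema):

```python
from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()
    owner_id = ndb.IntegerProperty()

# Direct key lookup: effectively constant time, no scan involved.
item = Item.get_by_id(12345)

# Property query: Datastore walks the index for owner_id, so cost scales with
# the number of matching entities rather than the full ~750 million.
users_items = Item.query(Item.owner_id == 42).fetch(100)
```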

nickgarfield commented 8 years ago

Further, I think there is some logic we can write which determines which geographic data center our data is hosted at. If this gets to be a problem, we could store data at multiple data centers, reducing the number of entities stored and searched through on any one of them.

sayanroyc commented 8 years ago

I'm thinking of the case where the user is searching for an item by typing keywords and we want to return, say, 10 possible items. You can't filter with substrings, so my current process is (sketched below):

1. Query the db and return some large number of items (let's call this ItemsFetchedPerQueryCycle)
2. Search item names/descriptions for the substring
3. Once 10 matches are found, return them
4. If 10 matches are not yet found, query the db for another ItemsFetchedPerQueryCycle
5. Repeat until 10 are found
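
In code, the loop looks roughly like this (ItemsFetchedPerQueryCycle and the Item fields are placeholders, not actual project code):

```python
from google.appengine.ext import ndb

ITEMS_FETCHED_PER_QUERY_CYCLE = 200
MATCHES_WANTED = 10

class Item(ndb.Model):
    name = ndb.StringProperty()
    description = ndb.TextProperty()

def search_items(substring):
    matches, cursor, more = [], None, True
    substring = substring.lower()
    while more and len(matches) < MATCHES_WANTED:
        # 1) pull the next batch of items
        batch, cursor, more = Item.query().fetch_page(
            ITEMS_FETCHED_PER_QUERY_CYCLE, start_cursor=cursor)
        # 2) filter in application code, since Datastore can't match substrings
        for item in batch:
            if substring in (item.name or '').lower() or \
                    substring in (item.description or '').lower():
                matches.append(item)
                if len(matches) == MATCHES_WANTED:
                    break
    # 3)-5) return up to 10 matches; the loop repeats until we have them or run out
    return matches
```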

nickgarfield commented 8 years ago

Can you use these query strings?

https://cloud.google.com/appengine/docs/python/search/query_strings

nickgarfield commented 8 years ago

I keep accidentally tapping "close and comment"

sayanroyc commented 8 years ago

We're using ndb (Datastore), not the Search API

sayanroyc commented 8 years ago

http://stackoverflow.com/questions/23317280/appengine-search-api-vs-datastore

One solution offered there is to combine the two: the Search API is easy to query for pieces of strings and for geolocations, while Datastore holds the rest of the data.

nickgarfield commented 8 years ago

Wow.. Just learned a couple things.

For one, we could do basic text search using just the beginning of the term according to this post http://stackoverflow.com/questions/17702958/ndb-querying-results-that-start-with-a-string?rq=1

Two: We shouldn't use blobstore for images. It's a superseded storage option (like db is compared to ndb). https://cloud.google.com/appengine/docs/python/storage. One option we have in place of blobstore might be the Images API https://cloud.google.com/appengine/docs/python/images/ but I haven't looked into it.

Three: After reading your last post and this one: http://stackoverflow.com/questions/26926701/ndb-querying-results-that-contain-a-string, I was beginning to wonder if Datastore is even the right choice for storage at all. The other option would be Google Cloud Storage, but that doesn't seem to fit our application's needs. Google Cloud Storage appears to be for large files like documents, movies, videos, etc. In our case, most of our entities are just small collections of numbers and strings.

In order to use the Search API, we would need to store just the item names and item descriptions in the Search index and everything else in Datastore. I feel like this would quickly get pretty complicated and confusing?

The simplest solution seems to be attempting a text search using the beginning of the item name, as described in the first link.
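
If we went that route, the query from that post would look something like this (assuming name is an indexed StringProperty):

```python
from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()

def items_with_name_prefix(prefix, limit=10):
    # Two inequality filters on the SAME property are allowed, so everything
    # >= "Dyson" and < "Dyson\ufffd" is a name that starts with "Dyson".
    return Item.query(Item.name >= prefix,
                      Item.name < prefix + u'\ufffd').fetch(limit)
```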

nickgarfield commented 8 years ago

Also, StringProperties like the item name are supposed to be indexed. Part of that means the string is tokenized and split according to whitespace, punctuation, etc. So, I could be wrong, but I think if the item name is "Dyson Vacuum" and you search with a query matching item names to "Vacuum", the item with the name "Dyson Vacuum" should be returned.

sayanroyc commented 8 years ago

Indexing does not tokenize or split the string. Indexing only allows us to order by that attribute.

sayanroyc commented 8 years ago

I don't think it's as easy as your simple solution. If we do that, searching for "vacuum" won't return "Dyson vacuum". Also, I want to search the item's description too, not just the name.

As for our storage dilemma, check this out: https://cloud.google.com/docs/storing-your-data

nickgarfield commented 8 years ago

So according to that link, datastore is still the best choice for our application?

One thing we could do is add a repeated StringProperty to each item called key_words and then tokenize the name and description to create the key_words. We can then query by matching keywords. I think this would satisfy what you're trying to do.

https://cloud.google.com/appengine/docs/python/ndb/queries#repeated_properties

To reduce the number of redundant keywords, we could create a list of redundant words like (a, an, the, this, that, etc.) that we don't add as key_words.
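
Something like this, maybe (tokenize and STOP_WORDS are just illustrative names, not existing code):

```python
import re
from google.appengine.ext import ndb

STOP_WORDS = {'a', 'an', 'the', 'this', 'that', 'and', 'or', 'of', 'in', 'on'}

def tokenize(text):
    # Split on whitespace/punctuation, lowercase, drop the redundant words.
    return [w for w in re.split(r'\W+', (text or '').lower())
            if w and w not in STOP_WORDS]

class Item(ndb.Model):
    name = ndb.StringProperty()
    description = ndb.TextProperty()
    key_words = ndb.StringProperty(repeated=True)  # one index entry per word

    def update_key_words(self):
        self.key_words = list(set(tokenize(self.name) + tokenize(self.description)))

# A repeated property matches if ANY of its values equals the filter value,
# so this returns "Dyson Vacuum" for a search on "vacuum":
results = Item.query(Item.key_words == 'vacuum').fetch(10)
```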

sayanroyc commented 8 years ago

I think the GeoPt property in ndb is relatively useless; there are no functions to calculate/query for all items within a radial distance. In ndb, the GeoPt property is stored as (lat, lon), and when indexed, it is ordered by lat first and then lon. Converting miles to latitude/longitude looks like a bitch when I googled it.

So the Search API looks more and more useful because it has a distance query that can return all items within a certain distance (in meters, which can easily be converted to miles). It might be a good idea to store the searchable text fields (item name and maybe description) and geolocations with Search, and everything else in Datastore.

http://stackoverflow.com/questions/13112161/use-the-datastore-ndb-the-search-api-or-both-for-views-on-data

https://cloud.google.com/appengine/docs/python/search/query_strings#Python_Queries_on_geopoint_fields
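
If we did go the Search route, I think the searchable slice would look roughly like this (index name, field names, and the radius are all made up here, not existing code):

```python
from google.appengine.api import search

ITEM_INDEX = search.Index(name='items')

def index_item(item_id, name, description, lat, lon):
    # Only the searchable bits live in the Search index; the rest of the
    # Item entity stays in Datastore, keyed by the same id.
    ITEM_INDEX.put(search.Document(
        doc_id=str(item_id),
        fields=[
            search.TextField(name='name', value=name),
            search.TextField(name='description', value=description),
            search.GeoField(name='location', value=search.GeoPoint(lat, lon)),
        ]))

def items_near(text, lat, lon, meters):
    # distance() in query strings is in meters; ~1609 meters per mile.
    query = '%s AND distance(location, geopoint(%f, %f)) < %d' % (
        text, lat, lon, meters)
    return ITEM_INDEX.search(query)
```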

nickgarfield commented 8 years ago

This seems like it would work for the geopoint problem: http://stackoverflow.com/questions/32941147/indexing-ndb-geopt-property-in-google-app-engine

I just feel really skeptical about storing some data in Datastore and other bits in the Search index just so we can leverage the Search API on some properties but not others, and then writing more code to try to keep the two storage schemes consistent.. Especially since we would basically be splitting a single entity (the Item class) across two different storage platforms. It just feels like an overly complex solution considering we have almost everything else working already using Datastore.

To me, it seems like using the Search API would make the one line of code which creates the query a bit nicer, but there's a shit ton of other work and cost to make that one line simpler. And it seems like there are other solutions we could try while continuing to use just one data storage scheme.

Those solutions being: writing our own code to index the description and title text and storing those key words as a repeated StringProperty, and indexing the locations as a bounding box like described in the link above to get locational queries.

nickgarfield commented 8 years ago

Actually, can't we store the longitude and latitude points as floats and then query with a bounding box? Like combining "lon < p.x+10" AND "lon > p.x-10" AND "lat < p.y+10" AND "lat > p.y-10"

https://cloud.google.com/appengine/docs/python/ndb/queries#repeated_properties

sayanroyc commented 8 years ago

Yeah, that's what I thought we could do instead. Posting some approximate conversions here:

Length of 1° of longitude = cos(latitude) × length of a degree at the equator = cos(latitude) × 69.172 miles

Length of 1° of latitude ≈ 69.172 miles

http://www.colorado.edu/geography/gcraft/warmup/aquifer/html/distance.html
http://geography.about.com/library/faq/blqzdistancedegree.htm
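
So a radius-to-degrees helper would basically just be the approximations above in code:

```python
import math

MILES_PER_DEGREE = 69.172  # one degree of latitude, or of longitude at the equator

def miles_to_degree_offsets(latitude, miles):
    """Return (lat_offset, lon_offset) in degrees for a distance in miles."""
    lat_offset = miles / MILES_PER_DEGREE
    lon_offset = miles / (math.cos(math.radians(latitude)) * MILES_PER_DEGREE)
    return lat_offset, lon_offset
```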

sayanroyc commented 8 years ago

Okay, so I've been messing with returning the items within a bounding box. A GeoPt object has variables "lat" and "lon", but you cannot query on them because they are not properties of the "Item" class. I then ditched GeoPt and tried using two separate variables for lat and lon, but you are not allowed to have inequality filters on multiple properties.

One hacky way to do it would be to create two separate queries, one for items with their latitude within our desired radius and one for items with their longitude within our radius, then return any items that appear in both of those queries.
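
Roughly this (lat/lon as plain FloatProperties; the names are placeholders, and keys_only keeps the intersection cheap):

```python
from google.appengine.ext import ndb

class Item(ndb.Model):
    lat = ndb.FloatProperty()
    lon = ndb.FloatProperty()

def items_in_box(lat, lon, lat_off, lon_off):
    # Datastore only allows inequality filters on one property per query,
    # so run one query per axis and intersect the key sets in memory.
    lat_keys = set(Item.query(Item.lat > lat - lat_off,
                              Item.lat < lat + lat_off).fetch(keys_only=True))
    lon_keys = set(Item.query(Item.lon > lon - lon_off,
                              Item.lon < lon + lon_off).fetch(keys_only=True))
    return ndb.get_multi(lat_keys & lon_keys)
```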

sayanroyc commented 8 years ago

Solution for returning n matching items at a time, then fetching more if the user scrolls down farther: https://cloud.google.com/appengine/docs/python/search/results. Look at the "Using Offsets" section.
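
If we end up on the Search API, the offset version would be something like this (index name and page size are made up):

```python
from google.appengine.api import search

ITEM_INDEX = search.Index(name='items')
PAGE_SIZE = 10

def search_page(query_string, page):
    # Page 0 returns results 0-9, page 1 returns 10-19, and so on.
    options = search.QueryOptions(limit=PAGE_SIZE, offset=page * PAGE_SIZE)
    return ITEM_INDEX.search(
        search.Query(query_string=query_string, options=options))
```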

sayanroyc commented 8 years ago

The search function returns a list of ids of all items that contain the desired string. Need to implement taking a random subset of those ids and fetching the Datastore values and scores. Once a user reaches the bottom of that random subset, fetch another random subset of the ids.
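
The random-subset part could be as simple as this (names are placeholders, not actual project code):

```python
import random
from google.appengine.ext import ndb

def fetch_random_subset(item_ids, already_shown, n=10):
    # item_ids: all ids returned by the search; already_shown: ids the user has seen.
    remaining = [i for i in item_ids if i not in already_shown]
    chosen = random.sample(remaining, min(n, len(remaining)))
    return ndb.get_multi([ndb.Key('Item', i) for i in chosen])
```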

nickgarfield commented 8 years ago

Duplicate #21