Open AlmightyOatmeal opened 8 years ago
Thank you for your thorough issue report. The problem is entirely with Flask-Restless, not your database configuration.
1-2 minutes is preposterous for a query of only 10 results. I'm a firm believer in, and advocate of, flask-restless, so I will do anything within my power to help investigate.
The performance is atrocious because I had no idea what I was doing when I started this project, and before long it had become used by way more people than I had ever anticipated :).
The includes and excludes don't modify the query at all, so what's the point of specifying which columns you want when flask-restless queries everything under the sun?
This is a long-standing issue with Flask-Restless. The includes/excludes are applied at serialization time, not at query time. If you see an easy way to fix this, it would be a welcome pull request.
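For what it's worth, a sketch of what query-time field restriction could look like using SQLAlchemy's `load_only()` loader option (the `Post` model here is a hypothetical stand-in, not Flask-Restless code): deferred columns are simply never selected from the database.

```python
# Sketch only: restricting columns at query time with load_only(),
# instead of fetching everything and filtering at serialization time.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, load_only, sessionmaker

Base = declarative_base()

class Post(Base):
    __tablename__ = 'post'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    body = Column(String)  # a large column the client excluded

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Post(title='hello', body='x' * 10000))
session.commit()
session.expunge_all()

# Only the primary key and title appear in the SELECT; body is deferred
# and is not fetched unless it is accessed later.
post = session.query(Post).options(load_only(Post.title)).first()
print(post.title)
```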
Along the lines of the aforementioned: how can I disable traversing relationships? Pre/post processing isn't going to help much here, and includes/excludes are useless for modifying the query. Modifying the database layout won't help much either, considering there will still be a relationship present that will be followed (at least one level deep).
This is sort of the same as the problem in the first bullet point above: you want to specify includes/excludes at the database level. Is that right?
If one is going to follow the relationship regardless, why only post the object ID and reference URL and not return any other data for the relationship object?
This is so that we meet the requirements of the JSON API specification.
If posts are going to have a many-to-many relationship and a "tags" array of tags, then why not just allow me to return tag.name? Otherwise I will need to make REQ * NUM_TAGS calls back to the API, or use the tag endpoint and search based on parameters of each post in the results; neither is good when the work has already been done halfway.
This is certainly a tradeoff that comes with the structure provided by the JSON API specification. Maybe what you're looking for is for Flask-Restless to treat the many-to-many relationship for tags, via an association proxy, as a list attribute of the post model instead of as a relationship. I would like to support that (see the failing test https://github.com/jfinkels/flask-restless/blob/master/tests/test_fetching.py#L1615) but I don't know enough about database configuration to be able to write the tests for it. See issue #480.
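For reference, a minimal sketch of that association-proxy approach (hypothetical `Post`/`Tag` models, not the reporter's actual schema): the many-to-many relationship is exposed as a plain list of strings on the post.

```python
# Sketch only: exposing tag names as a list attribute via association_proxy.
from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

post_tags = Table(
    'post_tags', Base.metadata,
    Column('post_id', ForeignKey('post.id'), primary_key=True),
    Column('tag_id', ForeignKey('tag.id'), primary_key=True),
)

class Tag(Base):
    __tablename__ = 'tag'
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)

class Post(Base):
    __tablename__ = 'post'
    id = Column(Integer, primary_key=True)
    tags = relationship(Tag, secondary=post_tags)
    # Expose the tags' names as a plain list attribute of the post:
    tag_names = association_proxy('tags', 'name',
                                  creator=lambda n: Tag(name=n))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

p = Post()
p.tag_names = ['red', 'blue']  # creates the Tag rows via `creator`
session.add(p)
session.commit()
print(list(p.tag_names))
```

A serializer could then emit `tag_names` like any scalar list attribute, with no relationship traversal needed.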
I would like to be able to work on applying includes/excludes at the database query level instead of at the serialization level, but it would take a fair bit of work and I don't have the time at the moment. But this is definitely a high priority issue. Let me know how difficult you think it would be to solve this.
@jfinkels, thank you for the in-depth response. I would have responded sooner, but I wanted my response to be equally useful.
The performance is atrocious because I had no idea what I was doing when I started this project, and before long it had become used by way more people than I had ever anticipated :).
Kudos to you for starting something that has become a favorite amongst people like me! :)
I would like to support that (see the failing test https://github.com/jfinkels/flask-restless/blob/master/tests/test_fetching.py#L1615)
Scalar is invoking
I think there should be some configurable logic added here to:
And that shouldn't be too difficult to do, with or without SQLAlchemy's inspector (I'm trying to think of ways to reduce overhead).
I would like to be able to work on applying includes/excludes at the database query level instead of at the serialization level, but it would take a fair bit of work and I don't have the time at the moment. But this is definitely a high priority issue.
I think performance would be a higher priority, but that's just my humble opinion. Using includes and excludes to modify a query should be relatively simple; for example, I wrote this proof of concept that takes a list of column-name strings to exclude from my query:
```python
# -*- coding: utf-8 -*-
from app import db
from app.mod_tumbleweed import tumbleweed_db


def primary_keys(obj):
    return [k.name for k in obj.__mapper__.primary_key]


def col_map(obj):
    return obj.__mapper__.c.keys()


def generate_query(obj, excludes=None, includes=None):
    pri_keys = primary_keys(obj)
    queryable_attrs = col_map(obj)

    if not excludes and not includes:
        return db.session.query(obj).limit(10)

    attrs_to_query = includes if isinstance(includes, list) else []

    for query_attr in queryable_attrs:
        in_exc = query_attr in excludes if excludes else False
        in_inc = query_attr in includes if includes else False
        in_atq = query_attr in attrs_to_query

        if in_exc and in_atq:
            raise AttributeError('{} is both excluded and included, you make sad panda cry.'.format(query_attr))

        # Keep the primary key(s) no matter what (?)
        if in_exc and query_attr not in pri_keys:
            continue

        attrs_to_query.append(query_attr)

    mmm = [(obj.__mapper__.c.get(c) if obj.__mapper__.c.get(c) is not None else obj.__mapper__._props.get(c))
           for c in attrs_to_query]

    ### A little more human-readable version of the aforementioned:
    # mmm = []
    # for c in attrs_to_query:
    #     print('c: {}'.format(c))
    #     if obj.__mapper__.c.get(c) is not None:
    #         mmm.append(obj.__mapper__.c.get(c))
    #         continue
    #     mmm.append(
    #         # obj.__mapper__.class_manager.get(c)
    #         obj.__mapper__._props.get(c)
    #     )

    # return db.session.query(*mmm).limit(10)
    return generate_results(db.session.query(*mmm).limit(10), attrs_to_query)


def generate_results(query, cols):
    for result in query:
        yield dict(zip(cols, result))


def test_excludes(test_obj, excludes=None):
    print('=> Excludes')
    for result in generate_query(test_obj, excludes):
        print(result)
    print('')
    return None


def test_normal(test_obj):
    print('=> Normal')
    for result in generate_query(test_obj):
        print(result)
    print('')
    return None


def main():
    test_obj = tumbleweed_db.TumbleweedPost
    test_normal(test_obj)
    test_excludes(test_obj, ['thumbnail', 'thumbnail_id', 'tags'])


if __name__ == '__main__':
    main()
```
(NOTE: I was planning to incorporate "includes" as well but I haven't gotten around to that yet)
The query for test_normal() was:
2016-07-07 10:20:50,267 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2016-07-07 10:20:50,268 INFO sqlalchemy.engine.base.Engine SELECT tumbleweed_post.id AS tumbleweed_post_id, tumbleweed_post.post_id AS tumbleweed_post_post_id, tumbleweed_post.blog_source AS tumbleweed_post_blog_source, tumbleweed_post.blog_found AS tumbleweed_post_blog_found, tumbleweed_post.downloaded AS tumbleweed_post_downloaded, tumbleweed_post.adult AS tumbleweed_post_adult, tumbleweed_post.file_id AS tumbleweed_post_file_id, tumbleweed_post.thumbnail_id AS tumbleweed_post_thumbnail_id, tumbleweed_post.json_id AS tumbleweed_post_json_id
FROM tumbleweed_post
LIMIT %s
2016-07-07 10:20:50,268 INFO sqlalchemy.engine.base.Engine (10,)
And the query for test_excludes() was:
2016-07-07 10:20:50,301 INFO sqlalchemy.engine.base.Engine SELECT tumbleweed_post.id AS tumbleweed_post_id, tumbleweed_post.post_id AS tumbleweed_post_post_id, tumbleweed_post.blog_source AS tumbleweed_post_blog_source, tumbleweed_post.blog_found AS tumbleweed_post_blog_found, tumbleweed_post.downloaded AS tumbleweed_post_downloaded, tumbleweed_post.adult AS tumbleweed_post_adult, tumbleweed_post.file_id AS tumbleweed_post_file_id, tumbleweed_post.json_id AS tumbleweed_post_json_id
FROM tumbleweed_post
LIMIT %s
2016-07-07 10:20:50,301 INFO sqlalchemy.engine.base.Engine (10,)
While the results, with excludes, won't necessarily be the objects themselves, I did try to preserve the primary keys so they could be referenced by key instead of trying to do any other kind of table scan (or an index scan on a non-pkey index).
As far as overall performance goes, this is really the first time I've looked under the hood of flask-restless, but a weekend and a copious supply of Red Bull (blueberry++, mmm) could yield some benefit. :)
Feel free to reach out to me via email or google chat/hangouts at amFtaWUuaXZhbm92QGdtYWlsLmNvbQo= (base64) or amFtaWUuaXZhbm92QGljbG91ZC5jb20K (base64) (both work with iMessage) and my GitHub handle is where you can find me on Freenode as well.
What I'm trying to figure out is why I'm getting 683,672 calls to get_all_inclusions() in views/base.py; line 1465 is where my problem is:
to_include = set(chain(map(self.resources_to_include, instances)))
Which happens to be the number of rows in the table:
mysql> select count(*) from tumbleweed_post;
+----------+
| count(*) |
+----------+
| 683672 |
+----------+
1 row in set (0.26 sec)
When there is default pagination, why is flask-restless querying ALL objects in my table when, by default, it returns only 10 results? What should happen is that the first 10 results are queried; then, when the next page is requested, both an offset and a limit are used.
I'm still rooting through the code to map everything out so I figured I would bring that to @jfinkels attention to see if someone who knows the code layout would have a better idea of why that's going on until I finish figuring out the code layout.
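To make the cost concrete, here is a scaled-down, pure-Python sketch of the behaviour being described (`CountingTable` and `resources_to_include` are hypothetical stand-ins, not Flask-Restless code; the sketch uses `chain.from_iterable` rather than the exact construct quoted above):

```python
# Sketch: computing inclusions over the unpaginated query touches every
# row, while paginating first touches only one page.
from itertools import chain, islice

class CountingTable:
    """Stands in for a SQLAlchemy query; counts how many rows it yields."""
    def __init__(self, n):
        self.n = n
        self.rows_read = 0
    def __iter__(self):
        for i in range(self.n):
            self.rows_read += 1
            yield {'id': i, 'related': [i]}

def resources_to_include(instance):
    return instance['related']

# What the report describes: inclusions computed over the whole query.
table = CountingTable(683)  # scaled-down stand-in for 683,672 rows
to_include = set(chain.from_iterable(map(resources_to_include, table)))
print(table.rows_read)   # 683: every row was read

# What should happen: paginate first, then compute inclusions.
table2 = CountingTable(683)
page = islice(iter(table2), 10)  # the LIMIT 10 equivalent
to_include = set(chain.from_iterable(map(resources_to_include, page)))
print(table2.rows_read)  # 10: only the current page was read
```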
The performance is atrocious because I had no idea what I was doing when I started this project, and before long it had become used by way more people than I had ever anticipated :).
:+1: Yes, kudos indeed; I've looked at a few Flask API packages and this one looked like the easiest to configure and use. I do agree with @AlmightyOatmeal that performance is a high priority. I have a few tables of about 27,000 records, and it takes around 5-8 seconds for the 10-item paginated result to be retrieved to a web browser, meanwhile the server CPU is pinned at 100% (single core) :(
I really don't know that much about python to contribute but I am learning so perhaps I will dive in one of these weekends. :)
@roemhildtg, welcome to Python :)
If you're interested, you can do what I did to profile my Flask application; take a look at: http://werkzeug.pocoo.org/docs/0.11/contrib/profiler/
There are limitations to that and it's not as detailed as something like line_profiler. I'm working on a module that can hook up to requests in a similar fashion as the aforementioned so that one can use any profiler they want (pprofile looks promising).
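As a stdlib-only illustration of the same idea (a hypothetical sketch, not werkzeug's ProfilerMiddleware itself), any WSGI app can be wrapped so that every request is profiled with cProfile:

```python
# Sketch: wrap a WSGI callable and print per-request profiling stats.
import cProfile
import io
import pstats

def profiler_middleware(app, top=10):
    """Wrap a WSGI callable; print the top N entries after each request."""
    def wrapped(environ, start_response):
        profiler = cProfile.Profile()
        response = profiler.runcall(app, environ, start_response)
        stream = io.StringIO()
        pstats.Stats(profiler, stream=stream) \
              .sort_stats('cumulative').print_stats(top)
        print(stream.getvalue())  # or hand it to your logger
        return response
    return wrapped

# Toy WSGI app to demonstrate the wrapper:
def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']

wrapped = profiler_middleware(app)
body = wrapped({}, lambda status, headers: None)
print(body)
```

Swapping `cProfile` for a line profiler such as pprofile is a matter of changing what `profiler_middleware` instantiates.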
Feel free to use my contact information from above if you want to reach me out-of-band (seeing a bug is not the best place for casual chat).
@roemhildtg, @AlmightyOatmeal
Sorry to derail this issue, but have you seen flask-debugtoolbar? It's quite nice, profile in your browser and you can see all SQL queries.
On a slightly more on-topic note, +1 @jfinkels for starting this project, it's great. I may have some time to help out with this issue.
@uSpike, I've heard of it, but the information on using it is sparse and it doesn't seem to apply if I'm not using a Jinja template at any endpoint (it's purely a RESTful API).
@jfinkels @roemhildtg I created an add-on for werkzeug that uses pprofile to profile requests line-by-line. I'm also making a large number of changes to pprofile itself to better support integration and to mimic cProfile's methods for added compatibility with the stats module. I attached the full output (~9MB) of my pprofile add-on, and I should have the pull requests for the other module's updates done this weekend.
@jfinkels, I'm thinking the problem could lie within _get_collection_helper() in base.py. I noticed that when I hit this line:
included = self.get_all_inclusions(instances)
That 'instances' is a base query on the model, so when self.get_all_inclusions(instances) is called, it executes:
to_include = set(chain(map(self.resources_to_include, instances)))
(by the way, it's kind of defeating to chain() a map() in a set())
against a query that has no .limit() applied to it; ergo, all items in the database are queried even though only 10 (by default) are supposed to be returned.
There are a number of places where objects are renamed, so it's a little difficult to follow, but I think a lot of those methods could/should be broken down into separate functions or methods, where applicable, to help clean up the code, make it easier to test, and make it far easier to read.
Due to the complexity of base.py/within _get_collection_helper(), I'm not sure when I'll hit the nail on the head.
Oh, and the driver has no default limit for the get() function.
In _get_collection_helper() in base.py, I commented out the following lines:
```python
# Include any requested resources in a compound document.
# try:
#     included = self.get_all_inclusions(instances)
# except MultipleExceptions as e:
#     # By the way we defined `get_all_inclusions()`, we are
#     # guaranteed that each of the underlying exceptions is a
#     # `SerializationException`. Thus we can use
#     # `errors_from_serialization_exception()`.
#     return errors_from_serialization_exceptions(e.exceptions, included=True)
if 'included' not in result:
    result['included'] = []
# result['included'].extend(included)
```
I get a response that takes approximately 536ms, which is certainly better than 1-5 minutes. My question is: what is the purpose of this, and why is it trying to map my entire table?
My question is: what is the purpose of this and why is it trying to map my entire table?
See http://flask-restless.readthedocs.io/en/latest/includes.html. Basically, if a user requests GET /comments?include=author, then the author of each comment will be included in the `included` element of a compound JSON API document.
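For reference, a compound document for GET /comments?include=author looks roughly like this (the ids and attribute values here are made up for illustration):

```json
{
  "data": [
    {
      "type": "comments",
      "id": "1",
      "relationships": {
        "author": {"data": {"type": "people", "id": "9"}}
      }
    }
  ],
  "included": [
    {"type": "people", "id": "9", "attributes": {"name": "example"}}
  ]
}
```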
@jfinkels but with no includes specified, there should be no reason for that logic to run. Also, it should not be scanning the entire table, especially when there is nothing more to include than what was already in the model. This is where the performance hit is coming from.
I would rather not have that functionality than cause unnecessary delays. I think there is a problem in how the logic is being applied.
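A sketch of the kind of guard this would need (the function and argument names are hypothetical, not Flask-Restless's actual signature): when no ?include= parameter was requested, return immediately without ever iterating the query.

```python
# Sketch: skip the inclusion machinery entirely when nothing is requested.
def get_all_inclusions(requested_includes, instances, resources_for):
    if not requested_includes:
        return []  # no ?include= parameter: never touch `instances`
    included = []
    for instance in instances:
        for relation in requested_includes:
            included.extend(resources_for(instance, relation))
    return included

# Demonstrate that the query is never iterated when nothing is requested.
reads = []
def fake_query():
    for i in range(1000):
        reads.append(i)
        yield i

result = get_all_inclusions([], fake_query(), lambda inst, rel: [])
print(result, len(reads))  # [] 0
```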
I tested out the code change, and while it may not be the correct solution to the problem, since it removes some of the JSON API specification functionality, it does appear that therein lies the issue. My queries on big tables went from 5 seconds to milliseconds when I commented out the section that @AlmightyOatmeal did in the pull request.
I'm having an issue similar to @AlmightyOatmeal's. I'm querying a large table (call it "posts") for a very tiny subset of records based on a many-to-many relationship with another table (call it "tags"). To be exact, I'm using this filter:
```json
{
  "name": "tags",
  "op": "any",
  "val": {
    "name": "name",
    "op": "eq",
    "val": "tag_name"
  }
}
```
From the database perspective, this restless query results in the following db query to count the results:
```sql
SELECT count(*) AS count_1
FROM posts
WHERE EXISTS (SELECT 1
  FROM posts_tag_mapping, posts_tags
  WHERE posts.id = posts_tag_mapping.post_id
    AND posts_tags.id = posts_tag_mapping.tag_id
    AND posts_tags.name = 'tag_name')
```
I haven't dug into the code to figure out what is generating this query, but changing the subquery to a join speeds things up significantly.
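For illustration, here is a self-contained SQLAlchemy sketch of both shapes (hypothetical posts/tags models mirroring the names above, run on SQLite rather than MySQL): the relationship's `.any()` produces the EXISTS subquery shown, while an explicit join over the same tables is the alternative being suggested.

```python
# Sketch: EXISTS-based vs JOIN-based counting of posts matching a tag.
from sqlalchemy import (Column, ForeignKey, Integer, String, Table,
                        create_engine, func)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

posts_tag_mapping = Table(
    'posts_tag_mapping', Base.metadata,
    Column('post_id', ForeignKey('posts.id'), primary_key=True),
    Column('tag_id', ForeignKey('posts_tags.id'), primary_key=True),
)

class Tag(Base):
    __tablename__ = 'posts_tags'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Post(Base):
    __tablename__ = 'posts'
    id = Column(Integer, primary_key=True)
    tags = relationship(Tag, secondary=posts_tag_mapping)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
tag = Tag(name='tag_name')
session.add_all([Post(tags=[tag]), Post(tags=[tag]), Post()])
session.commit()

# EXISTS-based count, like the query the reporter observed:
exists_count = session.query(func.count(Post.id)).filter(
    Post.tags.any(Tag.name == 'tag_name')).scalar()

# JOIN-based count of the same posts:
join_count = session.query(func.count(Post.id.distinct())).join(
    Post.tags).filter(Tag.name == 'tag_name').scalar()

print(exists_count, join_count)  # 2 2
```

Which form is faster depends on the database and its planner; the reporter's observation was specifically about MySQL.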
Hi @spight! Thanks for the report, but that seems to be a separate issue, unrelated to the one being discussed here. Please open a new bug with the info above along with the SQL query you expect to see when filtering on a relation.
As for the original issue (i.e. scanning the whole database for includes), I believe there should be a relatively straightforward fix for getting the included resources more efficiently, and I'd like to include it before releasing version 1.0.0b2. Is it as simple as disabling a check for included resources when none are requested?
Anything going on here? We are hitting the same issue…
same here
We migrated all codebases to Flask-REST-JSONAPI by now. It's worth the effort.
@Natureshadow, what about the performance? Can you describe the issue with Flask-Restless and how it's going now with Flask-REST-JSONAPI?
I had a look at Flask-REST-JSONAPI; it seems very nice. Although flexible, I think JSON API introduces a lot of overhead for programming the UI.
I'm running into a serious performance problem with flask-restless that is making me look at alternatives and that breaks my heart. The posts table contains almost 700k "posts" for a blog with one-to-one mappings for raw JSON blobs and a many-to-many for tags; with getting only 10 results for each page, the request is taking 1-2 minutes!
I've already been through the database thinking that was the issue, but it most certainly isn't; MySQL returns results in a second or two. Then I thought maybe it was SQLAlchemy, so I went through that with a fine-toothed comb, but that wasn't the problem either; I can use flask-restful, flask-potion, or SQLAlchemy by itself to generate the exact same result set in roughly 5 seconds or less.
The blog posts model:
The endpoint (all model endpoints are essentially the same for the time being due to testing):
The request:
And the logs:
I've tried dozens of permutations of model relationship adjustments, which have had little impact on the issue at hand. I'm really curious where "{map}" is being called from; map() is used in a number of locations within flask-restless, so I may need to do a little more performance profiling there.
But:
If someone has some ideas then I am most certainly all ears! I'm hoping to get this knocked out relatively soon! Cheers!