mozilla-services / syncserver

Run-Your-Own Firefox Sync Server
Mozilla Public License 2.0
1.87k stars · 145 forks

400 Bad Request errors during POST operations #96

Open CRCulver opened 7 years ago

CRCulver commented 7 years ago

I am running syncserver on my server and I am experiencing an error in 1.6.0 (which persists after cloning the latest git version) in which POST operations during syncing fail with a 400 "Bad Request" error, though GET requests succeed. My logs look like this:

```
- - [10/Sep/2017:00:03:57 +0000] "GET /storage/1.5/1/info/collections HTTP/1.1" 200 3879 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
- - [10/Sep/2017:00:03:58 +0000] "GET /storage/1.5/1/info/configuration HTTP/1.1" 200 461 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
- - [10/Sep/2017:00:03:59 +0000] "GET /storage/1.5/1/storage/clients?full=1&limit=1000 HTTP/1.1" 200 1545 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
- - [10/Sep/2017:00:04:07 +0000] "POST /storage/1.5/1/storage/bookmarks?batch=true HTTP/1.1" 400 336 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
- - [10/Sep/2017:00:04:39 +0000] "GET /token/1.0/sync/1.5 HTTP/1.1" 200 1058 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
- - [10/Sep/2017:00:04:50 +0000] "POST /storage/1.5/1/storage/history?batch=true HTTP/1.1" 400 336 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
```

I wondered if this was due to some problem with my DB file on the server, so I first disconnected Firefox from sync and then deleted the DB file. However, after restarting the server and signing Firefox in to sync again, the DB file is recreated by syncserver, but these POST operations continue to fail. Sync used to be working normally and I haven't touched my configuration in a long while, so I'm not sure if I broke something (or if upgrading to the latest Debian stable did). How can better debugging be gained from syncserver, so that I can understand why these POST operations fail, but not any of the GET operations?
rfk commented 7 years ago

Hrm, this:

POST /storage/1.5/1/storage/bookmarks?batch=true

Looks like it's trying to use the new "batch upload" API, but that's not supposed to be enabled in the self-hosted version yet. It should work, I'm just surprised to see it on. Can you try adding a [storage] section to your .ini file, and including batch_upload_enabled = False to it?
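For reference, the resulting addition to syncserver.ini would look something like this (a sketch based on the option name above):

```ini
[storage]
batch_upload_enabled = False
```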

You may be able to get useful debugging info from your Firefox sync logs, by going to about:sync-log. They often contain more details about e.g. the contents of a specific failing request.

How can better debugging be gained from syncserver

Unfortunately this can be very dependent on how you're running the server. The application has some calls to logging.info, logging.error, and so on, but getting them to output to stderr, and then having stderr go somewhere useful, depends a lot on the environment. How are you running the server?

CRCulver commented 7 years ago

Looks like it's trying to use the new "batch upload" API ... Can you try adding a [storage] section to your .ini file, and including batch_upload_enabled = False to it?

I added that section to my .ini file, but there is no change in behaviour: the POST request is still made by the client with a batch=true argument and it fails with a 400 Bad Request:

```
- - [10/Sep/2017:22:59:35 +0000] "POST /storage/1.5/1/storage/bookmarks?batch=true HTTP/1.1" 400 336 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
- - [10/Sep/2017:22:59:36 +0000] "POST /storage/1.5/1/storage/history?batch=true HTTP/1.1" 400 488 "-" "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop"
```

> How are you running the server?

WSGI through Apache.

With regard to about:sync-log, I don't see any details readily useful to me in debugging this problem:

```
1505084540472 Sync.Telemetry TRACE observed weave:engine:sync:start bookmarks
1505084540511 Sync.Engine.Bookmarks INFO 613 outgoing items pre-reconciliation
1505084540512 Sync.Engine.Bookmarks INFO Records: 0 applied, 0 successfully, 0 failed to apply, 0 newly failed to apply, 0 reconciled.
1505084540513 Sync.Telemetry TRACE observed weave:engine:sync:applied bookmarks
1505084540513 Sync.Engine.Bookmarks DEBUG Deleted pending items: []
1505084541291 Sync.Tracker.Clients DEBUG Saving changed IDs to clients
1505084541511 Sync.Tracker.Bookmarks DEBUG Saving changed IDs to bookmarks
1505084541896 Sync.Engine.Bookmarks INFO Posting 100 records of 66620 bytes with batch=true
1505084541897 Sync.BrowserIDManager DEBUG _ensureValidToken already has one
1505084541899 Sync.Collection DEBUG POST Length: 66620
1505084542902 Sync.Collection DEBUG mesg: POST fail 400 https://engine.christopherculver.com/storage/1.5/1/storage/bookmarks?batch=true
1505084542903 Sync.Collection DEBUG POST fail 400 https://engine.christopherculver.com/storage/1.5/1/storage/bookmarks?batch=true
1505084542903 Sync.Collection WARN POST request to https://engine.christopherculver.com/storage/1.5/1/storage/bookmarks?batch=true failed with status 400
```
rfk commented 7 years ago

WSGI through Apache.

I don't have a lot of experience with this setup, but IIUC the way to enable more detailed logging is as follows.

First, I assume you're using the provided syncserver.wsgi file, which contains a hook to configure Python's logging module with sections read from the syncserver.ini file.

So you can add options to the .ini file to enable more verbose logging, in the format documented in [1].

If you can configure it to log at DEBUG level and to write to stdout or stderr, then the log output should show up in whatever file you've configured for logging in your mod_wsgi config. Hopefully there will be some clues in there.

[1] https://docs.python.org/2/library/logging.config.html#logging-config-fileformat
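For illustration, a minimal set of logging sections in that fileConfig format might look like the following. The handler and formatter details here are a sketch, not the actual pipeline config:

```ini
[loggers]
keys = root

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = DEBUG
handlers = console

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = DEBUG
formatter = generic

[formatter_generic]
format = %(asctime)s [%(levelname)s] %(name)s: %(message)s
```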

CRCulver commented 7 years ago

Thank you for the detailed response. I enabled logging at the DEBUG level as you indicated. Unfortunately, it doesn't provide me with any information besides some SQL activity, and it gives no clue as to why the POST request should be answered with a 400 Bad Request:

```
[Mon Sep 11 04:51:19.912654 2017] [wsgi:error] [pid 30669] [remote :4207] Created new connection <sqlite3.Connection object at 0xdf20e818>
[Mon Sep 11 04:51:19.912748 2017] [wsgi:error] [pid 30669] [remote :4207] Connection <sqlite3.Connection object at 0xdf20e818> checked out from pool
[Mon Sep 11 04:51:19.913851 2017] [wsgi:error] [pid 30669] [remote :4207] Connection <sqlite3.Connection object at 0xdf20e818> being returned to pool
[Mon Sep 11 04:51:19.913928 2017] [wsgi:error] [pid 30669] [remote :4207] Connection <sqlite3.Connection object at 0xdf20e818> rollback-on-return
[Mon Sep 11 04:51:19.914369 2017] [wsgi:error] [pid 30669] [remote :4207] {"code": 200, "request_time": 0.003838777542114258, "remoteAddressChain": [""], "agent": "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop", "fxa_uid": "663b28813419da737fe26a21227739c0", "syncstorage.storage.sql.pool.get": 4.696846008300781e-05, "syncstorage.storage.sql.db.execute": 0.0008339881896972656, "path": "https://engine.christopherculver.com/storage/1.5/1/info/configuration", "method": "GET", "device_id": "87f798d0ca9342cc226a0e707f5d873d"}
[Mon Sep 11 04:51:20.139653 2017] [wsgi:error] [pid 30668] [remote :4207] Connection <sqlite3.Connection object at 0xdf20e818> checked out from pool
[Mon Sep 11 04:51:20.142489 2017] [wsgi:error] [pid 30668] [remote :4207] Connection <sqlite3.Connection object at 0xdf20e818> being returned to pool
[Mon Sep 11 04:51:20.142588 2017] [wsgi:error] [pid 30668] [remote :4207] Connection <sqlite3.Connection object at 0xdf20e818> rollback-on-return
[Mon Sep 11 04:51:20.143118 2017] [wsgi:error] [pid 30668] [remote :4207] {"code": 200, "request_time": 0.004719972610473633, "remoteAddressChain": [""], "agent": "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop", "fxa_uid": "663b28813419da737fe26a21227739c0", "syncstorage.storage.sql.pool.get": 4.291534423828125e-05, "syncstorage.storage.sql.db.execute": 0.001050710678100586, "path": "https://engine.christopherculver.com/storage/1.5/1/storage/clients", "method": "GET", "device_id": "87f798d0ca9342cc226a0e707f5d873d"}
[Mon Sep 11 04:51:21.511539 2017] [wsgi:error] [pid 30669] [remote :4207] {"code": 400, "request_time": 0.16726303100585938, "remoteAddressChain": [""], "agent": "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop", "fxa_uid": "663b28813419da737fe26a21227739c0", "path": "https://engine.christopherculver.com/storage/1.5/1/storage/bookmarks", "method": "POST", "device_id": "87f798d0ca9342cc226a0e707f5d873d"}
[Mon Sep 11 04:51:23.266140 2017] [wsgi:error] [pid 30668] [remote :14376] {"code": 400, "request_time": 0.38861799240112305, "remoteAddressChain": [""], "agent": "Firefox/52.3.0 (Linux x86_64) FxSync/1.54.0.20170808191219.desktop", "fxa_uid": "663b28813419da737fe26a21227739c0", "path": "https://engine.christopherculver.com/storage/1.5/1/storage/history", "method": "POST", "device_id": "87f798d0ca9342cc226a0e707f5d873d"}
```

rfk commented 7 years ago

Unfortunately I can't think of a nice way to get more information out of the system here. I would probably fall back to the old "add a bunch of print statements" approach at this point.

If you're interested in going that route, the likely sources for this failure are the extraction and validation of batch-related parameters in this file:

https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/views/validators.py

And any errors thrown in the route handler here:

https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/views/__init__.py#L362
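As a sketch of that approach, one could wrap the JSON-parsing step with extra logging. The helper below (parse_bsos_debug) is hypothetical, for illustration only, and is not code from the repo:

```python
import json
import logging

logger = logging.getLogger("syncstorage.debug")


def parse_bsos_debug(body):
    """Parse a request body as JSON, logging details when it fails.

    Hypothetical debugging helper, not part of the actual codebase.
    """
    try:
        return json.loads(body)
    except ValueError:
        # Log the length and a prefix of the body; a truncated body
        # would show up here as JSON that cuts off mid-record.
        logger.error("Invalid JSON in request body (%d bytes): %r",
                     len(body), body[:200])
        raise
```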

I'm sorry to have to suggest that, but I'm out of other ideas :-(

virgoparna commented 7 years ago

I'm having the same problem with history syncing: the log shows the batch=true param added to the request. Just out of interest: is the server somehow supposed to advertise that it supports batching? Otherwise, how would Firefox know not to use it? Unfortunately, the browser console does not show response content for requests, so checking the actual data would take some doing.

rfk commented 7 years ago

is server somehow supposed to advertise, that it supports batch?

Yes, there's a /info/configuration endpoint that clients can fetch to determine various details of the server setup, including whether batch uploads are enabled.

virgoparna commented 7 years ago

Ok, the /info/configuration response was:

```json
{"max_request_bytes": 1048576, "max_record_payload_bytes": 262144}
```

Could it be that Firefox switches to batch sending if the request would be bigger than those limits? Is there any way to change them in the configuration?

virgoparna commented 7 years ago

From the Firefox Sync documentation on the batch parameter:

> batch: indicates that uploads should be batched together into a single conceptual update. To begin a new batch pass the string 'true'. To add more items to an existing batch pass a previously-obtained batch identifier. This parameter is ignored by servers that do not support batching.

But in syncserver validators.py:

```python
# If batch uploads are not enabled in the config then
# we want to:
#  * silently ignore attempts to start a new batch, which
#    will cause clients to fall back to non-batch mode.
#  * error out on attempts to continue an existing batch,
#    since we can't possibly do what the client expects.
```

It seems to me that there is a conflict between those two things, unless I'm completely misunderstanding how this is supposed to work.
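To make the two cases concrete, the behaviour described in that comment can be sketched as follows. This is a simplification with illustrative names, not the actual validator code:

```python
def classify_batch_param(batch_param, batch_upload_enabled):
    """Decide how to treat the ?batch= query parameter.

    Rough sketch of the intent described above: with batching
    disabled, "batch=true" should be silently ignored (the client
    falls back to non-batch mode), while continuing an existing
    batch id must be rejected.
    """
    if batch_param is None:
        return "no-batch"      # plain upload, no batching requested
    if batch_upload_enabled:
        return "batch"         # batching supported and requested
    if batch_param == "true":
        return "ignore"        # new batch: ignore, client falls back
    return "error-400"         # unknown batch id: cannot honour it
```

On this reading, a plain batch=true request should still get a 200 from a server with batching disabled, which is why the 400 responses ultimately pointed to a different root cause.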

rfk commented 7 years ago

On reflection, I think that the appearance of batch=true in the requests here is OK. It's Firefox speculatively sending "batch=true" when it uploads, and then it will determine whether or not batching is available based on the server's response. If the server response indicates that it ignored batch=true then the client will just do a non-batched upload.

So I think the batch=true in the requests is probably a red herring here after all.

urigg commented 6 years ago

Having the same issue, and it now blocks history and bookmarks sync for me. :(

I've found this hint: "If the server does not support batching, it will ignore the batch parameter and return a “200 OK” response without a batch identifier." Source: https://moz-services-docs.readthedocs.io/en/latest/storage/apis-1.5.html

So, even with batch=true, the server should not fail with a 400 Bad Request.

urigg commented 6 years ago

Found it. Sorta.

Directly below the 400 error, there's an "Uploading records failed: 6". That's from syncstorage/tweens.py, pick_weave_error_code(). Originally, it's an "Invalid JSON in request body" error from syncstorage/views/validators.py, parse_multiple_bsos(), because json_loads(request.body) throws a ValueError. I have no idea what the contents of the request body are at that time (because I have no idea what I'm doing here anyway).

However, that led me to the idea that somehow larger POSTs get crippled. I then reduced the maximum POST size (2 MB by default), and found that the magic limit for me is somewhere between 6000 and 10000 bytes. The actual length is in the log file, as "POST Length".

With these settings added to syncserver.ini, syncing was working again:

```ini
[storage]
max_request_bytes = 6000
max_post_bytes = 6000
```

The only thing I'm sure of is that it's not the Apache config option LimitRequestBody; that one causes a 413 error.

virgoparna commented 6 years ago

Those limits fixed it for me too.

rfk commented 6 years ago

However, that led me to the idea that somehow larger POSTs get crippled. I then reduced the maximum POST size (2 MB by default), and found that the magic limit for me is somewhere between 6000 and 10000 bytes. The actual length is in the log file, as "POST Length".

Aha, interesting, thanks for digging into this @urigg! I wonder if something in the stack is truncating incoming request bodies when they're too long, which produces invalid JSON, which triggers the 400 error.

If I had to guess, I'd guess that the magic limit here is 8kB. What do you think about us configuring default values of max_request_bytes and max_post_bytes in this repo, so that they fall just below that amount?
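The truncation theory is easy to demonstrate in isolation: cutting a JSON body at a fixed byte limit leaves invalid JSON, matching the "Invalid JSON in request body" error found above. The record sizes and the 8192-byte limit below are illustrative, not measured values:

```python
import json

# Stand-in for a batch-upload body: a JSON list of sync records.
records = [{"id": "bso%d" % i, "payload": "x" * 100} for i in range(200)]
body = json.dumps(records)

# The full body parses fine.
assert json.loads(body) == records

# Truncating it at a hypothetical 8 kB buffer limit cuts a record
# in half, so the result is no longer valid JSON; a server parsing
# it would reject the request with a 400.
truncated = body[:8192]
try:
    json.loads(truncated)
except ValueError:
    print("truncated body is invalid JSON")
```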

rfk commented 6 years ago

Different software stack, but very similar-sounding issue reported here:

https://stackoverflow.com/questions/48265984/request-body-truncated-at-8k

rfk commented 6 years ago

Also, are you using mod_default or similar in your apache config? I see a couple of old reports that this can produce truncated request bodies in various combinations; we might be facing something similar here.

urigg commented 6 years ago

I'm not sure how many installations actually have this issue, so I have no idea whether it's worth changing the defaults. Plus, I don't know whether it's max_request_bytes or max_post_bytes that is helping here. As for mod_default, I don't even have a clue what that is, and I couldn't find anything about it. I'm using Apache on Debian, with mod_wsgi.

Is there a way to include scenarios like this in automated tests? Right now I probably can't even reproduce it any more, now that the blocking chunks were finally uploaded.

As for the truncated body: I tried to dump out what's in the body, but always got a 500 error, so maybe there wasn't even anything in there. One time I got an exception from deep inside the request handling that the body was X bytes too short, with X being the same as the size in the log files.

rfk commented 6 years ago

As for mod_default, I don't even have a clue what that is, and couldn't find anything about it

Sorry, that was a weird typo on my part - I meant mod_deflate

virgoparna commented 6 years ago

I have mod_deflate enabled on that server.

urigg commented 6 years ago

No deflate for me.

About the request body: I'm not sure, but one call stack gave me the impression that it is some kind of lazy placeholder. Could it be that the code runs before a huge request has been completely received? Maybe there's a bug triggered because the request body is still incomplete. (Just speculating.)

livanh commented 6 years ago

I think I'm having the same issue, except that setting those limits seems to have no effect. The log says POST requests are still rather large: error-sync-1533569154562.txt. I have the same setup as @CRCulver, a self-hosted server behind Apache with WSGI. Interestingly, on the same server, Firefox Mobile syncs just fine (it only has 3 default bookmarks as of now). Any suggestions?

livanh commented 6 years ago

Today I tried again and it worked, but I don't know which of the changes did the trick:

In the end, the workaround works, but the root cause remains unknown.

m8mble commented 5 years ago

Similar problem here. I was able to partially mitigate it using

```ini
max_request_bytes = 8192
max_post_bytes = 8192
```

as proposed before.

But this doesn't resolve all errors during syncing. Now I'm running into the following client messages:

```
1549472918635   Sync.Engine.Prefs   WARN    Failed to enqueue record "e2VjODAzMGY3LWMyMGEtNDY0Zi05YjBlLTEzYTNhOWU5NzM4NH0=" (aborting): Error: Single record too large to submit to server (resource://services-sync/record.js:993:40) JS Stack trace: enqueue@record.js:993:40 _uploadOutgoing@engines.js:1759:43
1549472918636   Sync.Status DEBUG   Status for engine prefs: error.engine.reason.unknown_fail
1549472918636   Sync.Status DEBUG   Status.service: success.status_ok => error.sync.failed_partial
1549472918636   Sync.ErrorHandler   DEBUG   prefs failed: Error: Single record too large to submit to server (resource://services-sync/record.js:993:40) JS Stack trace: enqueue@record.js:993:40 _uploadOutgoing@engines.js:1759:43
```

Any suggestions on an improved workaround for the original issue?

rfk commented 5 years ago

1549472918636 Sync.Status DEBUG Status for engine prefs: error.engine.reason.unknown_fail

I'm surprised to see a "record too large" error for the prefs engine, but perhaps you've got an addon or something that's storing a large amount of data in synced prefs.

Unfortunately I think the bug here is upstream of syncserver, in whatever software you're using to proxy incoming requests. You could try connecting directly to the python server to confirm whether that fixes the issue, but I don't have any suggestions for how to tweak the rest of your server stack to avoid it :-(