Closed rnewman closed 9 years ago
Hi, this is probably due to nginx not allowing enough time for the client to upload its records.
I believe this was due to the request content being too large. Do you know exactly which request is causing this?
503 for a single record, and also for downloading it.
03-05 15:58:38.176 I/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: Uploading new record: {"is_article":false,"added_on":1425584258166,"resolved_title":"A study of twins shows that autism is largely genetic","added_by":"Fennec rnewman on Nexus 7","read_position":0,"url":"http:\/\/loonylabs.org\/2015\/03\/04\/autism-spectrum-disorder-twins-genetics\/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+ResearchBloggingNeuroscienceEnglish+(Research+Blogging+-+English+-+Neuroscience)","title":"A study of twins shows that autism is largely genetic","word_count":0,"excerpt":"In the fight against misinformation about autism it seems science is starting to come out on top, finally. A new study hopes to add to the recent advancements made in the understanding of autism, which finds that a substantial genetic and moderate environmental influences were associated with risk of autism spectrum disorder (ASD) and broader autism\u2026","favorite":false,"resolved_url":"http:\/\/loonylabs.org\/2015\/03\/04\/autism-spectrum-disorder-twins-genetics\/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+ResearchBloggingNeuroscienceEnglish+(Research+Blogging+-+English+-+Neuroscience)","stored_on":null,"_id":5,"archived":false,"unread":true}
03-05 15:58:38.177 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: HTTP POST https://readinglist.dev.mozaws.net/v1/articles
03-05 15:58:38.177 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: Added auth header.
03-05 15:58:38.182 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: I/O exception returned from execute.
03-05 15:58:38.182 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: Retrying request...
03-05 15:58:38.674 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: Response: HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
03-05 15:58:38.674 W/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: Upload got failure response HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
03-05 15:58:38.677 D/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: No response body.
03-05 15:58:38.677 D/FxReadingList(24758): fennec_rnewman :: ReadingListSynchronizer :: New items uploaded. Flushing resultant changes.
03-05 15:58:38.679 D/FxReadingList(24758): fennec_rnewman :: ReadingListSyncAdapter :: Step: onNewItemUploadComplete
03-05 15:58:38.680 I/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: Getting all records from https://readinglist.dev.mozaws.net/v1/articles?_since=1425582942146
03-05 15:58:38.680 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: HTTP GET https://readinglist.dev.mozaws.net/v1/articles?_since=1425582942146
03-05 15:58:38.680 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: Added auth header.
03-05 15:58:38.728 D/FxReadingList(24758): fennec_rnewman :: BaseResource :: Response: HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
03-05 15:58:38.728 D/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: Got non-success record response 503
03-05 15:58:38.729 D/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: No response body.
03-05 15:58:38.729 D/FxReadingList(24758): fennec_rnewman :: ReadingListClient :: No response body.
03-05 15:58:38.731 E/FxReadingList(24758): fennec_rnewman :: ReadingListSynchronizer :: Download failed. since = 1425582942146. Response: 503
504 or 503 ? Is it two different issues?
Two. I was getting a 504 (with a timeout) for an hour or so, then by the time I'd added more logging after going to the gym, it had turned into an instant 503.
It seems to be alive again.
@ametaireau, can you add @jrgm's public key to that server so we can provide "extended support hours" for that server?
https://github.com/mozilla/identity-pubkeys/blob/master/jrgm.pub
Yeah, I've reset nginx in there. Not sure what was actually happening; I'll be looking at the logs there tomorrow.
@ckarlof thanks, I've added @jrgm's public key there.
You know where to connect?
@ametaireau is it ec2-54-149-21-166.us-west-2.compute.amazonaws.com?
@ametaireau what user name does he need to log into?
In my .ssh/config:
Host loop-dev
HostName ec2-54-68-145-165.us-west-2.compute.amazonaws.com
User ec2-user
Thanks!
Github chat ftw.
I'm leaving this issue open until we find out what's going on here. I believe the 504 caused the 503 because it somehow killed our app, but I'm not sure. Will check tomorrow.
Actually, that SSH config isn't the right one; it's for loop-dev, not readinglist-dev.
The right one:
Host readinglist-dev
User ubuntu
HostName 54.149.21.166
Now I'm back to a 504.
03-05 16:23:53.251 I/FxReadingList(26251): fennec_rnewman :: ReadingListSyncAdapter :: Reading list sync done.
03-05 16:23:53.399 D/FxReadingList(26251): fennec_rnewman :: BaseResource :: Response: HTTP/1.1 504 Gateway Timeout
03-05 16:23:53.399 W/FxReadingList(26251): fennec_rnewman :: ReadingListClient :: Upload got failure response HTTP/1.1 504 Gateway Timeout
03-05 16:23:53.455 D/FxReadingList(26251): fennec_rnewman :: ReadingListClient :: No response body.
Looks like a 60-second timeout.
What happened is that we only had one circus worker at the time, so if a request is too expensive it hangs the worker until it finishes. If the ELB heartbeat comes in during that time, it gets no response, and the ELB takes the server out of rotation as non-responding. This means we really need to start scaling the dev server.
@rnewman could you tell us how many batch requests you are sending?
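To illustrate the single-worker failure mode described above, here is a minimal sketch of a FIFO queue simulation. It is not the actual circus or ELB behaviour: the function names, the 0.01-second heartbeat cost, and the 5-second heartbeat timeout are all assumptions made up for the example.

```python
import heapq


def completion_times(jobs, workers=1):
    """Return the finish time of each job, served FIFO by `workers` workers.

    jobs: list of (arrival_time, duration) tuples, sorted by arrival time.
    """
    free_at = [0.0] * workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    finished = []
    for arrival, duration in jobs:
        # A job starts when it has arrived AND the earliest worker is free.
        start = max(arrival, heapq.heappop(free_at))
        finish = start + duration
        heapq.heappush(free_at, finish)
        finished.append(finish)
    return finished


def heartbeat_ok(slow_request_seconds, heartbeat_at, heartbeat_timeout, workers=1):
    """True if a heartbeat sent during a slow request is answered in time.

    All values are hypothetical; the heartbeat itself is assumed to take 0.01 s.
    """
    jobs = [(0.0, slow_request_seconds), (heartbeat_at, 0.01)]
    heartbeat_finish = completion_times(jobs, workers)[1]
    return heartbeat_finish - heartbeat_at <= heartbeat_timeout


# One worker: the heartbeat queues behind the slow request and times out.
print(heartbeat_ok(60, heartbeat_at=10, heartbeat_timeout=5, workers=1))
# Two workers: the second worker answers the heartbeat immediately.
print(heartbeat_ok(60, heartbeat_at=10, heartbeat_timeout=5, workers=2))
```

With a single worker, any request that outlasts the health-check timeout is enough for the load balancer to consider the node down, which would match an instant 503 following a run of 504s.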
No batch requests. One HTTP request at a time, uploading a single record via POST. See my second comment.
Note that the Android client doesn't use batching at all. Stefan's does.
OK, I have tracked down the problem. The readinglist dev box is really, really small (1 processor, with the server, nginx, and the database all running on it). I will take some time to deploy a bigger instance.
We have redeployed everything and have been working to improve the performance of each node. You shouldn't have any more problems on the production server.
Closing for now. It will get even better with the 1.4.x release.
Just saw this on device, so I don't have a req/resp snapshot for you.
504, no useful body. Should be a bunch in your logs. The device is still on, so will keep reproing for a couple of hours!