toddanglin / PublishToDev

Chromium browser extension and Azure Functions for the PublishTo.Dev service

Hitting consistent "Failed" status on scheduled posts #2

Open pkfrank opened 4 years ago

pkfrank commented 4 years ago

Hi @toddanglin, Peter from DEV here. Thank you for this amazing tool. It might surprise you, but we actually make use of PublishTo.dev for recurring posts such as our Challenge Threads.

We've been running into a "Failed" status on some of our scheduled posts. Screenshot below.

Can we possibly help out in any way to troubleshoot the issue with you?

Screenshot: example of failed status

toddanglin commented 4 years ago

Hi @pkfrank! Sorry for the slow follow-up. Busy times. Very happy to hear you've found this utility helpful. :)

Sorry to see you're hitting these sporadic errors. Let me investigate and see if I can isolate what's happening in these cases. Are you aware of any changes to the DEV API? Not likely the problem since some days are still working, but good to rule out.

pkfrank commented 4 years ago

Hey @toddanglin I chatted with a few members of the DEV Team and we think it might be related to rate limiting on our API. Is there any way you can tell us which errors the tool is receiving? That will help us continue looking into this from our side.

toddanglin commented 4 years ago

Okay. Taking a closer look at this now that Build is done.

The errors do not seem to be related to rate limiting. The posts that fail to publish consistently have this error:

FetchError: invalid json response body at https://dev.to/api/articles/{id} reason: Unexpected token R in JSON at position 0

Unfortunately, the logs do not capture exactly what the response body was before the parsing error. I will update the logging to capture that in the future.

In the meantime...

It's not immediately clear what is unique about the posts that are failing. One commonality is that all of the failed posts were created with the v1 editor (so we're trying to update the markdown instead of directly manipulating the JSON). None of the v2 editor posts seem to be failing.

I will take a closer look at the posts you included in the screenshot and see if any differences between the successes and failures emerge.

toddanglin commented 4 years ago

On further consideration, it could be rate limiting... but unfortunately, I was not handling cases where responses from the DevTo API did not return valid JSON, so I was not logging the plain-text REST error messages. I've added that logging now so I can capture them in the future.
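For reference, the new handling looks roughly like this (simplified sketch, not the exact code in this repo; the helper name is illustrative). The idea is to read the body as text first and only then attempt the JSON parse, so a plain-text error body gets logged instead of surfacing only as a parse failure:

```typescript
import fetch from "node-fetch";

// Illustrative helper: read the response body as text first so a non-JSON
// reply from the DEV API can be logged verbatim before the parse fails.
async function putJsonWithLogging(url: string, apiKey: string, payload: unknown): Promise<any> {
    const res = await fetch(url, {
        method: "PUT",
        headers: { "api-key": apiKey, "Content-Type": "application/json" },
        body: JSON.stringify(payload)
    });

    const raw = await res.text();
    try {
        return JSON.parse(raw);
    } catch {
        // Capture the status code and the plain-text body for later diagnosis.
        console.error(`DEV API returned non-JSON (HTTP ${res.status}): ${raw}`);
        throw new Error(`Unexpected response from ${url}: ${raw}`);
    }
}
```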

Since this utility uses the user's API key to publish, the rate limiting would be more likely if you knew the API keys in question were used heavily. Do you know if the API keys you're using are subject to the rate limiting?

Further, do you know what HTTP error message text is returned when rate limiting (HTTP 429) is hit? If it starts with "R", that increases the odds that's what was happening here.

toddanglin commented 4 years ago

I think I answered my own question. In a quick test tonight, I hit the HTTP 429 limit, which returns the error text: Retry later. That's likely what was happening with your failed posts.

That said, what is being monitored for rate limiting? I would have assumed API key, but it's very unlikely my API key was generating more than 150 requests per 30 seconds. Is there something else that could cause a 429?

In the meantime, Azure Durable Functions provides the ability to automatically retry an operation after a delay if it fails. I will try adding that to this utility to see if it can be more resilient to intermittent 429 errors.
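Something along these lines with the durable-functions library (the activity name and retry values are placeholders, not the final implementation):

```typescript
import * as df from "durable-functions";

const orchestrator = df.orchestrator(function* (context) {
    // Placeholder policy: wait 5 seconds before the first retry, try up to
    // 3 times. "PublishPost" is an assumed activity name for the publish step.
    const retryOptions = new df.RetryOptions(5000, 3);

    // If the activity throws (e.g. on an intermittent 429), the framework
    // re-runs it after the delay instead of failing the whole orchestration.
    yield context.df.callActivityWithRetry("PublishPost", retryOptions, context.df.getInput());
});

export default orchestrator;
```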

toddanglin commented 4 years ago

Okay.

Some more testing tonight, and it seems the DevTo API is "locking" API keys once they become rate limited. While trying to publish a test post with an old API key, I got 429 errors every time, even with multiple retries over several minutes. When I tried scheduling the same post with a fresh API key, it published as scheduled on the first attempt.

@pkfrank Can you share more about how the API rate limiting/locking works? I agree that's likely the culprit here, but it seems there may be a problem "releasing" an API key after it gets locked.

mstruve commented 4 years ago

We have a couple of different approaches to rate limiting, all of which can be found in our DEV codebase.

  1. We limit write API requests by IP address to 1 request per second.

  2. We limit the rate at which users can create articles. This is configurable within the application, and we currently have it set to 25 articles in 30 seconds for dev.to.

  3. We limit the rate at which users can update articles. This is also configurable within the application, and we currently have it set to 30 updates in 30 seconds for dev.to.

Hope that helps!

toddanglin commented 4 years ago

Thanks, Molly! That definitely helps.

Based on this, I think the problem could be the first limit, for a couple reasons:

  1. Obviously, all publishing actions for PublishToDev are coming from the same Azure IP (even though different DevTo API keys are being used). It's not likely that this utility is being used enough now to trigger this regularly, but it could happen sporadically if multiple people are trying to publish posts at the same time. I wonder if this could be either A) relaxed, B) whitelist specific IPs, or C) check to see if the API keys are different and make it a combination of API key + IP address?

  2. More likely to trigger this limit is the behavior this utility uses today to deal with posts created with the DevTo "v1" editor. When the utility attempts to publish, it first tries to publish by directly setting a post's JSON published attribute to true. After this, the utility checks to see if the post shows a published_at time. If the time is still null, it means the post was created with the "v1" editor and to publish we must update the post markdown instead. So a second attempt to publish is made, this time by updating the body_markdown. If these two write operations happen in fast enough succession, it would likely trigger this 1 second limit. (A rough sketch of this flow is included at the end of this comment.)

This would generally explain why we mostly see these failures today with posts created with the "v1" editor. And it would explain why they sometimes succeed (when the time between the two publishing attempts exceeds 1 second).

Ideally, setting the published attribute to true would publish a post regardless of the DevTo editor used. Absent that, is there another reliable way to detect if a post was created with the v1 editor?
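For reference, the two-step flow described in point 2 boils down to something like this (simplified sketch; the front matter rewrite is an approximation, not the exact code in this tool):

```typescript
import fetch from "node-fetch";

// Thin wrapper around the DEV articles update endpoint (PUT /api/articles/{id}).
async function putArticle(id: number, apiKey: string, payload: unknown): Promise<any> {
    const res = await fetch(`https://dev.to/api/articles/${id}`, {
        method: "PUT",
        headers: { "api-key": apiKey, "Content-Type": "application/json" },
        body: JSON.stringify(payload)
    });
    return res.json();
}

async function publishArticle(id: number, apiKey: string) {
    // Attempt 1: set the JSON `published` attribute directly (works for v2 editor posts).
    const first = await putArticle(id, apiKey, { article: { published: true } });

    // A v2 editor post reports a published_at timestamp after the first write.
    if (first.published_at) {
        return first;
    }

    // A v1 editor post ignores the attribute, so make a second write that
    // flips the `published` flag in the body_markdown front matter instead.
    const markdown = String(first.body_markdown).replace(/published:\s*false/, "published: true");
    return putArticle(id, apiKey, { article: { body_markdown: markdown } });
}
```

Without a delay between those two writes, both can easily land inside the same 1-second window.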

mstruve commented 4 years ago

I wonder if this could be either A) relaxed, B) whitelist specific IPs, or C) check to see if the API keys are different and make it a combination of API key + IP address?

I can do C) and add the API key to the throttle key we use. In addition to that, I would suggest adding retry logic to your flow to retry the request in the event of a 429 response.

If these two write operations happen in fast enough succession, it would likely trigger this 1 second limit.

The 1 request/1 second limit is only present on the API, not on our editor through the app, so this shouldn't be an issue for any of the editors.

toddanglin commented 4 years ago

Cool. Adding the API key to the throttle key will help avoid hard-to-predict race conditions.

There is retry logic in the flow today, but each retry runs the same flow of trying to update the JSON published attribute first and then the body_markdown front matter if necessary. That could clearly result in multiple writes in less than 1 second if there is minimal network latency. To further work around this 1 request/1 second limit, I will investigate a couple of solutions today:

  1. I will re-review the article JSON payload to see if it's possible to detect posts created with the v1 editor before attempting to publish.

  2. If there is no reliable way to detect the v1 editor, I will see if it's possible to add a 1 second delay in the code execution between the two publish attempts to avoid the 429 limit (rough sketch below).
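Here's a rough sketch of option 2 using a durable timer (activity names are placeholders; the actual split of work into activities may differ):

```typescript
import * as df from "durable-functions";
import * as moment from "moment";

const orchestrator = df.orchestrator(function* (context) {
    const input = context.df.getInput();

    // First attempt: set the JSON `published` attribute.
    const article: any = yield context.df.callActivity("TryPublishJson", input);

    // No published_at means this is a v1 editor post, so wait ~1 second
    // before the markdown fall-back. A durable timer is used instead of
    // setTimeout so the delay survives orchestration replays.
    if (!article.published_at) {
        const deadline = moment.utc(context.df.currentUtcDateTime).add(1, "second").toDate();
        yield context.df.createTimer(deadline);

        yield context.df.callActivity("PublishViaMarkdown", input);
    }
});

export default orchestrator;
```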

Thanks again for your input.

mstruve commented 4 years ago

As far as I can tell, the editor version is not something we track at the article level.

toddanglin commented 4 years ago

I think I could detect posts created by the v1 editor by inspecting the body_markdown value. Posts created with the v1 editor should have a value here with front matter, whereas posts with the v2 editor do not.

Reviewing the code flow for this tool, I realize that the reason I don't do this today is that I don't explicitly request an article's content before publishing. The first time I get an article's full details is in the response to the PUT used to update the published property. I could add an additional API request to fetch an article's details ahead of time, since that read is not subject to the same rate limit. 🤔
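The check itself could be as simple as this (rough sketch; treating a leading front matter block as the v1 marker is my assumption based on this thread, not documented API behavior):

```typescript
// A v1 editor post stores its metadata as a "---"-delimited front matter
// block at the top of body_markdown; a v2 editor post does not.
function isV1EditorPost(bodyMarkdown: string | null | undefined): boolean {
    if (!bodyMarkdown) {
        return false;
    }
    return /^---\s*\n[\s\S]*?\n---/.test(bodyMarkdown.trimStart());
}
```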

I'll investigate today. I think I've got what I need to find a solution (one way or another).