lemurheavy / coveralls-public

The public issue tracker for coveralls.io
http://coveralls.io

Coveralls outage ? #1754

Closed jrfnl closed 2 months ago

jrfnl commented 3 months ago

I've just seen the same thing (not) happening in two different repos (both public).

Here are two example builds from different repos where I see the same pattern happening:

jrfnl commented 3 months ago

P.S.: I checked the status page before I wrote the above up and according to that, all should be fine....

image

jeremybanka commented 3 months ago

Can confirm I'm having this issue today also. Recent jobs are stuck in a "Pending Completion" state but there are no API errors or reported outages.

elchininet commented 3 months ago

I had the same issue, but in the end the coverage was reported almost ten hours later:

https://github.com/elchininet/custom-sidebar/pull/81#issuecomment-2036797027

New merge requests are stuck, just like yesterday:

https://github.com/elchininet/postcss-rtlcss/pull/293

korya commented 3 months ago

We are experiencing the same problem. Any update on this?

jrfnl commented 3 months ago

I can confirm what @elchininet says - looks like there is a backlog queue of builds to be processed. I got the Coveralls coverage comment on one of the PRs I listed above just ten minutes ago, so a good 14 hours after the jobs were run and the coverage reports were submitted.

jrfnl commented 3 months ago

@afinetooth James, sorry for pinging you on this, but could you give us some insight into what is going on and what sort of timeline for resolution we'll be looking at?

Not trying to put pressure on things, just trying to get some idea of timeline to adjust my own planning (as I won't run releases without a passing build).

jeremybanka commented 3 months ago

I had the same issue, but in the end the coverage was reported almost ten hours later:

https://github.com/elchininet/custom-sidebar/pull/81#issuecomment-2036797027

New merge requests are stuck, just like yesterday:

https://github.com/elchininet/postcss-rtlcss/pull/293

Same. This morning everything's made it through my queue, but any new jobs are stuck pending, and presumably will be stuck for about ten hours.

elchininet commented 3 months ago

At least they have noticed the issue already.

image

elchininet commented 3 months ago

Same here: after 10 hours, the MR that I posted before got the Coveralls response: https://github.com/elchininet/postcss-rtlcss/pull/293

afinetooth commented 3 months ago

All, thank you for your reports. Apologies for the delayed response here, we've been focused on issues reported directly to support@coveralls.io. I just want to encourage everyone here to go ahead and email us at support@coveralls.io even if you have an open source (free) subscription / no subscription.

@jrfnl thanks for the heads up.

I will get an update and circle back on your individual reports here asap.

afinetooth commented 3 months ago

All, we believe we have resolved this issue, which you can read more about here: https://status.coveralls.io/incidents/cj5d1148cz1b

At this point, that means that all previously unfinished builds or delayed status updates should now be completed / received. If that's not the case for you, please let me know.

It's getting late for me, so I'll leave this at the general update, but I'll check back tomorrow to address individual cases (especially if anyone's still having issues).

Thanks for your patience.

jrfnl commented 3 months ago

@afinetooth Thank you for these updates. I imagine it was a stressful day for you all. Get a good rest and thank you for all you do.

elchininet commented 3 months ago

Hi @afinetooth,

Just letting you know as soon as possible, because the status page says "ALL SYSTEMS OPERATIONAL". I submitted a PR a few minutes ago and it finished OK. A new one from 10 minutes ago first threw a 502 when calling Coveralls; the second time it finished OK, but the pull request is in a hanging state.

Regards
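For anyone else who hits a one-off 502 like the one described above, where the second attempt went through, here is a minimal retry sketch. It is only an illustration under stated assumptions: `send` is a hypothetical callable that performs one upload attempt and returns the HTTP status code, and the set of status codes treated as transient is my own choice, not anything documented by Coveralls.

```python
import time
from typing import Callable

# Status codes treated as transient in this sketch (an assumption).
TRANSIENT_STATUSES = {502, 503, 504}


def upload_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-transient status or attempts run out.

    `send` is a hypothetical callable that performs one upload attempt and
    returns the HTTP status code; it is not part of any Coveralls client.
    """
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in TRANSIENT_STATUSES:
            return status
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt returned a transient status; hand back the last one.
    return status
```

Note that a retry only helps with the transient 502; it would not change a pull request status that stays pending after a successful upload.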

afinetooth commented 3 months ago

Hi @elchininet. Thanks. I'll have a look.

I was just about to look at individual cases here. I'll start with this one!

elchininet commented 3 months ago

Maybe something that could help you debug: on the repo page, if I refresh it multiple times, sometimes it shows that the task finished and sometimes that it is pending. But it is hanging on GitHub.

image

image

afinetooth commented 3 months ago

Hi @elchininet. To follow up on this one:

At first, the issue didn't appear to fit yesterday's pattern. Aside from the 502 (which could have been a network traffic issue at our edge service provider, Cloudflare), the issue appears to be some missing build info, namely the commit SHA for the PR HEAD, and, as a result, PR Info and a number of other details we get from the GitHub API right after creation.

From my Admin view: https://coveralls.io/builds/66748402

Screenshot 2024-04-05 at 11 00 31 AM Screenshot 2024-04-05 at 11 00 57 AM

I am looking deeper to find out why we couldn't obtain that info.
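For readers who want to check this on their own side: the PR HEAD commit SHA is exposed by the GitHub REST API as `head.sha` on the pull request object returned by `GET /repos/{owner}/{repo}/pulls/{pull_number}`. Below is a minimal sketch that reads it from an already-parsed response. `pr_head_sha` is a hypothetical helper name, and this says nothing about how Coveralls obtains or stores the value internally.

```python
from typing import Any, Mapping, Optional


def pr_head_sha(pull_request: Mapping[str, Any]) -> Optional[str]:
    """Return the head commit SHA from a GitHub pull request payload, or None.

    `pull_request` is the parsed JSON body of
    GET /repos/{owner}/{repo}/pulls/{pull_number}.
    """
    # `head` may be absent or null in a partial payload; treat both as missing.
    head = pull_request.get("head") or {}
    sha = head.get("sha")
    return sha if isinstance(sha, str) and sha else None
```

A `None` result here would correspond to the "missing build info" case described above, where the details that depend on the SHA cannot be looked up.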

[...] if I refresh it multiple times, sometimes it shows that the task finished and sometimes that it is pending

Wow! That's a good catch. Never seen that before, but I'm sure that has something to do with it.

Interestingly, the build page itself consistently shows the build is "completed" as opposed to "pending," no matter how many times I refresh it: https://coveralls.io/builds/66748402

Screenshot 2024-04-05 at 11 02 54 AM

So I'm assuming the above issue is a UI logic problem in the RECENT BUILDS section of the Repo Page.

Looking into that as well. BRB.

elchininet commented 3 months ago

the issue appears to be some missing build info, namely the commit SHA for the PR HEAD, and, as a result, PR Info and a number of other details we get from the GitHub API right after creation.

I can try to close the PR and open it again to see if it solves the issue.

Interestingly, the build page itself consistently shows the build is "completed" as opposed to "pending," no matter how many times I refresh it:

In my case it still shows the status in a random way, and it appears here. On the details page of the job, I see the same as you.

By the way, here is something I have noticed since yesterday that is not related to this particular issue. The buttons to open more details on a pending task go nowhere:

image

image

image

image

afinetooth commented 3 months ago

@elchininet interesting. I haven't been able to refresh the build page and see the "pending completion" state, but I see you're seeing it from the graphic you pasted, so I'll keep trying.

But, yes, regarding the buttons that go nowhere: I think the UI implementation was incomplete and the link never got a URL, so it is currently pulling up some default URL that's not active. I've added that to a ticket to fix asap, but you can ignore it for now. It will most likely go to our doc on Parallel Builds, or to a Common Issue.

elchininet commented 3 months ago

Ok, no worries for the UI, it is not important.

Should I close the PR and open it again? Or maybe force-push with another SHA? Let me know if you need the PR in that state for your investigation; I don't need that change to be merged immediately.

afinetooth commented 3 months ago

@elchininet

the issue appears to be some missing build info, namely the commit SHA for the PR HEAD, and, as a result, PR Info and a number of other details we get from the GitHub API right after creation.

I can try to close the PR and open it again to see if it solves the issue.

Don't worry too much about it if nothing changes after a try or two. I am looking into it right now to see what we received in your report, or if there's another reason we couldn't obtain or record it. That will probably turn up the cause.

I suspect it relates to your current CI config, so question for you:

Are you sending us coverage reports for:

elchininet commented 3 months ago

I suspect it relates to your current CI config, so question for you:

Are you sending us coverage reports for:

I use the same config for all my repos. It is the first one: Minimum Recommend CI. push on master and pull_requests.

adamdupuis commented 3 months ago

All, thank you for your reports. Apologies for the delayed response here, we've been focused on issues reported directly to support@coveralls.io. I just want to encourage everyone here to go ahead and email us at support@coveralls.io even if you have an open source (free) subscription / no subscription.

@afinetooth Personally, I am very thankful this issue was posted here as it enabled my team member to find it and realize it's an issue with Coveralls and not just us. Otherwise we would have spent more time investigating, wasting more time (because the status page was not yet updated). Support tickets are great when it's a specific issue, but when it's a general outage, a public post is better.

afinetooth commented 3 months ago

I am attempting to reply to each poster here individually, but I will do it in this one comment for now.

If you have any further feedback or questions, please just reply and we'll continue in our own thread of comments.

Posters in order:

@jrfnl I checked both of your repos above, for PHPCSStandards and Yoast, and both look normal with all recent builds complete and status updates sent to GitHub. Please let me know if you're seeing any other issues. Otherwise, I'll consider your particular cases here resolved. Thanks.

@jeremybanka I reviewed your recent builds and all looks well, and confirmed that your PR has received both PR Comments and Status Update from Coveralls. Please let me know if you're seeing any other issues. Otherwise, I'll consider your particular case here resolved. Thanks.

@elchininet we discussed more above, but I have confirmed your recent builds look normal and that you've received PR comments and status updates on the two PRs you mentioned. Please let me know if you're seeing any other issues. Otherwise, I'll consider your particular case here resolved. Thanks.

@korya You didn't post a particular build or repo, but I checked all of your recent builds for all of your repos and all looks well. Please let me know if that's not the case. Otherwise, I'll consider your particular case here resolved. Thanks.

@adamdupuis Thanks for the feedback. We try our best to respond to public issues here, but often can't get to them for days or weeks due to the support load from paid accounts. That said, we do try to respond to any new public issues when there is an incident and keep all users updated there. In the meantime, I would suggest subscribing to our status page to receive updates about any further incidents. Let me know if you have any remaining issues / cases you'd like me to review, I'd be happy to. Thanks.

elchininet commented 3 months ago

Hi @afinetooth, I have submitted a new PR with the same changes and that one completed successfully, even though Coveralls randomly shows it as pending, as it did with the previous one (the previous one now always shows as resolved). I can't find any explanation for the UI behavior or for why the previous one is still stuck 🤷🏼‍♂️ but yes, take my issue as resolved.

image

elchininet commented 3 months ago

Hi @afinetooth, Just a heads up. It started again: jobs from this afternoon are still waiting for completion, and a new one from 10 minutes ago is following the same path. https://coveralls.io/github/elchininet/postcss-rtlcss Regards

afinetooth commented 3 months ago

All, we have published a postmortem on the incident that affected your builds on Wed, Apr 3-Thu, Apr 4: https://status.coveralls.io/incidents/cj5d1148cz1b

Let me know if you have any questions, or thoughts or ideas you'd like to share.

Thanks again for your patience.

afinetooth commented 3 months ago

@elchininet I am looking into your recent builds, but I don't believe your current issues relate to the incident from last week. We believe that incident has been fully resolved since last Thu at 8:30pm.

Some of the behavior does look similar, though.

The similarities are basically the "pending completion" status for some of your builds.

The differences, though, are as follows:

  1. "Pending completion" state, in your case, is different - Your builds are non-parallel, so they should theoretically never be "pending completion," as that should only happen for parallel builds that have not been closed, or that have failed to complete for other reasons. I am not sure right now how your "serial" builds have wound up in this state. I am investigating that.
  2. Your build times are not slow - Your recent builds, even though they are "pending completion," have all completed in 1 second to 1 minute. The incident from last week was caused by underlying backups in background job queues that were resulting in slow build times of several hours or more.

For these reasons I think we have a different cause, which I'm still investigating and will feed back on asap.
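The distinction in point 1 can be sketched as a tiny model. This is only an illustration of the rule as described in this comment, not Coveralls' actual logic: the `Build` fields and the `expected_state` function are hypothetical names, and the sketch ignores the "failed to complete for other reasons" case.

```python
from dataclasses import dataclass


@dataclass
class Build:
    parallel: bool        # True if the build is made up of parallel jobs
    closed: bool = False  # True once a parallel build has been closed


def expected_state(build: Build) -> str:
    """Return the state the description above would predict for a build."""
    # Only a parallel build that has not been closed should be waiting.
    if build.parallel and not build.closed:
        return "pending completion"
    # Serial builds, and parallel builds that were closed, should be done.
    return "completed"
```

Under this model a serial build such as `Build(parallel=False)` is never expected to be "pending completion", which is why that state looks anomalous here.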

elchininet commented 3 months ago

Hi @afinetooth,

I really don't know if my issues are the same as the incident of last week, but the results are the same: my merge requests stay in a hanging state waiting for Coveralls. I cannot merge those merge requests and solve the open issue in my repo just because of that:

https://github.com/elchininet/postcss-rtlcss/pull/299

https://github.com/elchininet/postcss-rtlcss/pull/296

If you can help me with that I will appreciate it.

Note: all the rest of my repos are working as always. I am having issues only with this repo, which has the same configuration as the rest and has been working that way for years.

Regards

jrfnl commented 3 months ago

@afinetooth With the incident resolved and the post-mortem published, shall we close this issue?

afinetooth commented 2 months ago

@jrfnl Sounds great. I've done that. 🙏

afinetooth commented 2 months ago

@elchininet since I believe your issue falls outside the context of the incident that was the subject of this issue, let's please move your issue to another ticket. I'll leave it up to you if you'd like to create a new ticket here or email me at support@coveralls.io, but either way, I will follow up as soon as I can and reply back (in email or to a new issue that I can create if you haven't already).

korya commented 2 months ago

@afinetooth Thanks for the detailed post-mortem. I confirm that this issue is fixed for us.

elchininet commented 2 months ago

@elchininet since I believe your issue falls outside the context of the incident that was the subject of this issue, let's please move your issue to another ticket. I'll leave it up to you if you'd like to create a new ticket here or email me at support@coveralls.io, but either way, I will follow up as soon as I can and reply back (in email or to a new issue that I can create if you haven't already).

Perfect, I'll create another issue and in that way we don't spam others that have already resolved their issues.

I already sent an email on Sunday to support@coveralls.io

Regards