magefree / mage

Magic Another Game Engine
http://xmage.today
MIT License
1.81k stars 749 forks source link

Server crashes a lot recently (memory overflow restarts) #11285

Open CozyDoomer opened 9 months ago

CozyDoomer commented 9 months ago

The interval varies quite a lot but in my experience, it happens roughly every few minutes to 2 hours.

People in the in-game chat are saying it happens when a near-infinite amount of triggers are created (e.g. storm with way more than 20 triggers) but I have no evidence for this. I do know however that recently the game has been almost unplayable due to the crashes so I'd like to help where I can.

@JayDi85 do you have anything in your logs? Also not sure if this came up already - I couldn't find any issues on this except maybe #10656.

JayDi85 commented 9 months ago

There are only two candidates:

  1. Commander cards with miss init code (details in #11081). That error can freeze game and run infinite errors logs until server's restart. It's not a 100% of all memory overflow/restarts.
  2. Transformable cards with too much calls of game cycle: It's a daybound/nightbound cards like [[Child of the Pack]] that can cause too much calls of checkStateBasedActions -> transform in the games. If you put that cards to the battlefield then it will dramatically increase cpu usage and applyEffects calls, see attached screenshot from one of the heavy server's game:

shot_231009_084729

shot_231009_085259

github-actions[bot] commented 9 months ago

Child of the Pack // Savage Packmate - (Gatherer) (Scryfall) (EDHREC)

{2}{R}{G} Creature — Human Werewolf 2/5 {2}{R}{G}: Create a 2/2 green Wolf creature token. Daybound (If a player casts no spells during their own turn, it becomes night next turn.) :arrows_counterclockwise: Creature — Werewolf 5/5 Trample Other creatures you control get +1/+0. Nightbound (If a player casts at least two spells during their own turn, it becomes day next turn.)

whispy commented 9 months ago

+1, the server crashes all of the time. I'd be willing to contribute to a bounty to get this fixed and reduce crashes.

JayDi85 commented 9 months ago

There are added new global features like Prototype mechanics (#11249), so prepare for more unstable times until this whole things stabilises.

ssk97 commented 9 months ago

There are added new global features like Prototype mechanics (#11249), so prepare for more unstable times until this whole things stabilises.

I don't really see how Prototype could cause a server crash, worst case would be a game crash that doesn't touch the rest of the server, right? Unless the single boolean added to each permanent somehow is too much extra data...

JayDi85 commented 9 months ago

worst case would be a game crash that doesn't touch the rest of the server, right?

Nope. All server restarts related to the memory overflow error: java's garbage collector (GC) can't clean and free objects due circular references between game states or other game objects. So one broken game can eats all server's memory and crash whole server. Under the hood: xmage uses thread's pool to process games and you can't limit memory usage for specific thread/game.

The main problem of memory overflow errors -- it's hard to catch and repeat. It must be a combo of two things:

Well, whole xmage code base can generates such "shared" objects, but it's ok in 99% due game lifecycle. On game ends it will be closed and all game's memory will be free anyway. But inifinite/freeze game doesn't ends and eats memory by time.

There are possible another use cases like infinite number of objects like AI and mana variations calculation, AI and infinite choose dialogs, infinite number of tokens (no more actual), etc. But it's more easy to catch.

P.S. The main "problem" of new big features -- it adds new batch of possible bad combos or use cases to generate such infinite loops and freezes. So players can find out new steps to reproduce it on the server.

P.P.S. Related topics about memory problem:

ssk97 commented 9 months ago

P.S. The main "problem" of new big features -- it adds new batch of possible bad combos or use cases to generate such infinite loops and freezes. So players can find out new steps to reproduce it on the server.

Right, but Prototype specifically isn't really doing anything new there that Transform isn't already doing. Though it does seem that Transform does have performance issues which, if true, could be made worse by Prototype I suppose. It doesn't involve any game.applyEffects() calls though so the performance issue mentioned above shouldn't matter.

JayDi85 commented 9 months ago

One of the use case from #9302:

Putting 100s of abilities on the stack at once will crash the server Easiest way to reproduce, add 10-30 [[Sharding Sphinx]] to your board and a [[Platinum Angel]] to the AI so they won't lose the game. When damage is deal this will put a few hundred triggers from the Sharing Sphinx on the board. This will cause the server to lock up a large amount of RAM (20 Sphinxes will lock up ~450MB of ram). This ram can't be CG'ed away until the stack empties.

ShadowKayoto commented 9 months ago

I'd also like to chime in that this should be the most urgent issue to look into and address. No amount of coding efforts on this project matter if no one can actually enjoy what's been created and maintained all this time, which is currently the case.

Is there any way to identify and prevent some of these cases from occurring? It looks like you have metrics and call stacks available, so figuring out the most frequent causes should be possible, and then prioritizing the fixes from there.

Like @whispy mentioned already, I'm also willing to donate towards the goal of getting the server stabilized.

JayDi85 commented 9 months ago

Load tests can be useful to find problems with the cards or AI. Also it allows to test cards and server's stability.

As example:

shot_231011_151841

shot_231011_150253

JayDi85 commented 9 months ago

Main memory problem fixed, so server must be more stable. See details here: https://github.com/magefree/mage/issues/9302#issuecomment-1762853919. But there are still todos and check for next updates, e.g. CPU usage on Transform ability (see above).

JayDi85 commented 9 months ago

One of the unknown use case with 90% CPU freeze: Constructed - Legacy | DaddyGreen, Urzasdestiny [quit] | Two Player Duel |

AddCountersSourceEffect and some triggers:

shot_231014_225008 shot_231014_223948 shot_231014_230957

Empty stack and permanents on battlefield (except lands, I'm not sure that it's a moment of high CPU or after it, but it's a problem game 100%):

github-actions[bot] commented 9 months ago

City of Traitors - (Gatherer) (Scryfall) (EDHREC)

Land When you play another land, sacrifice City of Traitors. {T}: Add {C}{C}.

Grim Lavamancer - (Gatherer) (Scryfall) (EDHREC)

{R} Creature — Human Wizard 1/1 {R}, {T}, Exile two cards from your graveyard: Grim Lavamancer deals 2 damage to any target.

Dragon's Rage Channeler - (Gatherer) (Scryfall) (EDHREC)

{R} Creature — Human Shaman 1/1 Whenever you cast a noncreature spell, surveil 1. (Look at the top card of your library. You may put that card into your graveyard.) Delirium — As long as there are four or more card types among cards in your graveyard, Dragon's Rage Channeler gets +2/+2, has flying, and attacks each combat if able.

Delver of Secrets // Insectile Aberration - (Gatherer) (Scryfall) (EDHREC)

{U} Creature — Human Wizard 1/1 At the beginning of your upkeep, look at the top card of your library. You may reveal that card. If an instant or sorcery card is revealed this way, transform Delver of Secrets. :arrows_counterclockwise: Creature — Human Insect 3/2 Flying

Chrome Mox - (Gatherer) (Scryfall) (EDHREC)

{0} Artifact Imprint — When Chrome Mox enters the battlefield, you may exile a nonartifact, nonland card from your hand. {T}: Add one mana of any of the exiled card's colors.

xenohedron commented 9 months ago

Thanks for your deep research and improvements @JayDi85 , great work

JayDi85 commented 9 months ago

Another one possible use case from a problem game:

github-actions[bot] commented 9 months ago

Rite of Replication - (Gatherer) (Scryfall) (EDHREC)

{2}{U}{U} Sorcery Kicker {5} (You may pay an additional {5} as you cast this spell.) Create a token that's a copy of target creature. If this spell was kicked, create five of those tokens instead.

Ancient Greenwarden - (Gatherer) (Scryfall) (EDHREC)

{4}{G}{G} Creature — Elemental 5/7 Reach You may play lands from your graveyard. If a land entering the battlefield causes a triggered ability of a permanent you control to trigger, that ability triggers an additional time.

Primal Vigor - (Gatherer) (Scryfall) (EDHREC)

{4}{G} Enchantment If one or more tokens would be created, twice that many of those tokens are created instead. If one or more +1/+1 counters would be put on a creature, twice that many +1/+1 counters are put on that creature instead.

Nyxbloom Ancient - (Gatherer) (Scryfall) (EDHREC)

{4}{G}{G}{G} Enchantment Creature — Elemental 5/5 Trample If you tap a permanent for mana, it produces three times as much of that mana instead.

jeffwadsworth commented 9 months ago

@JayDi85 I could have sworn I saw that "test_TwoAIPlayGame_Multiple" test in the test folder in the past, but now I can't find it.

JayDi85 commented 9 months ago

@jeffwadsworth It ignored by default. Mage.Verify project -> LoadTest.java -> test_TwoAIPlayGame_Multiple

Susucre commented 9 months ago

@JayDi85 an user on the Discord reported a crash of the beta server. The reported situation was 500 [[Ramos Dragon Engine]] tokens then triggering them all by casting a multicolored spell.

github-actions[bot] commented 9 months ago

Ramos, Dragon Engine - (Gatherer) (Scryfall) (EDHREC)

{6} Legendary Artifact Creature — Dragon 4/4 Flying Whenever you cast a spell, put a +1/+1 counter on Ramos, Dragon Engine for each of that spell's colors. Remove five +1/+1 counters from Ramos: Add {W}{W}{U}{U}{B}{B}{R}{R}{G}{G}. Activate only once each turn.

JayDi85 commented 7 months ago

Need more combo examples to crash a server (triggers overflow, etc).

JayDi85 commented 7 months ago

Last crash from that game with high CPU usage: [Not CDEH/ No infinite combo] Martin - azert - B24M - vsdiaz shot_231128_184452

Don't know the real battlefield situation or steps to reproduce, but CPU "stack" looks very strange cause LookAtTopCardOfLibraryAnyTimeEffect is ContinuousEffect -- it applies on every game cycle. BUT it calls a network code to send data to users (fireUpdatePlayersEvent): shot_231128_191411 shot_231128_191459

I think it was a workaround to force a user update (in old days reveals/lookat windows may miss on choose dialog until next update/avatar click). So that broken code (fire event) must be removed from all places, not only continues effects: shot_231128_192109

JayDi85 commented 7 months ago

Just interesting screenshot with good and bad games example:

shot_231128_194358

JayDi85 commented 7 months ago

well, it's can be a real cheater/troll now -- crashed server on main phase of turn 1 (T1M1) with 300 replacement effects and 500 permanents on battlefield

GAME started 5234893d-a6e0-4ed7-baf3-2840f39599cf [] icarus - oranges

Freeze on choose chooseReplacementEffect with 230 effects to choose: shot_231129_034327

3 auto-selected choices by user: [[Ancient Greenwarden]], [[Nyxbloom Ancient]] and [[Primal Vigor]] shot_231129_033018

1 spell on stack: [[Mythos of Illuna]] shot_231129_033416

~500 permanents on battlefield: shot_231129_033337 shot_231129_033344 shot_231129_033351

github-actions[bot] commented 7 months ago

Ancient Greenwarden - (Gatherer) (Scryfall) (EDHREC)

{4}{G}{G} Creature — Elemental 5/7 Reach You may play lands from your graveyard. If a land entering the battlefield causes a triggered ability of a permanent you control to trigger, that ability triggers an additional time.

Nyxbloom Ancient - (Gatherer) (Scryfall) (EDHREC)

{4}{G}{G}{G} Enchantment Creature — Elemental 5/5 Trample If you tap a permanent for mana, it produces three times as much of that mana instead.

Primal Vigor - (Gatherer) (Scryfall) (EDHREC)

{4}{G} Enchantment If one or more tokens would be created, twice that many of those tokens are created instead. If one or more +1/+1 counters would be put on a creature, twice that many +1/+1 counters are put on that creature instead.

Mythos of Illuna - (Gatherer) (Scryfall) (EDHREC)

{2}{U}{U} Sorcery Create a token that's a copy of target permanent. If {R}{G} was spent to cast this spell, instead create a token that's a copy of that permanent, except the token has "When this permanent enters the battlefield, if it's a creature, it fights up to one target creature you don't control."

JayDi85 commented 7 months ago

Yes, it's a real troll, played 1 vs 1 game from same computer, but with two diff clients. From USA, IP address recorded too.

JayDi85 commented 7 months ago

Another use case from a normal game with infinite game freeze on mana calculations -- only 25 permanents on battlefield, but something do an explosive {any} mana growth in available mana calculation.

It must have additional limits for such use cases.

shot_231129_094002 shot_231129_093820 shot_231129_093435

JayDi85 commented 7 months ago
JayDi85 commented 6 months ago

Server stats from last week: ~7 restarts per day

JayDi85 commented 5 months ago

Found two use cases for crash protection tests in old reddit’s topic: https://www.reddit.com/r/XMage/comments/ikroo7/i_love_when_i_get_spectators/

TODO:

1 - tokens:

The main deck is OmniTell, that is: Legacy Show and Tell into Omniscience to cast stuff for free, but I've built a version that basically exists as a gun, and the ammo is whatever is in my sideboard.

So the part that crashes the client is to play Opalescence turning all Enchantments into creatures, then playing a Dual Nature to get tokens of creatures (enchantments) and Doubling Seasons to double the number of tokens. However, when Dual Nature enters the battlefield, Opalescence makes it a creature, so it triggers itself. That means there are two Dual Natures. When I play Doubling Season, two Dual Nature triggers go onto the stack, trying to each place a token copy of Doubling Season. After some number of these resolve the next Dual Nature trigger to resolve places token Doubling Seasons doubled for every other token. This makes it grow very very quickly in number, as it basically creates 2^N + N tokens for each Dual Nature trigger to resolve where N is the number of Doubling Seasons. So for two plays of Doubling Season (which is where the game lags hard and locks up) this is about 2^2059 tokens, each a 5/5 creature token copy of Doubling Season.

2 - triggers:

If one were to use this deck in real life, you can easily keep track of the number of tokens and just put some piece of paper to represent it, right?

Wrong.

I also have included in the deck Grip of Chaos, which will be copied a large number of times. Then if I were to happen to play a spell which targets a creature, ALL of those triggers go onto the stack and EACH one triggers ALL of them. In a tournament, this means you have to call over a judge to resolve them, and they would be very disappointed in me.

3 - more triggers with double season IMG_9901 IMG_9902

JayDi85 commented 5 months ago

TODO:

xenohedron commented 5 months ago

TODO: find and replace inner classes to static (possible memory leaks). ~50 places to fix.

I'm starting to work on this

xenohedron commented 5 months ago

TODO: find and replace inner classes to static (possible memory leaks). ~50 places to fix.

Most addressed, see #11798 and #11810

JayDi85 commented 4 months ago

TODO:

Related details: shot_240330_151414 shot_240330_151139 shot_240330_145653 shot_240330_152314

github-actions[bot] commented 4 months ago

Vesuvan Shapeshifter - (Gatherer) (Scryfall) (EDHREC)

{3}{U}{U} Creature — Shapeshifter 0/0 As Vesuvan Shapeshifter enters the battlefield or is turned face up, you may choose another creature on the battlefield. If you do, until Vesuvan Shapeshifter is turned face down, it becomes a copy of that creature, except it has "At the beginning of your upkeep, you may turn this creature face down." Morph {1}{U}

Hooded Hydra - (Gatherer) (Scryfall) (EDHREC)

{X}{G}{G} Creature — Snake Hydra 0/0 Hooded Hydra enters the battlefield with X +1/+1 counters on it. When Hooded Hydra dies, create a 1/1 green Snake creature token for each +1/+1 counter on it. Morph {3}{G}{G} As Hooded Hydra is turned face up, put five +1/+1 counters on it.

JayDi85 commented 4 months ago

So current workaround: leave more free memory for the system and docker. Example:

Use follow formula:

JayDi85 commented 4 months ago

Last crashes was due constant memory leaks without cpu usage or objects overflow. And there is a new suspect - network library with buggy finalizers (source of the some leaks). Maybe new server session code can cause it and require some fixes: #11527

JayDi85 commented 4 months ago

Intermediate report after more researches and memory dumps analysing from last weeks (server with disabled AI):

  1. [x] Memory leaks by games (e.g. too many game states) -- no, it's not actual any more. No freeze games, no infinite game states generation;
  2. [x] Memory leaks by another code (e.g. third party library, see prev comment) -- no, it's not actual. It has some potentially leaked traces but it's too small (few KBs, not MBs or GBs);
  3. [x] 100% CPU usage like infinite transform -- no, it's not actual. Transform bug exists, but real games and server can do nothing with it. There were only few use cases with high CPU usage (maybe related to transform, maybe related to another heavy calculations), but it last for few minutes without any consequences (no crashes, no memory overuse);

But I found very perspective problem:

Server's memory usage example: shot_240315_205747

GC works example - it ignore most of the heap (old gen objects) until manually GC call (new and old gen objects clean): shot_240315_202309

No memory leaks after manual GC call: shot_240315_201918

Here is related SO topic with possible solutions for such problem. Experiments with the server will continue. Looks like it coming to finish.

JayDi85 commented 4 months ago

BTW server with 30 active games uses less than 500 MB after 12 hours uptime -- it's related to low hardware requirements and good/stable performance.

shot_240315_212003

JayDi85 commented 4 months ago

dharma-lost

JayDi85 commented 3 months ago

No more memory crashes, so server live longer to catch worse bugs like empty players list and no started games from #5358.

Real reason: infinite code freeze in string replace (!!!) and server's data lock, so all other code waiting data unlock. It's critical error and contains two independent problems.

TODO:

Combo of both problems can broke whole server (no out of memory, so it must be manually restarted only): shot_240321_102026

Example of infinite freeze and 100% CPU in GainAbilityAttachedEffect with formatRule usage: shot_240321_031315 shot_240321_030806

Example of infinite freeze and 100% CPU in getReminderText: shot_240321_102423 shot_240321_102525

Another example with String.replace - String.indexOf:

java.lang.Thread.State: RUNNABLE
    at java.lang.String.indexOf(String.java:1503)
    at java.util.regex.Matcher.quoteReplacement(Matcher.java:701)
    at java.lang.String.replace(String.java:2240)
    at mage.abilities.costs.AlternativeCostImpl.getReminderText(AlternativeCostImpl.java:66)
    at mage.abilities.costs.AlternativeSourceCostsImpl.getRule(AlternativeSourceCostsImpl.java:93)
    at mage.abilities.AbilitiesImpl.getRules(AbilitiesImpl.java:61)
    at mage.abilities.AbilitiesImpl.getRules(AbilitiesImpl.java:49)
    at mage.util.CardUtil.getCardRulesWithAdditionalInfo(CardUtil.java:1209)
    at mage.util.CardUtil.getCardRulesWithAdditionalInfo(CardUtil.java:1197)
    at mage.cards.CardImpl.getRules(CardImpl.java:235)
    at mage.util.ManaUtil.getColorIdentity(ManaUtil.java:639)
    at mage.cards.CardImpl.getColorIdentity(CardImpl.java:858)
    at mage.view.CardView.setOriginalValues(CardView.java:1174)
    at mage.view.CardView.<init>(CardView.java:340)

After some research and tests -- 100% CPU usage related to memory overflow by big amount of triggers and other data. Latest java 8 affected by string replace freezing bug anyway. So it's still actual.

TODO:

shot_240321_030523

It stuck at pick cards: shot_240321_103124

It can continue on player quit or submit/autosubmit only (maybe notify calls miss in some use cases/places): shot_240321_103109 shot_240321_103052

JayDi85 commented 3 months ago

Interesting game/server freeze example with wrong "synchronized" usage (unnecessary thread locks problem): shot_240409_122949

Original reason: too many mana variations bug from https://github.com/magefree/mage/issues/11285#issuecomment-1831260668, but it show up additional problems with threads lock and game updates -- if game "freeze" then users can't do anything and all requests will be freeze too.

TODO:

P.S. More details and steps to reproduce for sync lock bug. It's interesting -- game ended already (I think due concede of all players), but can't continue and finish due thread lock: shot_240409_124125

JayDi85 commented 2 months ago
JayDi85 commented 2 months ago
github-actions[bot] commented 2 months ago

Basalt Monolith - (Gatherer) (Scryfall) (EDHREC)

{3} Artifact Basalt Monolith doesn't untap during your untap step. {T}: Add {C}{C}{C}. {3}: Untap Basalt Monolith.

Kinnan, Bonder Prodigy - (Gatherer) (Scryfall) (EDHREC)

{G}{U} Legendary Creature — Human Druid 2/2 Whenever you tap a nonland permanent for mana, add one mana of any type that permanent produced. {5}{G}{U}: Look at the top five cards of your library. You may put a non-Human creature card from among them onto the battlefield. Put the rest on the bottom of your library in a random order.

JayDi85 commented 2 months ago

Each server game works in game thread (GameWorker) and must be processed from start to end by that thread. All errors must be processed inside. But some code can cause it to quit without real game finish. That's game will be detached (it will be store in active games list until server restart, but nothing works with it).

Table health check must find and delete it, but it can't -- maybe due active/connected game's players. Server example: ~2% of all games keep active.

Two possible solutions:

It's not critical bug -- just showing that some errors/bugs occurred and must be researched.

JayDi85 commented 2 months ago

shot_240428_044655 shot_240428_044720

JayDi85 commented 2 months ago

After one of the few updates disconnect process starting to freeze sometime for 5-10 seconds for unknown reason (connect to public server, do nothing and try to disconnect). Maybe related to wrong synchronized usage on busy server from [above comments]

(https://github.com/magefree/mage/issues/11285#issuecomment-2044515400). shot_240428_045705

JayDi85 commented 1 month ago
github-actions[bot] commented 1 month ago

Energy Refractor - (Gatherer) (Scryfall) (EDHREC)

{2} Artifact When Energy Refractor enters the battlefield, draw a card. {2}: Add one mana of any color.