magento / magento2


Data transfer between web server and redis cache is too high #33685

Open onlinebizsoft opened 3 years ago

onlinebizsoft commented 3 years ago

Summary (*)

We run a Magento site on multiple servers across multiple AWS regions; the website has multiple domains and multiple languages, and the catalog has many products as well.

We realized that the data transfer cost on the AWS bill is much higher than normal. It accounts for up to 50-60% of the total.

We figured out that most of the data transfer was from ElastiCache (one region) to EC2.

Examples (*)

Proposed solution

From our side, we are looking at two things:

  1. Setup multiple regions for elasticache https://aws.amazon.com/blogs/database/reduce-cost-and-boost-throughput-with-global-datastore-for-amazon-elasticache-for-redis/
  2. Enable L2 caching https://devdocs.magento.com/guides/v2.4/config-guide/cache/two-level-cache.html

I think the core code needs to be reworked.
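For reference, option 2 (the two-level / L2 cache from the linked guide) ends up as an env.php change roughly like the sketch below. This is only a sketch: the hostname and the local cache directory are placeholders, and the exact option names should be checked against the guide.

```php
<?php
// app/etc/env.php (sketch, values are placeholders)
return [
    // ... other settings ...
    'cache' => [
        'frontend' => [
            'default' => [
                // L2 cache: a local cache on each web node in front of the shared Redis cache
                'backend' => '\\Magento\\Framework\\Cache\\Backend\\RemoteSynchronizedCache',
                'backend_options' => [
                    'remote_backend' => '\\Magento\\Framework\\Cache\\Backend\\Redis',
                    'remote_backend_options' => [
                        'server' => 'elasticache-endpoint',  // placeholder
                        'port' => '6379',
                        'database' => 1,
                        'compress_data' => '1',
                    ],
                    // local copy kept on each web node, e.g. in shared memory
                    'local_backend' => 'Cm_Cache_Backend_File',
                    'local_backend_options' => [
                        'cache_dir' => '/dev/shm/',
                    ],
                ],
            ],
        ],
    ],
];
```

The idea is that repeated reads are answered from the local copy on each web node, so mostly writes, invalidations and misses travel to ElastiCache, which is what should bring the data transfer down.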


Please provide Severity assessment for the Issue as Reporter. This information will help during Confirmation and Issue triage processes.

m2-assistant[bot] commented 3 years ago

Hi @onlinebizsoft. Thank you for your report. To help us process this issue please make sure that you provided the following information:

Please make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, please, add a comment to the issue:

@magento give me 2.4-develop instance - upcoming 2.4.x release

For more details, please, review the Magento Contributor Assistant documentation.

Please, add a comment to assign the issue: @magento I am working on this


:clock10: You can find the schedule on the Magento Community Calendar page.

:telephone_receiver: The triage of issues happens in the queue order. If you want to speed up the delivery of your contribution, please join the Community Contributions Triage session to discuss the appropriate ticket.

:movie_camera: You can find the recording of the previous Community Contributions Triage on the Magento Youtube Channel

:pencil2: Feel free to post questions/proposals/feedback related to the Community Contributions Triage process to the corresponding Slack Channel

onlinebizsoft commented 3 years ago

P.S. I'm on 2.4.2, so it is not the same as https://github.com/magento/magento2/issues/32118

pmonosolo commented 3 years ago

> P.S. I'm on 2.4.2, so it is not the same as #32118

That issue is still present on 2.4.2-p1.

onlinebizsoft commented 3 years ago

@matiashidalgo @Vinai @hostep @mrtuvn @DavorOptiweb

mrtuvn commented 3 years ago

This seems more complex with a multiple-domain use case? Sorry, but I don't have much experience with this case! You should add some details: which version are you experiencing this problem on? Are any customized modules used?

onlinebizsoft commented 3 years ago

@mrtuvn yes, multiple domains, multiple websites.

I'm on 2.4.2 with many customizations, however I believe none of them is causing this. Our Magento installation is around 20GB (without media and without databases).

IbrahimS2 commented 3 years ago

@onlinebizsoft Have you configured FPC to utilise redis?

onlinebizsoft commented 3 years ago

@IbrahimS2 no, we ended up using a single zone for the whole system now, so all EC2 and ElastiCache instances are in the same zone (this cut the bill by 50-60%).

IbrahimS2 commented 3 years ago

@onlinebizsoft Please share the Redis configuration section from your env.php.

onlinebizsoft commented 3 years ago

@IbrahimS2 ............

    'x-frame-options' => 'SAMEORIGIN',
    'MAGE_MODE' => 'production',
    'cache' => [
        'frontend' => [
            'default' => [
                'backend' => 'Cm_Cache_Backend_Redis',
                'backend_options' => [
                    'server' => 'xxxxxxxxxxxxxxxxxxxxxxxxx',
                    'port' => '6379',
                    'persistent' => '',
                    'database' => 1,
                    'password' => '',
                    'force_standalone' => 0,
                    'connect_retries' => 2,
                    'read_timeout' => 10,
                    'automatic_cleaning_factor' => 0,
                    'compress_tags' => 1,
                    'compress_data' => 1,
                    'compress_threshold' => 20480,
                    'compression_lib' => 'gzip'
                ]
            ]
        ]
    ],
    'session' => [
        'save' => 'redis',
        'redis' => [
            'host' => 'xxxxxxxxxxxxxxxx',
            'port' => '6379',
            'password' => '',
            'timeout' => '2.5',
            'persistent_identifier' => '',
            'database' => '0',
            'compression_threshold' => '2048',
            'compression_library' => 'gzip',
            'log_level' => '1',
            'max_concurrency' => '16',
            'break_after_frontend' => '5',
            'break_after_adminhtml' => '30',
            'first_lifetime' => '600',
            'bot_first_lifetime' => '60',
            'bot_lifetime' => '7200',
            'disable_locking' => '1',
            'min_lifetime' => '60',
            'max_lifetime' => '2592000'
        ]
    ],
    'cache_types' => [
        'config' => 1, 'layout' => 1, 'block_html' => 1, 'collections' => 1, 'reflection' => 1,
        'db_ddl' => 1, 'compiled_config' => 1, 'eav' => 1, 'customer_notification' => 1,
        'config_integration' => 1, 'config_integration_api' => 1, 'full_page' => 1,
        'config_webservice' => 1, 'translate' => 1, 'vertex' => 1
    ],

..........................

mrtuvn commented 3 years ago

cc: @vzabaznov, who may know this area better.

onlinebizsoft commented 3 years ago

Please keep in mind that the data transfer between EC2 and Redis is much, much bigger than the response data served to traffic (from both nginx and Varnish).

I'm wondering whether some private-data AJAX actions could cause this data transfer? These AJAX actions might serve very small responses but still request and transfer large amounts of data from Redis.

mrtuvn commented 3 years ago

Regarding AJAX requests, it seems we have fixed this case here: https://github.com/magento/magento2/pull/31933 (landed in 2.4.3).

onlinebizsoft commented 3 years ago

@mrtuvn not really; it may help a bit, but it does not fix the whole problem. Also, there are more AJAX cases on any customized system. The root of the problem is how (and which) cache Magento 2 saves and fetches. In my case it is a big system with multiple stores, multiple languages and many products, and I'm not sure every cache component is split correctly per store... Our Redis instance reaches 60GB after around 2 days, 100GB after around 6 days...

Anyone have experience with an L2 caching setup? I'm not sure whether it is effective, because each web server will only have a very small local memory store.

mrtuvn commented 3 years ago

Yep, that's why I tagged someone in my previous reply above. He is the performance team lead.

onlinebizsoft commented 3 years ago

@vzabaznov

> Setup multiple regions for elasticache https://aws.amazon.com/blogs/database/reduce-cost-and-boost-throughput-with-global-datastore-for-amazon-elasticache-for-redis/

This approach doesn't work because only the primary instance accepts writes, and the other instances are read-only.

So I'm thinking we could have a workaround if we separate the read and write Redis connections in env.php.

What do you think?
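For what it's worth, the Redis cache backend Magento uses (colinmollenhour/cache-backend-redis) has a load_from_slave option (mentioned later in this thread) that could express such a read/write split in env.php. A rough sketch with placeholder endpoints; the exact syntax should be verified against the backend's README:

```php
<?php
// app/etc/env.php (sketch, endpoints are placeholders)
return [
    // ... other settings ...
    'cache' => [
        'frontend' => [
            'default' => [
                'backend' => 'Cm_Cache_Backend_Redis',
                'backend_options' => [
                    // writes always go to the primary endpoint
                    'server' => 'primary.xxxxxx.cache.amazonaws.com',
                    'port' => '6379',
                    // reads are served from a replica, ideally one in the same AZ as the web node
                    'load_from_slave' => 'replica.xxxxxx.cache.amazonaws.com:6379',
                ],
            ],
        ],
    ],
];
```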

Gelmo commented 3 years ago

It's a bit concerning that the Magento team has not made an announcement related to this issue.

We are seeing a significant increase in outgoing network usage from the Redis instance used for cache in most of our client environments running 2.3.7 or 2.4.2. Sites that previously had a maximum output of 200 Mbps from the cache Redis instance are now experiencing over 1 Gbps at times since upgrading, and these are relatively small sites. One of our larger clients is now going above 5 Gbps on most days. We have been able to reduce the impact by disabling the Magento_CSP module in some cases; however, the overall outgoing throughput is still significantly higher than prior to the upgrades.

It would be great if someone from Magento/Adobe could acknowledge this issue and confirm that it is being worked on. While this may not have a major impact on Adobe Cloud customers, the impact is significant for AWS clients due to the increased billing associated with network usage. I can only imagine how many bare-metal 2.4.2 environments are in the wild with a NIC that only supports 1 Gbps.
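For anyone who wants to try the Magento_CSP workaround mentioned above: disabling the module is normally done with `bin/magento module:disable Magento_CSP`, which simply flips the module flag in app/etc/config.php, roughly as sketched below. Test this carefully, since other code may depend on the module.

```php
<?php
// app/etc/config.php (normally managed via `bin/magento module:disable Magento_CSP`)
return [
    'modules' => [
        // ... other modules ...
        'Magento_CSP' => 0,  // 0 = disabled, 1 = enabled
        // ... other modules ...
    ],
];
```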

jonathanribas commented 3 years ago

Hi there, we finally don't feel alone in this scenario anymore! Thanks for opening an issue!

As we have decided to be resilient, we run Adobe Commerce on AWS EC2 across several Availability Zones within a region. We have seen a huge impact on the famous intra-regional Data Transfer line of the AWS bill.

We have 20 store views (and growing), run FPC with Redis.

We have decided to dig into the topic to check for improvements.

Maybe these could be future improvements for Adobe Commerce?

I hope you guys won't tell us to use Varnish in order to decrease this Data Transfer chatter between Adobe Commerce and Redis.

mrtuvn commented 3 years ago

Not sure, but Magento has already updated the Redis dependencies in composer.json (latest code):

"colinmollenhour/cache-backend-file": "~1.4.1",
"colinmollenhour/cache-backend-redis": "^1.14",
"colinmollenhour/credis": "1.12.1",
"colinmollenhour/php-redis-session-abstract": "~1.4.0",

Not sure how much this is affected by, or related to, this issue.

Version 2.4.3:

"colinmollenhour/cache-backend-file": "~1.4.1",
"colinmollenhour/cache-backend-redis": "1.11.0",
"colinmollenhour/credis": "1.11.1",
"colinmollenhour/php-redis-session-abstract": "~1.4.0",

https://github.com/magento/magento2/blob/2.4-develop/composer.json If not, I think we are still open to pull requests for such a case.

magenx commented 3 years ago

Is this issue related to a single Redis instance or a cluster? Has anyone tried connecting a Redis auditor/profiler to see what it's doing inside?

https://devdocs.magento.com/guides/v2.3/release-notes/release-notes-2-3-5-open-source.html#performance-boosts https://devdocs.magento.com/guides/v2.4/release-notes/release-notes-2-4-0-open-source.html#performance-improvements
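One low-effort way to audit this without extra tooling is to read Redis's own network counters. Below is a sketch using the credis client that Magento already ships; treat it as an assumption to verify, since the INFO field names (total_net_input_bytes / total_net_output_bytes) depend on the Redis version, and in pure-PHP mode Credis may hand back INFO as a raw string.

```php
<?php
// Sketch: read Redis's cumulative network byte counters (host/port are placeholders).
require __DIR__ . '/vendor/autoload.php';

$redis = new Credis_Client('127.0.0.1', 6379);
$info = $redis->info();

// Some client modes return INFO as a raw "key:value" text blob; normalize it to an array.
if (!is_array($info)) {
    $parsed = [];
    foreach (preg_split('/\r?\n/', (string) $info) as $line) {
        if (strpos($line, ':') !== false) {
            [$key, $value] = explode(':', $line, 2);
            $parsed[$key] = trim($value);
        }
    }
    $info = $parsed;
}

printf("total_net_input_bytes:  %s\n", $info['total_net_input_bytes'] ?? 'n/a');
printf("total_net_output_bytes: %s\n", $info['total_net_output_bytes'] ?? 'n/a');
```

Sampling these two counters a few minutes apart gives a rough bytes-per-second figure to compare against the ElastiCache NetworkBytesOut metric.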

jonathanribas commented 3 years ago

I have also noticed that the SYSTEM Redis cache key for 16 websites weighs 10MB. After the M2 core serializes and encrypts it, it weighs 14MB! Core source code for this: https://github.com/magento/magento2/blob/2.4-develop/app/code/Magento/Config/App/Config/Type/System.php#L338

If we assume that custom configuration values such as passwords are already encrypted (payments, ...), I don't understand why we encrypt the whole thing again. We lose precious time here serializing/encrypting, and later decrypting, every time those keys are read... Do you guys know the reason why this SYSTEM Redis cache key is encrypted?

With a key of that size, it may also explain issues with parallel generation...
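For readers trying to follow why the payload grows: the pattern being described is essentially "serialize the whole config array, encrypt the result, cache the ciphertext", and the reverse on every load. The snippet below is a simplified illustration of that shape using the generic framework interfaces; it is not the actual code in System.php, and the cache key and tag names are made up for the example.

```php
<?php
// Simplified illustration only; see System.php (linked above) for the real implementation.
use Magento\Framework\App\CacheInterface;
use Magento\Framework\Encryption\EncryptorInterface;
use Magento\Framework\Serialize\SerializerInterface;

function saveSystemConfigToCache(
    array $config,
    SerializerInterface $serializer,
    EncryptorInterface $encryptor,
    CacheInterface $cache
): void {
    $serialized = $serializer->serialize($config);   // already large for many websites/stores
    $encrypted  = $encryptor->encrypt($serialized);  // encryption + encoding adds further overhead
    $cache->save($encrypted, 'illustrative_system_config_key', ['illustrative_config_tag']);
}

function loadSystemConfigFromCache(
    SerializerInterface $serializer,
    EncryptorInterface $encryptor,
    CacheInterface $cache
): ?array {
    $payload = $cache->load('illustrative_system_config_key');
    if ($payload === false) {
        return null;  // cache miss: the caller rebuilds the config and saves it again
    }
    // Every read pays for decrypting and unserializing the full blob.
    return $serializer->unserialize($encryptor->decrypt($payload));
}
```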

vzabaznov commented 2 years ago

Hey guys, thank you for reporting. Please consider using the L2 cache: https://devdocs.magento.com/guides/v2.4/config-guide/cache/two-level-cache.html

jonathanribas commented 2 years ago

L2 cache was a disaster on our Kubernetes cluster, with really bad performance. We will give it a try once again.

theozzz commented 2 years ago

@jonathanribas any update on your issue / pain point? Thanks for your answer

jonathanribas commented 2 years ago

Hi @theozzz, unfortunately we haven't had time to test the L2 cache again. Are you experiencing the same issue with high data transfer?

theozzz commented 2 years ago

@jonathanribas thanks for your answer.

We are also experiencing some Redis transfer slowness (we noticed it in New Relic), especially when traffic is high. The platform has 32 stores and 22 websites.

Preloading keys does not seem to have any "big" impact for us.

ryanisn commented 2 years ago

@theozzz do you have Redis and the PHP servers in different regions, or are they all in the same region?

igorwulff commented 2 years ago

Hi @jonathanribas, we've been using L2 caching with good success on Magento Cloud environments, in combination with preloading, with catalogs stretching beyond 300k products, many customer groups on our side, etc.

It seems you are not using Varnish as an FPC? I would strongly recommend using Varnish (or Fastly) in your setup.

We have been forced to remove unnecessary cache tags through our own cache tag optimizer module in specific places, just to avoid too many cache invalidations all over the place; we created an Optimized Layout Handle module (to merge layout caches where possible) and removed various modules that are part of Magento core but were simply bad for performance and not needed. Not to mention a large number of patches to fix issues in Magento core that we have found over the last few years. We do try to PR them here sometimes, but the code review process on the Magento repo is incredibly tedious and slow.

Your problem is probably a combination of these things, and Magento themselves are in no hurry to fix them. They have been major issues since Magento 1, especially the layout handle optimizations: layout handles hit Redis with really high memory usage when you have a lot of category and product pages (the number of HGET calls would still remain the same, though).

Nuranto commented 2 years ago

We are using L2 caching also. It gives good performance results, but we often have issues with stale local cache entries that do not get invalidated as they should. We are actually considering disabling it...

engcom-Lima commented 2 years ago

Hi @onlinebizsoft ,

Thank you for reporting the issue and collaboration.

Can you please provide us with the exact steps to reproduce the issue? Also, can you please confirm whether the issue is still reproducible on the latest 2.4-develop instance?

Thanks !!

m2-assistant[bot] commented 2 years ago

Hi @engcom-Lima. Thank you for working on this issue. In order to make sure that issue has enough information and ready for development, please read and check the following instruction: :point_down:

onlinebizsoft commented 2 years ago

> Hi @onlinebizsoft,
>
> Thank you for reporting the issue and collaboration.
>
> Can you please provide us with the exact steps to reproduce the issue? Also, can you please confirm whether the issue is still reproducible on the latest 2.4-develop instance?
>
> Thanks !!

This is not the kind of issue that can be reproduced with simple steps. I advise you to get your team to look at it and verify. There has been quite a lot of discussion in this thread confirming the issue.

Nuranto commented 2 years ago

Hi @onlinebizsoft

Do you use remote storage (AWS S3, MinIO, ...)? If so, maybe part of the issue is linked to https://github.com/magento/magento2/issues/35820

That could also match @Gelmo's comment, since the remote storage module came with 2.4.2, cf. https://devdocs.magento.com/guides/v2.4/config-guide/remote-storage/config-remote-storage.html

(There's also https://github.com/magento/magento2/issues/35839, which forces us to use a standalone Redis architecture, or the undocumented load_from_slave option, which is probably less performant than using Sentinel.)

onlinebizsoft commented 2 years ago

@Nuranto no, we don't use remote storage. We only use NFS, and we have a CDN in front.

Nuranto commented 2 years ago

Ok, too bad, it could have explained this one.

Nuranto commented 2 years ago

@onlinebizsoft Another possible cause is detailed here: https://github.com/magento/magento2/issues/34758 (the issue is not only PWA-related).

Another read: https://www.integer-net.com/resolving-session-bottleneck-magento-ajax-requests/

We are seeing such issues in our projects for AJAX calls such as search requests. In some of these, there are hundreds of calls to Redis just to check whether the request can acquire the session lock... You can try one of the solutions: disable locking (but it is not very safe), or try the integer-net extension. We'll try the second one on our side.

onlinebizsoft commented 2 years ago

@Nuranto our system is using "disable locking" already.

m2-assistant[bot] commented 1 year ago

Hi @engcom-Hotel. Thank you for working on this issue. In order to make sure that issue has enough information and ready for development, please read and check the following instruction: :point_down:

engcom-Hotel commented 1 year ago

Hello @onlinebizsoft,

Are you still facing this issue? Can you please try to reproduce the issue in the latest 2.4-develop branch and let us know if the issue is still reproducible for you?

Thanks

engcom-Hotel commented 1 year ago

Dear @onlinebizsoft,

We have noticed that this issue has not been updated for a period of 14 Days. Hence we assume that this issue is fixed now, so we are closing it. Please raise a fresh ticket or reopen this ticket if you need more assistance on this.

Regards

onlinebizsoft commented 1 year ago

@mrtuvn @engcom-Hotel @vzabaznov can you get this reopened?

mrtuvn commented 1 year ago

Reopening, since the ticket author has responded. Can you confirm whether the problem is still reproducible, @onlinebizsoft? I'm not sure what clear reproduction steps QC could follow to reproduce such a case in automated infrastructure tests.

onlinebizsoft commented 1 year ago

The problem still exists, but we don't have any deeper information. It has been confirmed by quite a few users: https://community.magento.com/t5/Magento-2-x-Technical-Issues/Significant-increase-in-outgoing-network-usage-from-Redis-cache/td-p/481801

Remember that in our case we have up to 100 stores (but the traffic is not that of 100 busy websites), and here is NetworkBytesOut from Redis:

[screenshot: ElastiCache NetworkBytesOut metric]

P.S.: Again, we are always on the latest Magento version, and we are using Varnish for full-page cache.

onlinebizsoft commented 1 year ago

Another related issue https://github.com/magento/magento2/issues/21334

jonathanribas commented 1 year ago

> I have also noticed that the SYSTEM Redis cache key for 16 websites weighs 10MB. After the M2 core serializes and encrypts it, it weighs 14MB! Core source code for this: https://github.com/magento/magento2/blob/2.4-develop/app/code/Magento/Config/App/Config/Type/System.php#L338
>
> If we assume that custom configuration values such as passwords are already encrypted (payments, ...), I don't understand why we encrypt the whole thing again. We lose precious time here serializing/encrypting, and later decrypting, every time those keys are read... Do you guys know the reason why this SYSTEM Redis cache key is encrypted?
>
> With a key of that size, it may also explain issues with parallel generation...

On our side we have removed the encryption/decryption of the config cache and the results are really good! We have reduced our Data Transfer bill by around 30 to 40%!

onlinebizsoft commented 1 year ago

@jonathanribas so it looks like most of the data transfer is because of the SYSTEM Redis cache key?

jonathanribas commented 1 year ago

@onlinebizsoft yes, it is. There is a lot of duplication inside the saved keys; it's hard to touch anything here without breaking something, as the whole of Adobe Commerce / Magento relies on this caching system.

onlinebizsoft commented 1 year ago

@igorwulff did you make any separate measurement for the preload keys in Redis? From what I can see, this brings no improvement for me, and it seems rather useless to collect small Redis keys one by one to save a few Redis calls out of the 100-200 Redis calls made for each page.

What do you think? Or is there something I don't understand about this feature?
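For context on what preloading actually does: it issues one batched fetch at the beginning of the request for a fixed list of keys named in env.php, so it can only save round trips for those specific keys, not for the hundreds of other cache reads a page makes. A sketch of the configuration; the id_prefix and the key names below are the examples from the devdocs guide and have to match your own instance:

```php
<?php
// app/etc/env.php (sketch; '061_' must match your instance's cache id_prefix)
return [
    // ... other settings ...
    'cache' => [
        'frontend' => [
            'default' => [
                'id_prefix' => '061_',
                'backend' => '\\Magento\\Framework\\Cache\\Backend\\Redis',
                'backend_options' => [
                    'server' => 'elasticache-endpoint',  // placeholder
                    'port' => '6379',
                    // keys fetched in a single batched call at the start of the request
                    'preload_keys' => [
                        '061_EAV_ENTITY_TYPES:hash',
                        '061_GLOBAL_PLUGIN_LIST:hash',
                        '061_DB_IS_UP_TO_DATE:hash',
                        '061_SYSTEM_DEFAULT:hash',
                    ],
                ],
            ],
        ],
    ],
];
```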

jonathanribas commented 1 year ago

@onlinebizsoft, if you use AWS and more than one Availability Zone inside your region, take a look at this AWS notification. It should help reduce Data Transfer between zones of the same region.

> We have observed that your Amazon VPC resources are using a shared NAT Gateway across multiple Availability Zones (AZ). To ensure high availability and minimize inter-AZ data transfer costs, we recommend utilizing separate NAT Gateways in each AZ and routing traffic locally within the same AZ.
>
> Each NAT Gateway operates within a designated AZ and is built with redundancy in that zone only. As a result, if the NAT Gateway or AZ experiences failure, resources utilizing that NAT Gateway in other AZ(s) also get impacted. Additionally, routing traffic from one AZ to a NAT Gateway in a different AZ incurs additional inter-AZ data transfer charges. We recommend choosing a maintenance window for architecture changes in your Amazon VPC.

max-grosch commented 1 year ago

I use this and it works well: https://github.com/Genaker/FastFPC