progressivetech / net.ourpowerbase.sumfields

The Summary Fields extension creates read-only custom data fields that extend Contact and are automatically populated with up-to-date totals such as total lifetime contributions, last contribution amount, last attended event, etc.
Other
8 stars 29 forks source link

Summaryfields 4.x on large installations #53

Closed VangelisP closed 5 years ago

VangelisP commented 5 years ago

Hello !

Due to the great work (and effort) being done here: https://github.com/progressivetech/net.ourpowerbase.sumfields/issues/39 in order to make it more modular and precise, i do experience huge performance issues on recalculating the Summary Fields after this feature was been introduced. Even during extension upgrade (or just by calling drush cvapi SumFields.Gendata), in all my cases i had timeouts via the UI (which of course makes sense as the upgrade hook is requested to recreate all the data) but also huge processing times via the console. On a specific case right now, i had the calculator query running for over 1h and still haven't finished.

I got an example scenario with the following data: a CiviCRM 5.12.0 installation with 1 million contributions, 280k contacts, 20k memberships.

Same database on 2 targets: 1) On client's premises, decent-tuned mysql (mariadb) with a decent system configuration, process is already running for 5500 seconds and still going. 2) On my side, highly fine-tuned mysql (mariadb) on a good server configuration, process finished to approximately 3800 seconds (63 minutes).

I was examining the new approach, in custom.php here

and i tried instead of the (SELECT contact_id FROM civicrm_contribution WHERE id = NEW.contribution_id) to provide just this: c.contact_id as the contact_id is already in the JOIN statement 2 lines below: JOIN civicrm_contribution AS c ON trigger_table.contribution_id = c.id

This change decreased the query on my server instance to just a little bit over 20 minutes in total, but still times out on my client's server.

I'm still trying to find a way to decrease the time to nearly the original levels (before the line items calculation).

Looks like it's working as it should (brings the contact_id as needed) and i can't think/see of something breaking, all my calculations so far are good (at least on a random sampling).

Do i miss something with this change that i did?

Is there any suggestion of an alternative way (more efficient) to produce the results?

Cheers, Vangelis

UPDATE: Replacing the (SELECT contact_id FROM civicrm_contribution WHERE id = NEW.contribution_id) to provide this c.contact_id makes small difference in the query execution times unfortunately.

jmcclelland commented 5 years ago

Thanks for opening the ticket. There are two distinct places where the queries are run: one is set via triggers and the other runs when you first initialize the data to populate the summary fields table.

The performance problems you are addressing seem to be happening when the data is initialized, not the triggers.

However, that calculated_contact_id, I believe, is only used when the triggers are being executed, not during initialization.

In any event... I have gotten other complaints about the version using line items having performance problems when initialized. I think part of the problem is that it now goes through the line item table which can be signficantly larger than the contribution table. There's not much to do about that. But, maybe there are still some optimizations we could do?

Open to suggestions!

VangelisP commented 5 years ago

Thanks for your input on this!

Indeed, this civicrm_line_item table is big and as far as I see it's already being optimized. The obvious would be to have a contact_id inside there as it would solve many issues but it's something that is related to the core unfortunately.

I'm using cron based calculations as triggers seemed to cause big delays and timeouts when they were doing batch imports or using CiviBanking to process hundreds or thousands of contributions.

I am still looking at it on my spare time, trying to figure out how i can optimize the process.

Thanks again for your feedback!

VangelisP commented 5 years ago

Hello again.

As I wasn't able to resolve or optimize the issue effectively and while i keep getting more and more large installations, at this point i have 2 options:

I could certainly work on a patch for Option B if you want.

What is your opinion @jmcclelland ?

jmcclelland commented 5 years ago

Option B seems reasonable.

In terms of executing it... in custom.php all the fields (and sql queries) are defined. For each of the ones using a line item table, I would suggest creating a second one with a different name, that just uses the contribution table. For example, there is contribution_total_lifetime - so you would add contribution_total_lifetime_simplified that uses the same data from the earlier version that doesn't use the line item tables.

Then, in the UI, by default only the line item fields would be displayed and available to choose. But, if you click the option to show the simplified fields, some javascript code would hide the line item fields and show the simplified ones.

How does that sound to you?

VangelisP commented 5 years ago

Sounds nice! I will start working on a branch for this!

Thank you!

VangelisP commented 5 years ago

Good day @jmcclelland I have provided a PR that does what was described: https://github.com/progressivetech/net.ourpowerbase.sumfields/pull/58

My only concern is that if an existing installation is already using smartgroups with summaryfields that are already there, they will need to delete the smartgroups first and then recreate their simplified summaryfields based on the new values.

Another thing to point out is that if you try to check the newly introduced checkbox, all previous checkboxes of the contribution fields will be unchecked by force. My initial idea was to do that because we don't want fields from both sources, although i am not sure if that idea makes sense.

jmcclelland commented 5 years ago

I just merged your changes in. I've made the default to not display the simplified fields by default. Let me know if you see any problems. Thanks again for your contribution.