magento / architecture

A place where Magento architectural discussions happen
275 stars 155 forks source link

Proposal to refactor the customer and address import depending on AUTO_INCREMENT and showTableStatus #322

Closed larsroettig closed 4 years ago

larsroettig commented 4 years ago

Current usage in customer import (Open Source Edition)

Ideas of Refactoring

1. Use Sequence implementation from Framework

Every Sequence follows the api Magento\Framework\DB\Sequence\SequenceInterface

The benchmark consists of importing 1000 customers an median of 3 measurements.

The benchmarking process is as follows:

  1. Delete all customers
  2. In the admin section navigate to System > Import
  3. In the field Entity Type select "Customers Main File"
  4. In the field Import Behavior select "Add/Update Complex Data."
  5. In the field Select File to Import upload the file containing customer data
  6. Hit button Check Data
  7. Start measuring time
  8. Hit button Import
  9. Stop measuring time
  10. Start over

System setup:

Unmodified

Concept 1 - Use class CustomerSequence implements SequenceInterface implementation

protected function _getNextEntityId()
{
    return $this->sequence->getNextValue();
}   

Advantage:

Disadvantage:

Concept 2 - Magento\Framework\EntityManager\MetadataPool implementation

protected function _getNextEntityId()
{
    return $this->metadataPool->getMetadata(CustomerInterface::class)->generateIdentifier();
}

di.xml:

<type name="Magento\Framework\EntityManager\MetadataPool">
    <arguments>
        <argument name="metadata" xsi:type="array">
            <item name="Magento\Customer\Api\Data\CustomerInterface" xsi:type="array">
                <item name="sequenceTable" xsi:type="string">sequence_customer</item>
            </item>
        </argument>
    </arguments>
</type>     

Advantage:

Disadvantage:

Summary Concept 1 and Concept 2

Both concepts add two queries per new customer.

If we now think about to import 200.000 new customers on AWS, it is much slower than before. For those reasons, we need a concept that helps us to optimize the queries that we send to MySQL.

Concept 3 - optimize queries

We need to split the imported bunch by if already know the customer, this we can reach with a query like.

To implement this, we need to move magento2ce/app/code/Magento/CustomerImportExport/Model/Import/Customer.php:537- 555 in a two-service class.

SELECT entity_id, email
  FROM customer_entity
 WHERE CONCAT(email,website_id) IN (
     'malcolm85@gmail.com1',
     'khuel@yahoo.com1',
     'newcustomer@yahoo.com1'
);

Result:

entity_id email
4290 khuel@yahoo.com
4288 malcolm85@gmail.com

The queried data helps to find all customers that we already have created in Magento. With this Information we can refactor the protected function _importData()

while ($bunch = $this->_dataSourceModel->getNextBunch()) {
    $this->prepareCustomerData($bunch);
    $entitiesToCreate = [];
    $entitiesToUpdate = [];
    $entitiesToDelete = [];
    $attributesToSave = [];

    $bunchDataByMail = [];
    $customerAddresses = [];
    foreach ($bunch as $rowNumber => $rowData) {
        if (!$this->validateRow($rowData, $rowNumber)) {
            continue;
        }
        if ($this->getErrorAggregator()->hasToBeTerminated()) {
            $this->getErrorAggregator()->addRowToSkip($rowNumber);
            continue;
        }
        $email = $rowData[self::COLUMN_EMAIL];
        $customerAddresses[] = $email.$rowData['website_id'];
        $bunchDataByMail[$email] = $rowData;
    }

    $query = $this->_connection->select()->from(
        'customer_entity',
        ['entity_id', 'email']
    )->where('CONCAT(email,website_id) in (?)', $customerAddresses);

    if ($this->getBehavior($rowData) == Import::BEHAVIOR_DELETE) {
        $entitiesToDelete = $this->_connection->fetchCol($query, 'entity_id');
    } elseif ($this->getBehavior($rowData) == Import::BEHAVIOR_ADD_UPDATE) {
        $entitiesToUpdate = $this->_connection->fetchAll($query, 'email');

        /* should filter $validBunchData[$email] by $entitiesToUpdate and split them in two arrays $entitiesToUpdate and $entitiesToCreate*/
    }

With these two arrays, we can use the row by email to create the data to import.

To generate new ids, I recommend to add functions:

Sequence

public function getNextValueBunch(int $count): array
{
    $currentValue = $this->getCurrentValue();
    $nextvalue = $currentValue + $count;
    $this->resource->getConnection($this->connectionName)
        ->insert($this->resource->getTableName($this->sequenceTable), ['sequence_value' => $nextValue]);
    return range($currentValue, $nextValue);
}

EntityMetadata

public function generateIdentifierBunch(int $count) : array
{
    $nextIdentifiers = [];
    if ($this->sequence) {
        $nextIdentifiers = $this->sequence->getNextValueBunch($count);
    }
    return $nextIdentifiers;
}

Advantage:

Concept 4 - Stored Function to generate Sequences

I use as a base the following technical Article https://www.percona.com/blog/2008/04/02/stored-function-to-generate-sequences/

A query that I run before I started performance testing:

CREATE TABLE `sequence` (
  `type` varchar(20) NOT NULL,
  `value` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO sequence VALUES ('customer', 1);

DELIMITER //
CREATE FUNCTION sequence(seq_type char (20)) RETURNS int
BEGIN
 UPDATE sequence SET value=last_insert_id(value+1) WHERE type=seq_type;
 RETURN last_insert_id();
END
//
DELIMITER ;

Implementation of _getNextEntityId()

protected function _getNextEntityId()
{
  return $this->_connection->query("select sequence('customer')")->fetchColumn();
}

Summary

For all four concepts, we need to have a migration from customer_entity to sequence table idea like migrateSequneceColumnData(customer_entity,entity_id). I think we will get the most benefit with the implementation of Concept 3 because we are sending fewer queries to the database.
All concepts can be implemented in a backward-compatible way because we only touch constructors and protected functions for import.

larsroettig commented 4 years ago

@akaplya @fascinosum can review it?