le-phare / import-bundle

Symfony bundle to import CSV files into a database
MIT License
9 stars 0 forks source link

Question about the validity of my data in a PRE_COPY event #15

Closed Saami783 closed 6 hours ago

Saami783 commented 1 week ago

Hello,

I want to add checks on the validity of my data in a PRE_COPY event. Specifically, I want to verify if my data has the correct expected format, if there are no null values in any fields, etc. However, I do not understand how the bundle operates when I perform these checks. Should I retrieve the data from the temporary import table, perform my checks on this data, then truncate the table to insert the valid data back into the same table, and finally let the bundle handle the copying?

thislg commented 1 week ago

See my example with valid IS true copy condition in issue https://github.com/le-phare/import-bundle/issues/2. In a POST_LOAD event subscriber you can set the valid boolean to false or true based on your custom logic. A common use case is running SQL queries on your temp table to validate the data. Then let the bundle handle the copying: lines with valid values set to false won't be copied.

Saami783 commented 1 week ago

Or, can we validate the current occurrence? For example, the bundle checks the first occurrence in the table, then my program checks the same row immediately afterward, and so on until there are no more data to validate. This would save me from having to retrieve all the data and manipulate the import database.

Saami783 commented 1 week ago

See my example with valid IS true copy condition in issue #2. In a POST_LOAD event subscriber you can set the valid boolean to false or true based on your custom logic. A common use case is running SQL queries on your temp table to validate the data. Then let the bundle handle the copying: lines with valid values set to false won't be copied.

Ok thank you for your answer, so I have an extra_field isValid like this in my configuration :

resources:
  import:
    tablename: scratch.import_file_tmp
    load:
      add_file_line_number: true
      # strategy: https://github.com/le-phare/import-bundle/blob/master/docs/configure/load.md#strategy
      pattern: '.*poke.*'
      format_options:
        with_header: true
        field_delimiter: "\t"
        line_delimiter: "\n"
      fields: # https://github.com/le-phare/import-bundle/blob/master/docs/configure/load.md#:~:text=fields%20et%20extra_fields-,fields,-and%20extra_fields%20have
        NAME: ~
        family: ~
        pv: ~
        pf: ~
        pm: ~
      extra_fields: # https://github.com/le-phare/import-bundle/issues/2
        calcul: integer
        isValid:
          type: boolean
          options:
            default: NULL

    copy:
      target: poke
      strategy: insert # https://github.com/le-phare/import-bundle/blob/master/docs/configure/copy.md#:~:text=MySQL%2C%20PostgreSQL%209.5%2B-,Executes,-an%20UPDATE%20in
      strategy_options:
        copy_condition: isValid IS TRUE
        distinct: true
        conflict_target: name
        non_updateable_fields: ["name"]
      mapping:
        name: name
        family: family
        pv: pv
        pf: pf
        pm: pm

And now, assuming I want to validate my data where the pv field must not be null and it is an integer, how can I do this in my POST_LOAD event? Without considering the possible options in the configuration file such as:

<field_1_name>:
    type: string
    options:
        notnull: false
class OnPostLoadSubscriber implements EventSubscriberInterface
{

    public function __construct(private FormFactoryInterface $formFactory,
                                private Connection $connection,
                                private LoggerInterface $logger) {
    }

    /**
     * @throws Exception
     */
    public function onImportPostLoad($event): void
    {
        $config = $event->getConfig()->get('resources')['import']->getConfig();

        $tableName = explode('.', $config['tablename']);
        $tableName = $tableName[1];

        $sql = "SELECT * FROM {$tableName}";

        $this->connection->beginTransaction();

        $stmt = $this->connection->executeQuery($sql);
        $entities = $stmt->fetchAllAssociative();

        foreach ($entities as $entityData) {
            $fileLineParts = explode(':', $entityData['file_line_no']);
            $id = $fileLineParts[2];
            $poke = new Poke();

            $poke->setId((int) $id);
            $poke->setName($entityData['name']);
            $poke->setFamily($entityData['family']);
            $poke->setPv($entityData['pv']);
            $poke->setPf($entityData['pf']);
            $poke->setPm($entityData['pm']);
            $poke->setCalcul($entityData['calcul']);

            echo $poke . "\n";

            $form = $this->formFactory->create(PokeType::class, $poke);
            $form->submit($poke);

            if ($form->isSubmitted() && $form->isValid()) {
                $sqlUpdate = "UPDATE {$tableName} SET isvalid = :isvalid WHERE file_line_no LIKE :file_line_no";
                $this->connection->executeQuery($sqlUpdate, [
                    'isvalid' => 1,
                    'file_line_no' => "%:" . $id,
                ]);
            } else {
                $errors = (string) $form->getErrors(true, false);
                $this->logger->alert("Validation errors for Poke ID {$id}: " . $errors . "\n");
            }
        }
        $this->connection->commit();
        $this->connection->close();
    }
    public static function getSubscribedEvents(): array
    {
        return [
            'import.post_load' => 'onImportPostLoad',
        ];
    }
}

In order to understand the validation logic and enforce my constraints (using Symfony forms, etc.), do I need to retrieve all the data from the table with an SQL query and validate them one by one, ensuring to indicate whether isValid = true or false?

thislg commented 4 days ago

Your method of using Symfony Form requires iterating over the data which can be slow and inefficient. For simple checks without logging you can use copy_condition directly: copy_condition: temp.pv IS NOT NULL AND pg_input_is_valid(temp.pv, 'integer'). There is another way by using an UPDATE query to set your isvalid flag for the whole table in your event subscriber:

UPDATE {$tableName} SET isvalid = FALSE WHERE temp.pv IS NULL OR !pg_input_is_valid(temp.pv, 'integer')

Then if you want to log invalid lines, use SELECT file_line_no FROM scratch.import_file_tmp WHERE isvalid = FALSE request to retrieve them and call your logger.

Saami783 commented 1 day ago

Ok thank you for your answer @thislg

Saami783 commented 7 hours ago

Is this available on MariaDB & MySQL? Because here, the prefix pg seems to indicate PostgreSQL ... and I'm not sure if the other engines offer it.

thislg commented 6 hours ago

I gave you an example in Postgresql but you can use any function available in your DBMS.