rap2hpoutre / fast-excel

🦉 Fast Excel import/export for Laravel
MIT License

[Enhancement] Import big data using chunks #243

Open mcolominas opened 2 years ago

mcolominas commented 2 years ago

Following up on what I commented in #162, this package is missing an option to import using chunks.

Unfortunately, that could be considered normal. TBH, this lib is super optimized for exporting millions of rows (https://dev.to/rap2hpoutre/export-10m-rows-in-xlsx-with-laravel-without-memory-issues-6bk), but importing still has plenty of room for improvement! Currently, each row is read from the file sequentially and then returned in a variable you can use (the big collection). If you want to preserve memory, you could avoid returning the variable and write your entries directly to the database:

(new FastExcel)->import('file.xlsx', function ($line) {
    return User::create([
        'name' => $line['Name'],
        'email' => $line['Email']
    ]);
});

Still, it will not reduce the import time, and you might run into memory issues anyway. I guess we could find a way to import big files while preserving memory, but I have no idea right now. Maybe you could use Spout (https://opensource.box.com/spout/), which is the lib used under the hood to make FastExcel work.
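For reference, this is roughly what reading the file row by row with Spout directly looks like. It is only a minimal sketch, assuming box/spout ^3.0 and its ReaderEntityFactory (older versions use a different factory); it bypasses FastExcel entirely:

use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;

$reader = ReaderEntityFactory::createXLSXReader();
$reader->open('file.xlsx');

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        // Only the current row is in memory; handle or insert it here.
        $values = $row->toArray();
    }
}

$reader->close();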

The import approach suggested by @rap2hpoutre is not very optimal: inserting record by record consumes a lot of time and resources. To improve this, try inserting 5 to 7 thousand records at a time. This library does not provide anything to insert in bulk using chunks; ideally it would offer something like this:

(new FastExcel())->chunk(7000)->import($file_name, function ($chunk_data) {
    //$chunk_data = 7000 records
});
//Or
(new FastExcel())->importChunk($file_name, 7000, function ($chunk_data) {
    //$chunk_data = 7000 records
});

Since this function does not exist, you have to build it yourself, for example something like this:

dispatch(function () {
    // Bulk-insert the buffered rows and reset the buffer.
    $insertData = function (&$chunk_data) {
        TestExports::insert($chunk_data);
        $chunk_data = [];
    };

    $file_name = base_path('test_data.xlsx');
    $count = 0;
    $chunk_data = [];

    (new FastExcel())->import($file_name, function ($data) use (&$count, &$chunk_data, $insertData) {
        $chunk_data[] = $data;
        if (++$count % 7000 == 0)
            $insertData($chunk_data);
    });

    // Insert whatever is left in the last, partial chunk.
    $insertData($chunk_data);
});

This piece of code bulk-inserts the data in chunks of 7,000 records, and it runs as a queued job so the user does not have to sit on a waiting page.

Instead of using dispatch I would create a dedicated Job class to hold all of this code; what I showed above is just a quick example of importing a file with little time and few resources.
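A minimal sketch of what such a Job class could look like, under the same assumptions as the snippet above (ImportTestDataJob is a made-up name and TestExports is the placeholder model from the example):

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Rap2hpoutre\FastExcel\FastExcel;

class ImportTestDataJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    private $fileName;
    private $chunkSize;

    public function __construct(string $fileName, int $chunkSize = 7000)
    {
        $this->fileName = $fileName;
        $this->chunkSize = $chunkSize;
    }

    public function handle()
    {
        $chunkData = [];

        (new FastExcel())->import($this->fileName, function ($row) use (&$chunkData) {
            $chunkData[] = $row;

            // Bulk-insert once the buffer reaches the chunk size.
            if (count($chunkData) >= $this->chunkSize) {
                TestExports::insert($chunkData);
                $chunkData = [];
            }
        });

        // Insert whatever is left in the last, partial chunk.
        if ($chunkData !== []) {
            TestExports::insert($chunkData);
        }
    }
}

// Dispatched from anywhere, e.g. a controller:
// ImportTestDataJob::dispatch(base_path('test_data.xlsx'));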

I generated an Excel file with 315,141 records; the import took 82 s and used 97.27 MB of RAM. Considering that the Excel file itself is only 9 MB, that still seems like a lot of memory to me, but it is much better than the 8 GB you mention.

We must also take into account that, on top of this time, validation and manipulation of the data has to be added. In this example I imported a clean Excel file, so I skipped those two steps, but performing them may increase the time.

The idea would be to implement a function that receives 4 parameters:

public function importUsingChunk($path, int $chunk_size, callable $callback_rows, ?callable $callback_map = null){}
//$path: Path of the file to import.
//$chunk_size: Chunk size.
//$callback_rows($chunk_rows): Receives an array containing the rows of the current chunk; this is where the data gets inserted into the database.
//$callback_map($row): Receives a single row, used to validate and/or modify its values. I added this callback to improve performance and avoid an extra foreach over the rows of each chunk.
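Until something like that exists in the package, the proposed behaviour can be approximated in userland by wrapping the existing import() callback. This is only a sketch of the idea: importUsingChunk is not part of FastExcel today, and TestExports is the placeholder model from the example above.

use Rap2hpoutre\FastExcel\FastExcel;

function importUsingChunk(string $path, int $chunk_size, callable $callback_rows, ?callable $callback_map = null): void
{
    $chunk = [];

    (new FastExcel())->import($path, function ($row) use (&$chunk, $chunk_size, $callback_rows, $callback_map) {
        // Optional per-row validation/mapping before buffering.
        if ($callback_map !== null) {
            $row = $callback_map($row);
        }

        $chunk[] = $row;

        // Hand over a full chunk and reset the buffer.
        if (count($chunk) >= $chunk_size) {
            $callback_rows($chunk);
            $chunk = [];
        }
    });

    // Flush the last, possibly smaller, chunk.
    if ($chunk !== []) {
        $callback_rows($chunk);
    }
}

// Usage, mirroring the proposed signature:
importUsingChunk(base_path('test_data.xlsx'), 7000, function (array $rows) {
    TestExports::insert($rows);
});

The real win of a built-in importUsingChunk would of course be to stream rows straight from the reader without ever materializing the full collection, which a userland wrapper like this cannot fully guarantee.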
atymic commented 2 years ago

I'd love this feature. Happy to PR if others want?

Excelautomation commented 2 years ago

Is there a reason why you aren't just chunking and upserting/inserting after gathering the results? I prefer upsert because it works like updateOrCreate when the column has a unique constraint in the DB schema.

Like for instance:

$results = FastExcel::import(Storage::disk("public")->path($filename), function ($line) {
    return [
        'sku' => $line["Part Number"],
    ];
});

collect($results)
    ->chunk(10000)
    ->each(function ($chunk) {
        Product::upsert($chunk->toArray(), 'sku');
    });
atymic commented 2 years ago

If the spreadsheet is huge, this causes memory issues pretty quickly. I guess you could just raise the memory limit, but since the file is read line by line, it should be easy to just yield each chunk from a generator or similar :)
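Something like this is what I mean by the generator idea (just a sketch; chunkRows is a made-up helper and $rowIterator stands for any row source read line by line, e.g. Spout's row iterator):

function chunkRows(iterable $rows, int $size): \Generator
{
    $buffer = [];

    foreach ($rows as $row) {
        $buffer[] = $row;

        // Yield a full chunk and reset, so only one chunk lives in memory at a time.
        if (count($buffer) === $size) {
            yield $buffer;
            $buffer = [];
        }
    }

    if ($buffer !== []) {
        yield $buffer; // last partial chunk
    }
}

// Upsert each chunk as it is produced instead of collecting everything first.
foreach (chunkRows($rowIterator, 10000) as $chunk) {
    Product::upsert($chunk, 'sku');
}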

zieru commented 2 years ago

My company imports hundreds of CSVs every day. Currently I'm using goodby/csv, but I would like to switch to fast-excel. Unfortunately fast-excel always blows past the RAM limit (over 2 GB), whereas with goodby/csv I get by with only 64 MB of RAM just fine.
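For what it's worth, goodby/csv stays within that memory budget because it pushes each parsed line to an observer instead of building a collection. Roughly, based on its documented Lexer/Interpreter usage (the insert logic here is just a placeholder):

use Goodby\CSV\Import\Standard\Interpreter;
use Goodby\CSV\Import\Standard\Lexer;
use Goodby\CSV\Import\Standard\LexerConfig;

$lexer = new Lexer(new LexerConfig());
$interpreter = new Interpreter();

// Each CSV line is handed to the observer as soon as it is parsed,
// so rows never accumulate in memory.
$interpreter->addObserver(function (array $columns) {
    // validate/insert one row here
});

$lexer->parse('data.csv', $interpreter);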

I really hope fast-excel can do better with importing.