Why do this?
As stated previously, the intersect process performs a map-reduce in memory with data from the data packages (quick), but then performs another map-reduce against records already present in the Prosim-db, which were created during previous sliced imports. This second map-reduce is very heavy for MongoDB: even though it is driven record by record from memory, it involves a lot of reads, writes, document expansions, and indexing work in the db.
Hence it could be interesting to determine the maximum number of products for which a one-shot integration (in-memory map-reduces only) can be performed. This would save a considerable amount of time (no more scheduled tasks running twice an hour) and might allow the Prosim-db to be generated from scratch in less than a day instead of several days.
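The cost difference between the two phases can be sketched as follows. This is a hypothetical minimal model, not the actual intersect code: merging the incoming package is a pure in-memory reduction, while merging into an existing Prosim-db costs a read and a write per record (plus index maintenance in the real MongoDB case).

```python
# Hypothetical sketch of the two merge phases (not the real intersect code).

def reduce_in_memory(new_records):
    """Phase 1: map-reduce over the incoming data package, purely in memory (fast)."""
    merged = {}
    for rec in new_records:
        merged.setdefault(rec["product_id"], []).append(rec)
    return merged

def merge_into_db(merged, db):
    """Phase 2: for each in-memory record, read the existing Prosim-db entry,
    expand it, and write it back -- one round trip per key, plus index updates
    in the real MongoDB case (slow)."""
    for key, recs in merged.items():
        existing = db.get(key, [])      # read from the db
        db[key] = existing + recs       # write back the expanded entry

# With a one-shot import there is no pre-existing db content,
# so phase 2 degenerates to plain inserts.
db = {}
merge_into_db(reduce_in_memory([{"product_id": "p1"}, {"product_id": "p2"}]), db)
```

In a one-shot run the `db.get` in phase 2 always returns nothing, which is exactly why skipping the sliced imports should remove most of the MongoDB load.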
How to proceed?
The number of products with the appropriate non-empty tags for comparing products is limited to about 20% of the official OFF db:
about 110,000 out of 550,000 products
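The 20% figure can be re-checked against a fresh dump with a short script. This is a sketch under assumptions: it assumes a line-delimited JSON export of the OFF db (one product per line) and the field names "nutrition_score_uk" and "categories_tags" as used by feeder_1.

```python
import json

def count_comparable(path):
    """Count products usable for similarity comparison: both
    "nutrition_score_uk" and "categories_tags" must be non-empty.
    Assumes a line-delimited JSON export (one product per line)."""
    total = kept = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            product = json.loads(line)
            total += 1
            if product.get("nutrition_score_uk") not in (None, "") and product.get("categories_tags"):
                kept += 1
    return kept, total
```

Running this over the full dump should land near 110,000 kept out of 550,000 total if the 20% estimate still holds.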
Check what happens in terms of resources used (memory, disk speed/space, overall behaviour) if we create the Prosim-db in one shot by setting up the environment as follows:
feeder_1 has extracted all 110,000 products meeting the non-empty criteria for "nutrition_score_uk" and "categories_tags" => all_products.json
copy all_products.json into updated_products.json
in preparer/config.xml, set tags with these values:
\<width>120000\</width>
\<height>120000\</height>
\<stats_H_nb_products>number of products extracted into all_products.json\</stats_H_nb_products>
\<stats_W_nb_products>number of products extracted into all_products.json\</stats_W_nb_products>
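The steps above can be scripted so the stats tags always match the extracted file. This is a minimal sketch with assumptions: the function name is hypothetical, the root element of preparer/config.xml is assumed to be `<config>`, and all_products.json is assumed to be a JSON array of products.

```python
import json
import xml.etree.ElementTree as ET

def update_preparer_config(config_path, all_products_path):
    """Hypothetical helper: set width/height to 120000 and both stats tags
    to the actual number of products extracted into all_products.json.
    Assumes a <config> root and a JSON array of products."""
    with open(all_products_path, encoding="utf-8") as f:
        nb_products = len(json.load(f))

    tree = ET.parse(config_path)
    root = tree.getroot()
    root.find("width").text = "120000"
    root.find("height").text = "120000"
    root.find("stats_H_nb_products").text = str(nb_products)
    root.find("stats_W_nb_products").text = str(nb_products)
    tree.write(config_path)
```

Deriving the count from the file itself avoids the two numbers drifting apart between extraction runs.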
preparer/progress.xml: clear values of the tags to start with a new Prosim-db
intersect/config.xml: set the max db size to 500 GB
\<max_db_size_gigabytes>500\</max_db_size_gigabytes>
Requirements: Issue #1 implemented