Closed azazel75 closed 5 months ago
Hi, thanks for reporting this issue. I haven't had the time yet to take a look into it.
Disabling the parallel evaluation by setting `pm_parallel` to False, like you did in your fork, would also disable the whole parallelization during the import, if I understand the parmap documentation correctly?
It seems to me that parmap is used only to parallelize the application of the anonymizations to each row, by splitting the work across multiple threads/processes, but maybe I'm missing something?
Same issue: `unique` is unusable with the parallelization, and after reading the code I come to the same conclusions as @azazel75 about the parmap usage.
How do we move forward on this issue?

* add an option to disable parallelism (e.g. `--no-parallel`, `--no-optimization`, etc.)
* remove the parmap usage
* do we need a benchmark to know the impacts?
* remove or limit support of `unique` (e.g. a warning in the documentation/code)
Sorry for the late response. I didn't have the time to check whether this is an issue of `parmap` itself or of its combination with `faker`. The parallelization was added as a contribution and a feature request for the anonymizer, so I guess removing it would not be an option for users who have to anonymize massive data.
If it turns out that uniqueness is not guaranteed with `parmap`, I would tend to disable it with an option flag or remove the support for `unique`.
But I am open to help (maybe there is a fix to get uniqueness with `parmap`?) and/or other suggestions.
> Sorry for the late response. I didn't have the time to check whether this is an issue of `parmap` itself or of its combination with `faker`. The parallelization was added as a contribution and a feature request for the anonymizer, so I guess removing it would not be an option for users who have to anonymize massive data.
It's not an issue of `parmap` alone, but of its combination with `faker`.
To keep uniqueness, `Faker` needs to track which values have already been used, and the same `Faker` instance needs to be used for every draw. But when running multiple processes with `parmap`, the `Faker` instance is recreated in each process.
Even if we pass a `Faker` instance as a `parmap` argument, the unique list won't be updated across processes.
To make `unique` work with multiple processes, we would need to share state between processes, which is not trivial (and may impact performance).
> If it turns out that uniqueness is not guaranteed with `parmap`, I would tend to disable it with an option flag or remove the support for `unique`.
+1 to adding an option to enable/disable multiprocessing.
I know it is a breaking change, but what do you think about disabling multiprocessing by default and adding an option to enable it (with a warning about `unique`)? IMHO multiprocessing is a more specific use case than the `unique` feature.
However, tell me your preference and I'll implement the flag.
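The proposed opt-in flag could look something like this (a sketch only; the flag name and warning text are hypothetical, not the project's actual CLI): multiprocessing off by default, with a warning when it is enabled because it breaks `unique` guarantees.

```python
# Hypothetical sketch of an opt-in parallelism flag (names are made up).
import argparse
import warnings

parser = argparse.ArgumentParser(prog="anonymizer")
parser.add_argument(
    "--parallel",
    action="store_true",  # default False: multiprocessing is opt-in
    help="enable multi-process anonymization (breaks 'unique' guarantees)",
)

# Simulate a user opting in on the command line.
args = parser.parse_args(["--parallel"])
if args.parallel:
    warnings.warn(
        "uniqueness of anonymized values is not guaranteed across processes",
        stacklevel=2,
    )
print(args.parallel)
```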
... due to the use of raw parallelization.
To prove it, run the following:
On my machine, when I run it I get the following values:
If it worked, it should have been
100 100