Open · einarbmag opened 10 months ago
Thanks for filing the issue. We tested with different spaCy models and found the results with the non-transformer models disappointing. The transformer model is definitely more resource-hungry, so it makes sense to try smaller models in a resource-constrained environment (if you have to).
Please feel free to file a PR. The spaCy model is set in: https://github.com/protectai/nbdefense/blob/d62274e835ee9411262e9c2d664d9a5195b40713/nbdefense/plugins/pii.py#L394
and settings can be changed in the following places: https://github.com/protectai/nbdefense/blob/d62274e835ee9411262e9c2d664d9a5195b40713/nbdefense/constants.py#L19 https://github.com/protectai/nbdefense/blob/d62274e835ee9411262e9c2d664d9a5195b40713/nbdefense/settings.py#L9
Is your feature request related to a problem? Please describe.
We want to install NB Defense in resource-constrained environments. The hard-coded en_core_web_trf requirement for PII detection takes up a significant amount of memory and ideally requires a GPU to run reasonably fast.
Describe the solution you'd like
I would like to be able to install any spaCy model I want (e.g. en_core_web_md) and specify which model to use for PII detection using an environment variable or CLI argument.
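To make the request concrete, here is a minimal sketch of how the model name could be resolved. It is not NB Defense's actual code: the environment variable name `NBDEFENSE_SPACY_MODEL`, the `--spacy-model` flag, and the helper names are all hypothetical. The only facts taken from this issue are that `en_core_web_trf` is the current default and that `en_core_web_md` is one possible alternative.

```python
import argparse
import os

# Current hard-coded model, per this issue; kept as the fallback default.
DEFAULT_MODEL = "en_core_web_trf"
# Hypothetical environment variable name, chosen only for illustration.
ENV_VAR = "NBDEFENSE_SPACY_MODEL"


def build_parser():
    """Parser with a hypothetical --spacy-model flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--spacy-model", default=None,
                        help="Name of an installed spaCy pipeline")
    return parser


def resolve_model_name(cli_value=None, environ=None):
    """Precedence: CLI argument, then environment variable, then default."""
    environ = os.environ if environ is None else environ
    return cli_value or environ.get(ENV_VAR) or DEFAULT_MODEL


def load_pii_model(model_name):
    """Load the chosen pipeline, failing clearly if it is not installed."""
    import spacy  # imported lazily so name resolution works without spaCy

    try:
        return spacy.load(model_name)
    except OSError as exc:  # spaCy raises OSError when a model is missing
        raise RuntimeError(
            f"spaCy model '{model_name}' is not installed; "
            f"try: python -m spacy download {model_name}"
        ) from exc


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_model_name(args.spacy_model))
```

With something like this, `NBDEFENSE_SPACY_MODEL=en_core_web_md` would select the smaller model, and an explicit `--spacy-model` value would take precedence over the environment variable.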