protectai / nbdefense

Secure Jupyter Notebooks and Experimentation Environment
Apache License 2.0

Add support for other spaCy models for PII detection #64

Open einarbmag opened 10 months ago

einarbmag commented 10 months ago

Is your feature request related to a problem? Please describe.

We want to install NB Defense in resource-constrained environments. The hard-coded en_core_web_trf requirement for PII detection takes up a significant amount of memory and ideally requires a GPU to run reasonably fast.

Describe the solution you'd like

I would like to be able to install any spaCy model I want (e.g. en_core_web_md) and specify which model to use for PII detection using an environment variable or CLI argument.
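To make the request concrete, here is a minimal sketch of how the model name could be resolved, assuming a CLI argument takes precedence over an environment variable, which in turn takes precedence over the current default. The environment variable name `NBDEFENSE_SPACY_MODEL`, the `--spacy-model` flag, and the helper functions are all hypothetical; none of them exist in NB Defense today. Only `spacy.load` is a real spaCy call.

```python
import argparse
import os

# The model the issue describes as currently hard-coded.
DEFAULT_MODEL = "en_core_web_trf"
# Hypothetical variable name, used here for illustration only.
ENV_VAR = "NBDEFENSE_SPACY_MODEL"


def resolve_model_name(cli_value=None, environ=None):
    """Return the model name: CLI value first, then env var, then default."""
    environ = os.environ if environ is None else environ
    return cli_value or environ.get(ENV_VAR) or DEFAULT_MODEL


def parse_args(argv=None):
    """Parse a hypothetical --spacy-model flag (not an existing NB Defense option)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--spacy-model", default=None)
    return parser.parse_args(argv)


def load_pipeline(model_name):
    """Load the chosen spaCy pipeline."""
    # Imported lazily so name resolution works without spaCy installed.
    import spacy

    # spacy.load raises OSError if the model package is not installed.
    return spacy.load(model_name)
```

With this in place, something like `load_pipeline(resolve_model_name(parse_args().spacy_model))` would pick up en_core_web_md when it is passed on the command line or set in the environment, and fall back to en_core_web_trf otherwise.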

badarahmed commented 10 months ago

Thanks for filing the issue. We tested with different spaCy models and found the results with the non-transformer models disappointing. The transformer model is definitely more resource-hungry, so it makes sense to try smaller models in a resource-constrained environment (if you have to).

Please feel free to file a PR. The spaCy model is set in: https://github.com/protectai/nbdefense/blob/d62274e835ee9411262e9c2d664d9a5195b40713/nbdefense/plugins/pii.py#L394

and settings can be changed in the following places:
https://github.com/protectai/nbdefense/blob/d62274e835ee9411262e9c2d664d9a5195b40713/nbdefense/constants.py#L19
https://github.com/protectai/nbdefense/blob/d62274e835ee9411262e9c2d664d9a5195b40713/nbdefense/settings.py#L9