Data classification, Sensitive data identification?

pandora-analysis / pandora

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

https://pandora.circl.lu/

GNU Affero General Public License v3.0


Data classification, Sensitive data identification? #392

Open juju4 opened 1 year ago

juju4 commented 1 year ago

It would be good if pandora could:

extract a data classification if one is present in the document
highlight if sensitive data matching known patterns is present: typically credentials, but also PII and PHI. A few tools that could be used:

Note: this could be useful for both file and text input. For example, a user could use an internal pandora instance to validate a text before sending it to an external LLM as a prompt, or to an online tool (spell checker, translator, or similar).
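To make the pattern-matching idea concrete, here is a minimal sketch in plain Python. It is not an existing pandora module and does not use any of the tools mentioned; the pattern set, the `SENSITIVE_PATTERNS` name and the `find_sensitive` helper are all hypothetical and only illustrate the shape of such a check.

```python
import re

# Illustrative patterns only (assumption): a real deployment would need a
# vetted, much larger set, ideally taken from a maintained tool.
SENSITIVE_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_sensitive(text: str) -> dict[str, int]:
    """Return the number of matches per pattern name, omitting patterns with no match."""
    counts = {}
    for name, pattern in SENSITIVE_PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[name] = hits
    return counts


if __name__ == "__main__":
    sample = "contact alice@example.org key AKIAIOSFODNN7EXAMPLE"
    print(find_sensitive(sample))
```

The result is only a hint for the person submitting the content: an empty result does not mean the text is free of sensitive data.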

Rafiot commented 11 months ago

Regarding data classification, can you explain in more detail what you mean? It might be possible if the classification is in the metadata, but I'm not sure how to do that efficiently in any other situation.

I'll look at the tools you mentioned, especially the yelp one as it is already a python module. If you're already working on a module, please let me know so I don't reinvent the wheel.

Just a note regarding the LLM part, and sharing with 3rd parties in general: I wouldn't trust anything automated to properly detect PII/secrets before sending them to a 3rd party blackbox, so this is never going to be supported officially by pandora. A human will always have to take responsibility for that kind of decision.

juju4 commented 11 months ago

I'm not looking to remove the human from the decision, just trying to help them make it. The idea was that if you have an internal pandora instance where, in the best case, people get used to submitting their office files, then a reminder in the same place that the file/content has a classification banner or classification metadata, or has been identified as containing sensitive data, would be a nice helper.

The classification identification outside of metadata would just be a text pattern match against some example scales (BAILS/BAF from https://help.libreoffice.org/latest/en-US/text/shared/guide/classification.html and TLP from https://www.first.org/tlp/) that could be customized to match internal naming.
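As a rough sketch of that text pattern match, assuming the TLP 2.0 labels as the default scale: the `DEFAULT_LABELS` list and the `build_label_regex` and `find_classifications` helpers below are hypothetical names for illustration, not part of pandora.

```python
import re

# Assumed default scale: the TLP 2.0 labels. Replace or extend the list
# with internal naming as needed.
DEFAULT_LABELS = ["TLP:RED", "TLP:AMBER+STRICT", "TLP:AMBER", "TLP:GREEN", "TLP:CLEAR"]


def build_label_regex(labels):
    """Compile one case-insensitive regex matching any of the given labels."""
    # Longest first so "TLP:AMBER+STRICT" is tried before "TLP:AMBER".
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(label) for label in ordered)
    # Lookarounds keep a label from matching inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_classifications(text, labels=DEFAULT_LABELS):
    """Return the distinct labels found in text, upper-cased, in first-seen order."""
    regex = build_label_regex(labels)
    seen = []
    for match in regex.finditer(text):
        label = match.group(0).upper()
        if label not in seen:
            seen.append(label)
    return seen


if __name__ == "__main__":
    print(find_classifications("Report marked tlp:amber+strict. Annex is TLP:GREEN."))
    print(find_classifications("Internal Use Only - draft", labels=["Internal Use Only"]))
```

Passing a custom `labels` list is how internal naming would be plugged in; the same function could run on extracted document text or on pasted text input.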

Not working on a module.