krayzpipes / txt-ferret

Identify and classify data in your text files with Python.
Apache License 2.0

txtferret

Description

Use custom regular expressions and sanity checks (e.g., the Luhn algorithm for account numbers) to find sensitive data in files of virtually any size from your command line.

Why use txtferret? See the How/why did this come about? section below.

Quick Start

PyPI

  1. Install it.

    $ pip3 install txtferret

Repo

  1. Clone it.
    $ git clone git@github.com:krayzpipes/txt-ferret.git
    $ cd txt-ferret
  2. Set up the environment.
    $ python3.7 -m venv venv
    $ source venv/bin/activate
  3. Install it.
    (venv) $ python setup.py install

Run it

Configuration

There are two ways to configure txt-ferret: write a custom configuration file (based on the default YAML file) to change settings or add filters, or override some settings with CLI switches.

Txt-ferret comes with a default config, which you can dump into any directory and then edit or keep for reference. If you change the file, you have to specify it with the appropriate CLI switch for the script to use it. See the CLI section below.

(venv) $ txtferret dump-config /file/to/write/to.yaml

There are two sections of the config file: filters and settings.

Filters

Filters are regular expressions with some metadata. You can use this metadata to run sanity checks on regex matches and sift out false positives (e.g., the Luhn algorithm for credit card numbers). You can also mask the matched string when it is logged to a file or displayed on a terminal.

filters:
- label: american_express_15_ccn
  pattern: '((?:34|37)\d{2}(?:(?:[\W_]\d{6}[\W_]\d{5})|\d{11}))'
  substitute: '[\W_]'
  exclude_patterns: ["dont_match_me", "dont_match_me_either"]
  sanity: luhn
  mask:
    index: 2
    value: XXXXXXXX
  type: Credit Card Number
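
To make the filter fields more concrete, here is a minimal Python sketch of how a filter like the one above could be applied to a single line of text. It is not txtferret's actual implementation. The function names (luhn_ok, apply_mask, scan_line) are hypothetical. The sketch also assumes that substitute is stripped from the match before the sanity check, that exclude_patterns are tested against the matched string, and that mask overwrites characters starting at index.

```python
import re

# Values copied from the example filter above.
PATTERN = re.compile(r'((?:34|37)\d{2}(?:(?:[\W_]\d{6}[\W_]\d{5})|\d{11}))')
SUBSTITUTE = re.compile(r'[\W_]')
EXCLUDE_PATTERNS = [re.compile(p) for p in ("dont_match_me", "dont_match_me_either")]
MASK_INDEX = 2
MASK_VALUE = "XXXXXXXX"


def luhn_ok(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    if not digits.isdigit():
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def apply_mask(text: str, index: int = MASK_INDEX, value: str = MASK_VALUE) -> str:
    """Overwrite characters with the mask value, starting at index (assumed semantics)."""
    return text[:index] + value + text[index + len(value):]


def scan_line(line: str) -> list:
    """Return masked matches that survive the exclusion and Luhn checks."""
    hits = []
    for match in PATTERN.finditer(line):
        candidate = match.group(1)
        # Assumption: exclusions are tested against the matched string.
        if any(p.search(candidate) for p in EXCLUDE_PATTERNS):
            continue
        # Assumption: separators are removed before the sanity check.
        digits = SUBSTITUTE.sub("", candidate)
        if luhn_ok(digits):
            hits.append(apply_mask(candidate))
    return hits
```

With these assumptions, scan_line("card: 378282246310005") returns ["37XXXXXXXX10005"] for that commonly published test number, while a candidate with a wrong final digit fails the Luhn check and is dropped.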

Settings

settings:
  mask: No
  log_level: INFO
  summarize: No
  output_file:
  show_matches: Yes
  delimiter:
  ignore_columns: [1, 5, 6]
  file_encoding: 'utf-8'
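
As a rough illustration of how the delimiter and ignore_columns settings could interact, the Python sketch below splits a line on the delimiter and skips the listed columns. The helper name columns_to_scan is hypothetical, and the sketch assumes that column numbers start at 1 and that a line is treated as a single column when no delimiter is set. txtferret's real behavior may differ.

```python
def columns_to_scan(line, delimiter, ignore_columns):
    """Return (column_number, text) pairs for the columns that should be scanned.

    Assumption: columns are numbered from 1, as in ignore_columns: [1, 5, 6].
    """
    if not delimiter:
        # No delimiter configured: treat the whole line as one column.
        return [(1, line)]
    ignored = set(ignore_columns)
    return [
        (number, field)
        for number, field in enumerate(line.split(delimiter), start=1)
        if number not in ignored
    ]


# Example using the ignore_columns value from the settings above.
kept = columns_to_scan("a,b,c,d,e,f,g", ",", [1, 5, 6])
```

Here kept holds columns 2, 3, 4 and 7, so only those fields would go on to be matched against the filters.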

How/why did this come about?

There are a few shortcomings with commercial Data Loss Prevention (DLP) products:

Txtferret was born after running into some of these limitations. It isn't perfect, but it's a great sanity check that can be paired with a DLP solution. Here are some things it was designed to do:

Releases

Version 0.3.0a - 2019-09-05

Version 0.2.1 - 2019-08-11

Version 0.1.3 - 2019-08-05

Development

Some info about development.

Running Tests

$ pytest txt-ferret/tests/

Contributing

Process

  1. Create an issue.
  2. Fork the repo.
  3. Do your work.
  4. WRITE TESTS
  5. Make a pull request.
    • Preferably, include the issue # in the pull request.

Style

License

See License