Identify and classify data in your text files with Python.

**Definition: txtferret**

Use custom regular expressions and sanity checks (e.g., the Luhn algorithm for account numbers) to find sensitive data in virtually any size file via your command line.

Why use txtferret? See the *How/why did this come about?* section below.
# Install it

Install from PyPI:

```
$ pip3 install txtferret
```

Or install from source:

```
$ git clone git@github.com:krayzpipes/txt-ferret.git
$ cd txt-ferret
$ python3.7 -m venv venv
$ source venv/bin/activate
(venv) $ python setup.py install
```
# Example

Example file size:

```
# Decent sized file.
$ ls -alh | grep my_test_file.dat
-rw-r--r-- 1 mrferret ferrets 19G May 7 11:15 my_test_file.dat
```
Scanning the file:

```
# Scan the file.
# Matched strings are shown unmasked by default; use -m to mask them.
$ txtferret scan my_test_file.dat
2019:05:20-22:18:01:-0400 Beginning scan for /home/mrferret/Documents/test_file_1.dat
2019:05:20-22:18:18:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_1.dat - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094
2019:05:20-22:19:09:-0400 Finished scan for /home/mrferret/Documents/test_file_1.dat
2019:05:20-22:19:09:-0400 SUMMARY:
2019:05:20-22:19:09:-0400 - Matched regex, failed sanity: 2
2019:05:20-22:19:09:-0400 - Matched regex, passed sanity: 1
2019:05:20-22:19:09:-0400 Finished in 78 seconds (~1 minutes).
```
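Under the hood, the idea is simple: stream the file line by line, apply each filter's regex, then run a sanity check on every match. Here is a minimal sketch of that loop (an illustration only, not txt-ferret's actual internals; the pattern and `passes_sanity` hook are hypothetical):

```python
import re

# Hypothetical filter: a regex plus a sanity-check hook.
PATTERN = re.compile(r"\d{15,16}")

def passes_sanity(candidate: str) -> bool:
    # Stand-in for a real check such as the Luhn algorithm
    # (a sketch of Luhn itself appears in the Filters section).
    return True

with open("my_test_file.dat", encoding="utf-8") as fh:
    # Iterating over the file handle streams it line by line,
    # so even a 19G file never has to fit in memory.
    for line_no, line in enumerate(fh, start=1):
        for match in PATTERN.finditer(line):
            if passes_sanity(match.group(0)):
                print(f"Line {line_no}: {match.group(0)}")
```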
Scanning the file with a delimiter:

```
# Break up each line of a CSV into columns by declaring a comma as the delimiter.
# Scan each field in the row and return column numbers as well as line numbers.
$ txtferret scan --delimiter , test_file_1.csv
2019:05:20-21:41:57:-0400 Beginning scan for /home/mrferret/Documents/test_file_1.csv
2019:05:20-21:44:34:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_1.csv - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094, Column: 171
2019:05:20-21:49:16:-0400 Finished scan for /home/mrferret/Documents/test_file_1.csv
2019:05:20-21:49:16:-0400 SUMMARY:
2019:05:20-21:49:16:-0400 - Matched regex, failed sanity: 2
2019:05:20-21:49:16:-0400 - Matched regex, passed sanity: 1
2019:05:20-21:49:16:-0400 Finished in 439 seconds (~7 minutes).
```
Bulk scanning a directory. Progress is shown on stdout while `-o` also writes results to a file:

```
# Uses multiprocessing to speed up scans of a bulk group of files.
$ txtferret scan -o bulk_testing.log --bulk ../test_files/
2019:06:09-15:15:27:-0400 Detected non-text file '/home/mrferret/Documents/test_file_1.dat.gz'... attempting GZIP mode (slower).
2019:06:09-15:15:27:-0400 Detected non-text file '/home/mrferret/Documents/test_file_2.dat.gz'... attempting GZIP mode (slower).
2019:06:09-15:15:27:-0400 Beginning scan for /home/mrferret/Documents/test_file_1.dat.gz
2019:06:09-15:15:27:-0400 Beginning scan for /home/mrferret/Documents/test_file_2.dat.gz
2019:06:09-15:15:27:-0400 Beginning scan for /home/mrferret/Documents/test_file_3.dat
2019:06:09-15:15:27:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_2.dat.gz - Filter: fake_ccn_account_filter, Line 4, String: 100102030405060708094
2019:06:09-15:15:27:-0400 Finished scan for /home/mrferret/Documents/test_file_2.dat.gz
2019:06:09-15:16:04:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_3.dat - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094
2019:06:09-15:16:51:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_1.dat.gz - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094
2019:06:09-15:17:15:-0400 Finished scan for /home/mrferret/Documents/test_file_3.dat
2019:06:09-15:19:24:-0400 Finished scan for /home/mrferret/Documents/test_file_1.dat.gz
2019:06:09-15:19:24:-0400 SUMMARY:
2019:06:09-15:19:24:-0400 - Scanned 3 file(s).
2019:06:09-15:19:24:-0400 - Matched regex, failed sanity: 16
2019:06:09-15:19:24:-0400 - Matched regex, passed sanity: 3
2019:06:09-15:19:24:-0400 - Finished in 236 seconds (~3 minutes).
2019:06:09-15:19:24:-0400 FILE SUMMARIES:
2019:06:09-15:19:24:-0400 Matches: 1 passed sanity checks and 2 failed, Time Elapsed: 236 seconds / ~3 minutes - /home/mrferret/Documents/test_file_1.dat.gz
2019:06:09-15:19:24:-0400 Matches: 1 passed sanity checks and 3 failed, Time Elapsed: 0 seconds / ~0 minutes - /home/mrferret/Documents/test_file_2.dat.gz
2019:06:09-15:19:24:-0400 Matches: 1 passed sanity checks and 2 failed, Time Elapsed: 107 seconds / ~1 minutes - /home/mrferret/Documents/test_file_3.dat
```
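The bulk mode's speedup comes from fanning scans out across worker processes. A minimal sketch of that pattern using the standard library (illustrative only; txt-ferret's internals may differ, and `scan_file` is a hypothetical stand-in):

```python
from multiprocessing import Pool
from pathlib import Path

def scan_file(path: Path) -> str:
    # Stand-in for a full per-file scan; returns a one-line summary.
    return f"scanned {path}"

if __name__ == "__main__":
    files = [p for p in Path("../test_files/").iterdir() if p.is_file()]
    # One worker process per CPU by default; files are scanned in
    # parallel and summaries are printed as each scan finishes.
    with Pool() as pool:
        for summary in pool.imap_unordered(scan_file, files):
            print(summary)
```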
# Configuration

There are two ways to configure txt-ferret: change or add filters in a custom configuration file (based on the default YAML file), or change some settings via CLI switches.

Txt-ferret comes with a default config which you can dump into any directory you wish, then change or use for reference. If you change the file, you must specify it with the appropriate CLI switch for the script to use it. See the CLI section below.

```
(venv) $ txtferret dump-config /file/to/write/to.yaml
```
There are two sections in the config file: `filters` and `settings`.
## Filters

Filters are regular expressions with some metadata. You can use this metadata to perform sanity checks on regex matches to sift out false positives (e.g., the Luhn algorithm for credit card numbers). You can also mask the matched string as it is logged to a file or displayed on a terminal.
```yaml
filters:
  - label: american_express_15_ccn
    pattern: '((?:34|37)\d{2}(?:(?:[\W_]\d{6}[\W_]\d{5})|\d{11}))'
    substitute: '[\W_]'
    exclude_patterns: ["dont_match_me", "dont_match_me_either"]
    sanity: luhn
    mask:
      index: 2
      value: XXXXXXXX
    type: Credit Card Number
```
- **pattern**
  - The regular expression must be compatible with the `re` module in the standard library.
  - Use non-capturing groups: for example, `'(555-(867|555)-5309)'` should be written as `'(555-(?:867|555)-5309)'` (note the `?:`).
- **exclude_patterns**
  - If a string matches the filter's regex but also matches one of these patterns, it will not be included in the results.
- **substitute**
  - Characters matching this pattern (here `'[\W_]'`) are stripped from the matched string before the sanity check runs, so delimited numbers can still be validated.
- **sanity**
  - The `luhn` algorithm will validate that matched numbers could potentially be account numbers and reduce false positives (a sketch follows this list).
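For reference, here is one common way to implement the Luhn check that `sanity: luhn` refers to (a generic sketch, not necessarily txt-ferret's exact code):

```python
def luhn_passes(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any double greater than 9, and sum; the number
    is valid if the total is divisible by 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# A well-formed test number passes; changing one digit breaks it.
assert luhn_passes("4111111111111111")
assert not luhn_passes("4111111111111112")
```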
## Settings

```yaml
settings:
  mask: No
  log_level: INFO
  summarize: No
  output_file:
  show_matches: Yes
  delimiter:
  ignore_columns: [1, 5, 6]
  file_encoding: 'utf-8'
```
- **bulk** (CLI only)
  - Scan a whole directory of files with the `-b` or `--bulk` switch.

```
$ txtferret scan --bulk /home/mrferret/Documents
```
- **mask**
  - Masking is off by default; the `-m` switch can be used to turn it on.

```
$ txtferret scan -m ../fake_ccn_data.txt
2017:05:20-00:24:52:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 10XXXXXXXXXXXXXXXXXXX

$ txtferret scan ../fake_ccn_data.txt
2017:05:20-00:26:18:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094
```
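The `index` in a filter's `mask` block says where masking starts (here, after the first two characters). How txt-ferret combines `index` and `value` internally isn't shown in this document; the hypothetical helper below just reproduces the shape of the masked log output above:

```python
def mask_match(matched: str, index: int = 2, mask_char: str = "X") -> str:
    # Keep the first `index` characters and overlay everything after them.
    return matched[:index] + mask_char * (len(matched) - index)

print(mask_match("100102030405060708094"))
# -> 10XXXXXXXXXXXXXXXXXXX
```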
- **summarize**
  - The `-s` switch will kick off a summary at the end of the scan.

```
$ txtferret scan ../fake_ccn_data.txt
2019:05:20-00:36:00:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094

$ txtferret scan -s ../fake_ccn_data.txt
2019:05:20-01:05:29:-0400 SUMMARY:
2019:05:20-01:05:29:-0400 - Matched regex, failed sanity: 1
2019:05:20-01:05:29:-0400 - Matched regex, passed sanity: 1
2019:05:20-01:05:29:-0400 Finished in 0 seconds (~0 minutes)
```
- **output_file**
  - Use the `-o` switch to set an output file.

```
$ txtferret scan -o my_output.log file_to_scan.txt
```
- **delimiter**
  - Set a delimiter to scan per column instead of per line; use the `-d` switch on the command line.
  - For non-printable characters, use `b` followed by the hex code. For example, `b1` will use Start of Header (`\x01` in hex) as the delimiter.

```
$ txtferret scan ../fake_ccn_data.txt
2019:05:20-00:36:00:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094

$ txtferret scan -d , ../fake_ccn_CSV_file.csv
2019:05:20-01:12:18:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094, Column: 3
```
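Per-column scanning amounts to splitting each record on the (possibly non-printable) delimiter before matching. A small sketch using Start of Header, as in the `b1` example (the pattern and sample record are hypothetical):

```python
import re

PATTERN = re.compile(r"\d{15,}")
DELIMITER = "\x01"  # Start of Header, the `b1` example above

record = "alice\x01100102030405060708094\x01ohio"
# Split the record into columns, then report matches per column
# (1-indexed, as in txtferret's log output).
for col_no, field in enumerate(record.split(DELIMITER), start=1):
    if PATTERN.search(field):
        print(f"Column {col_no}: {field}")
```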
- **ignore_columns**
- This setting is ignored if the `delimiter` setting or switch is not set.
- Add a list of integers and txtferret will skip those columns.
  - If `ignore_columns: [2, 6]` is configured and a CSV row is `hello,world,how,are,you,doing,today`, then
    `world` and `doing` will be ignored rather than scanned.
- This is particularly useful in columnar datasets when you know there is a column that is full of false positives.
- **file_encoding**
- Two uses:
- Used to encode your `delimiter` value to the appropriate encoding of your file.
  - Used to encode data matched in the file before the sanity check is applied.
  - Default value is `'utf-8'`.
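Both uses boil down to converting between bytes and text with the configured codec. Plain Python, just to illustrate the two conversions:

```python
encoding = "utf-8"  # the file_encoding setting

# Use 1: encode the delimiter so it can be compared against file bytes.
delimiter_bytes = ",".encode(encoding)        # b','

# Use 2: decode matched bytes back to text before the sanity check.
matched_bytes = b"100102030405060708094"
candidate = matched_bytes.decode(encoding)    # '100102030405060708094'
print(delimiter_bytes, candidate)
```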
# How/why did this come about?
Txt-ferret was born after running into a few shortcomings with commercial Data Loss Prevention (DLP) products. It isn't perfect, but it's a great sanity check which can be paired with a DLP solution. Here are some things it was designed to do:
- Sanity checks: the `luhn` algorithm will sift out many false positives for credit card numbers. The matched credit card number is run through the `luhn` algorithm first; if it doesn't pass, it is discarded.

Notable changes along the way:

- Added `exclude_patterns` to filters.
- Changed `tokenize` to `mask`, because tokenize was a lie.. it's masking. Replaced the `--no-tokenize` switch with the `--mask` switch.
- Added the `file_encoding` setting for multi-encoding support. Uses the `'utf-8'` encoding by default.
- Added the `substitute` option to filters.
- Added the `config-override` option.
- Added the `ignore_columns` setting.
- Added the `--bulk` switch.

# Development

Some info about development: run the test suite with pytest.
```
$ pytest txt-ferret/tests/
```
# License

See the LICENSE file.