mailgun / talon


Not able to use Custom Classifier #203

Open yatin-rxlogix opened 4 years ago

yatin-rxlogix commented 4 years ago

I used the following statement to train a classifier on my custom data set, but I am not able to use this custom classifier for signature extraction. Can somebody help me figure out where I am going wrong?

Rokfordchez commented 4 years ago

Suppose you create your own dataset in "/path/to/your/P/folder" and run:

build_extraction_dataset(os.path.join(settings.BASE_DIR, 'data', 'P'),
                         os.path.join(get_python_lib(), 'talon/signature/data/train.data'))

build_extraction_dataset overwrites the file 'talon/signature/data/train.data' with the data from your "/path/to/your/P/folder".
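For build_extraction_dataset to have something to read, the P folder has to contain annotated emails. The sketch below writes one such sample using only the standard library. The helper name write_annotated_email is hypothetical, and the `<name>_body` / `<name>_sender` file naming is an assumption about what talon's dataset module expects, so check it against your installed version. The `#sig#` marker is the one mentioned in this thread.

```python
import os
import tempfile

SIGNATURE_ANNOTATION = "#sig#"  # marker for signature lines, as named in this thread


def write_annotated_email(folder, name, sender, lines, signature_lines):
    """Write <name>_body and <name>_sender into folder (assumed naming scheme).

    Lines whose index is in signature_lines are prefixed with #sig#.
    Returns the path of the body file.
    """
    os.makedirs(folder, exist_ok=True)
    annotated = [
        SIGNATURE_ANNOTATION + line if i in signature_lines else line
        for i, line in enumerate(lines)
    ]
    body_path = os.path.join(folder, name + "_body")
    with open(body_path, "w", encoding="utf-8") as f:
        f.write("\n".join(annotated) + "\n")
    with open(os.path.join(folder, name + "_sender"), "w", encoding="utf-8") as f:
        f.write(sender)
    return body_path


# Demo in a throwaway folder; in practice this would be your P folder.
demo_folder = tempfile.mkdtemp()
demo_body = write_annotated_email(
    demo_folder,
    "0001",
    "Jane Doe <jane@example.com>",
    ["Hi team,", "The report is attached.", "", "--", "Jane Doe", "Acme Corp"],
    signature_lines={3, 4, 5},
)
```

Marking the signature lines by index keeps the raw email text separate from the annotation, which makes it easier to re-annotate a sample later.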

Then you train the classifier on the new 'talon/signature/data/train.data':

c.train(c.init(), os.path.join(get_python_lib(), 'talon/signature/data/train.data'),
        os.path.join(get_python_lib(), 'talon/signature/data/classifier'))

Executing this code overwrites 'talon/signature/data/classifier'.
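Because both steps overwrite files inside the installed package, it may be worth keeping copies of the stock 'train.data' and 'classifier' first. This is a minimal sketch, not part of talon: backup_files is a hypothetical helper, and data_dir stands for whatever directory holds those two files in your install. If your talon version stores extra files next to 'classifier', add them to names.

```python
import os
import shutil
import tempfile


def backup_files(data_dir, names=("train.data", "classifier"), suffix=".orig"):
    """Copy each existing file to <name><suffix> unless that backup already exists.

    Returns the list of backup paths created by this call.
    """
    created = []
    for name in names:
        src = os.path.join(data_dir, name)
        dst = src + suffix
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            created.append(dst)
    return created


# Demo against a throwaway directory standing in for talon/signature/data.
backup_dir = tempfile.mkdtemp()
for fname in ("train.data", "classifier"):
    with open(os.path.join(backup_dir, fname), "w", encoding="utf-8") as f:
        f.write("stock " + fname)

first_run = backup_files(backup_dir)
second_run = backup_files(backup_dir)  # nothing left to back up
```

Skipping files that already have a backup means repeated retraining runs never replace the original stock copy with an already-modified one.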

When you call talon.init(), it executes:

def init():
    register_xpath_extensions()
    if ML_ENABLED:
        signature.initialize()

signature.initialize() is defined as:

EXTRACTOR_FILENAME = os.path.join(DATA_DIR, 'classifier')
EXTRACTOR_DATA = os.path.join(DATA_DIR, 'train.data')

def initialize():
    extraction.EXTRACTOR = classifier.load(EXTRACTOR_FILENAME,
                                           EXTRACTOR_DATA)

In extraction.py, _mark_lines passes EXTRACTOR as the classifier to is_signature_line.

So, after training, EXTRACTOR_DATA and EXTRACTOR_FILENAME already reflect your raw email data annotated with #sig#. Once you call talon.init(), the classifier you trained is the one in use.
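If the custom classifier still does not seem to take effect, one thing to check is whether the retraining actually touched the files that talon.init() loads. The sketch below compares modification times. files_older_than is a hypothetical helper, and data_dir is assumed to be the directory behind talon's DATA_DIR in your environment, which may differ from the get_python_lib() path if several Python installs are present.

```python
import os
import tempfile


def files_older_than(data_dir, since, names=("train.data", "classifier")):
    """Return the names that are missing or were last modified before `since`.

    `since` is a POSIX timestamp, e.g. time.time() recorded just before training.
    """
    stale = []
    for name in names:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or os.path.getmtime(path) < since:
            stale.append(name)
    return stale


# Demo: fake data directory with one old file and one fresh file.
check_dir = tempfile.mkdtemp()
for fname, mtime in (("train.data", 1000), ("classifier", 2000)):
    path = os.path.join(check_dir, fname)
    with open(path, "w", encoding="utf-8") as f:
        f.write("placeholder")
    os.utime(path, (mtime, mtime))  # set access and modification times explicitly

stale_names = files_older_than(check_dir, since=1500)
```

An empty result means both files were written after the chosen timestamp. A non-empty one suggests the training run wrote somewhere other than the directory talon reads from.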