stickeritis / sticker

Succeeded by SyntaxDot: https://github.com/tensordot/syntaxdot

Support of custom features #161

Open bratao opened 5 years ago

bratao commented 5 years ago

Hello, thank you for this awesome library. I'm very impressed by its quality.

My task is to extract and segment information from semi-structured text. The text itself is not the only signal, though: some "external features" are also important.

For example, imagine that I want to segment this text about GitHub projects into category, project name, URL, and description.

I use a BIO scheme to tag each HTML token with a category.

Where:

token=NLP start_of_p=True bold=True center=True B-Category
token=Projects start_of_p=False bold=True center=True I-Category
token=Project start_of_p=True bold=True center=False B-Project-name
token=Name start_of_p=False bold=True center=False I-Project-name
token=: start_of_p=False bold=False center=False I-Project-name
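
To make the annotation scheme above concrete, here is a minimal Python sketch that parses rows of this shape into per-token feature dictionaries and groups the BIO tags into labeled spans. The helper names (`parse_row`, `bio_spans`) and the in-memory representation are assumptions made for illustration; they are not part of sticker or of any input format it supports.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory form of one annotated token: (features, BIO tag).
# This mirrors the example rows in this issue; it is not a sticker format.
Row = Tuple[Dict[str, str], str]


def parse_row(line: str) -> Row:
    """Split 'key=value key=value ... TAG' into a feature dict and the tag."""
    *pairs, tag = line.split()
    # Split on the first '=' only, so a token such as ':' or '=' survives.
    features = dict(pair.split("=", 1) for pair in pairs)
    return features, tag


def bio_spans(rows: List[Row]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tagged tokens into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    tokens: List[str] = []
    for features, tag in rows:
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            # Close the span that was open, if any.
            if label is not None:
                spans.append((label, " ".join(tokens)))
            # B- (or a stray I-) opens a new span; O leaves none open.
            label = name if prefix in ("B", "I") else None
            tokens = []
        if label is not None:
            tokens.append(features["token"])
    if label is not None:
        spans.append((label, " ".join(tokens)))
    return spans


LINES = [
    "token=NLP start_of_p=True bold=True center=True B-Category",
    "token=Projects start_of_p=False bold=True center=True I-Category",
    "token=Project start_of_p=True bold=True center=False B-Project-name",
    "token=Name start_of_p=False bold=True center=False I-Project-name",
    "token=: start_of_p=False bold=False center=False I-Project-name",
]

rows = [parse_row(line) for line in LINES]
print(bio_spans(rows))
# [('Category', 'NLP Projects'), ('Project-name', 'Project Name :')]
```

The formatting flags (`start_of_p`, `bold`, `center`) stay attached to each token as plain string features, which is the kind of per-token side information the question is about.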

The final result is something like:

Note that some features matter here, such as text formatting (italic, bold, centered), position in the text, and more.

Is there any way to use these custom external features when training sticker?

Thank you!

danieldk commented 5 years ago

Excellent question! This is currently not possible and would be hard to add without breaking compatibility with existing models. We are currently working towards a stable 1.0 version, which should be done in several months at most. Until 1.0 is released (and then maintained as a separate branch), compatibility is the highest priority.

Once 1.0 is branched, we can start working again on features that change the configuration format, graph placeholders, etc. One of the larger plans for sticker 2 is to make sticker more flexible in the inputs (freer-form features) and outputs (multi-task prediction) that can be configured.

This may all take a while, because we also want to investigate a switch to libtorch for sticker 2, which should probably be done before adding new features (with a TensorFlow-specific implementation).

I will start a sticker-2 milestone and add this issue to it.