joshuacnf / Ctrl-G

Add populate_edge() in utils.py. #5

Closed brunabazaluk closed 2 months ago

brunabazaluk commented 3 months ago

This simple function receives a list of words accepted by an edge and returns the bitset that represents the corresponding tokens.

brunabazaluk commented 3 months ago

Would it be OK to add to this PR an update to the tutorial, giving an example that uses the function to create a DFA? It would be something like this:

dfa_graph = {
    "edges": [
        (0, 1, ctrlg.populate_edge(["A", "B"], vocab_size, tokenizer)),
        (1, 0, ctrlg.populate_edge(["+", "-", "*", "/"], vocab_size, tokenizer)),
        (1, 2, ctrlg.populate_edge(["="], vocab_size, tokenizer)),
        (2, 2, ctrlg.populate_edge(vocab_size=vocab_size, ALL=True)),
    ],
    "initial_state": 0,
    "accept_states": {2},
}

joshuacnf commented 3 months ago

Different tokenizers can behave very differently, and some words are tokenized into multiple tokens, whereas each edge in the DFA should consist of a list of single tokens. This function is therefore probably not suitable for most applications. I will include a small example for custom DFAs in the README. Thanks.
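
To make the multi-token concern concrete, here is a minimal sketch (not code from this PR or from Ctrl-G). It assumes a hypothetical toy vocabulary and a greedy longest-match encoder, and it represents the edge as a plain list of booleans rather than whatever bitset format the library actually uses. The helper name `single_token_edge` is made up for illustration. Words that do not map to exactly one token are reported instead of being silently added to the edge.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy vocabulary; a real tokenizer's vocabulary will differ.
TOY_VOCAB: Dict[str, int] = {"A": 0, "B": 1, "+": 2, "-": 3, "*": 4, "/": 5, "=": 6}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoding over TOY_VOCAB (a stand-in for a real tokenizer)."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot encode {text[i:]!r}")
    return ids


def single_token_edge(
    words: Sequence[str],
    encode: Callable[[str], List[int]],
    vocab_size: int,
) -> Tuple[List[bool], List[str]]:
    """Return (mask, skipped): mask[t] is True if token t is accepted by the edge;
    skipped lists the words that did not encode to exactly one token."""
    mask = [False] * vocab_size
    skipped: List[str] = []
    for word in words:
        ids = encode(word)
        if len(ids) == 1:
            mask[ids[0]] = True
        else:
            skipped.append(word)  # multi-token word: cannot sit on a single edge
    return mask, skipped


mask, skipped = single_token_edge(["A", "B", "AB"], toy_encode, len(TOY_VOCAB))
accepted = [token_id for token_id, ok in enumerate(mask) if ok]
print(accepted, skipped)
```

With a Hugging Face tokenizer, `encode` could plausibly be `lambda w: tokenizer.encode(w, add_special_tokens=False)`, though whether a word such as `"A"` and its leading-space variant `" A"` map to single tokens depends on the tokenizer, which is the point raised above.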