openpeeps / toktok

Generic tokenizer written in Nim language 👑 Powered by std/lexbase and Nim's Macros
https://openpeeps.github.io/toktok/
MIT License

Error: undeclared identifier: 'TK_UNKNOWN' #1

Closed JewishLewish closed 1 year ago

JewishLewish commented 1 year ago

(screenshot: Error: undeclared identifier: 'TK_UNKNOWN')

import tables
import toktok
import lexbase

tokens:
    Plus      > '+'
    Minus     > '-'
    Multi     > '*'
    Div       > '/'
    Assign    > '='
    Comment   > '#' .. EOL      # anything from `#` to end of line
    CommentAlt > "/*" .. "*/"   # anything starting with `/*` to `*/`
    Var       > "var"
    Let       > "let"
    Const     > "const"
    BTrue     > @["TRUE", "True", "true", "YES", "Yes", "yes", "y"]
    BFalse    > @["FALSE", "False", "false", "NO", "No", "no", "n"]

when isMainModule:
    var lex = Lexer.init(fileContents = readFile("sample.txt"))
    if lex.hasError:
        echo lex.getError
    else:
        while true:
            var curr = lex.getToken()           # tuple[kind: TokenKind, value: string, wsno, col, line: int]
            echo curr

I get these errors when just running the default sample code

JewishLewish commented 1 year ago

Looking inside the code, the problem seems to come from this part: (screenshot of the relevant source)

georgelemon commented 1 year ago

Thanks! Just updated the README.

Call the settings proc before the tokens macro, with uppercase enabled and this prefix. I will look for a solution for the hardcoded TK_UNKNOWN.

static:
    Program.settings(
        uppercase = true,
        prefix = "Tk_"
    )
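For reference, the full fix applied to the sample above would look roughly like this — a sketch assuming the toktok API shown in this thread (Program.settings, the tokens macro, Lexer.init, getToken, and the generated TK_EOF kind); the EOF break is an addition so the loop terminates:

```nim
import toktok

# Assumption: settings must run at compile time, before the `tokens` macro,
# so the generated enum fields get the expected TK_ prefix.
static:
    Program.settings(
        uppercase = true,
        prefix = "Tk_"
    )

tokens:
    Plus   > '+'
    Minus  > '-'
    Multi  > '*'
    Div    > '/'

when isMainModule:
    var lex = Lexer.init(fileContents = readFile("sample.txt"))
    if lex.hasError:
        echo lex.getError
    else:
        while true:
            let curr = lex.getToken()
            echo curr
            if curr.kind == TK_EOF: break   # stop at end of file
```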

For more inspiration you can check Tim Engine https://github.com/openpeep/tim

JewishLewish commented 1 year ago

There also seems to be an error with adding values.

I added "LCol" and "RCol" to the token list:

tokens:
    Plus      > '+'
    Minus     > '-'
    Multi     > '*'
    Div       > '/'
    LCol      > '('
    RCol      > ')'

However, when it outputs values, it doesn't seem to recognize them:

(kind: TK_LCOL, value: "", wsno: 0, line: 1, col: 0, pos: 0)
(kind: TK_STRING, value: "test", wsno: 0, line: 1, col: 1, pos: 1)
(kind: TK_RCOL, value: "", wsno: 0, line: 1, col: 7, pos: 7)

The file contents are:

("test")
georgelemon commented 1 year ago

Seems fine to me, you have TK_LCOL (pos 0) and TK_RCOL (pos 7).

If you're talking about the empty value field, there is no need to store the ( and ) characters.

JewishLewish commented 1 year ago

Ah I didn't notice the "pos" / .col syntax. In that case, it works perfectly!

JewishLewish commented 1 year ago

Is there a Program.settings for removing whitespace?

georgelemon commented 1 year ago

No. Whitespaces are counted and stored in the wsno field.
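Since wsno stores the whitespace count before each token, spacing can be reconstructed (or simply ignored) on the consumer side. A tiny sketch, assuming the TokenTuple fields shown elsewhere in this thread:

```nim
import std/strutils   # for `repeat`

# Sketch: rebuild the original spacing of one line from wsno values.
# `TokenTuple` is assumed to be toktok's token type with `wsno` and `value`.
proc renderLine(tokens: seq[TokenTuple]): string =
    for t in tokens:
        result.add(" ".repeat(t.wsno))   # re-insert the counted whitespace
        result.add(t.value)
```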

georgelemon commented 1 year ago

Not added in the docs, but here is how you can handle tokens of 2 or more characters, for multiple use cases.

So, for example, > which can also be >=:

tokens:
    GT           > '>':
        GTE      ? '='

    LT           > '<':
        LTE      ? '='

    # string based tokens TK_AT, TK_INCLUDE and TK_MIXIN
    At           > '@':
        Include  ? "include"
        Mixin    ? "mixin"
JewishLewish commented 1 year ago

> Not added in docs, but here is how you can handle 2 or more characters for multiple use cases […]

I mean if I had:

echo "Test"

echo "Test2"

How do you separate multiple lines when it doesn't even tokenize NEWLINE ("\n")? The only way I can see (or at least can try) is to add ";" at the end of each command, so that when the code sees it, it breaks there, runs the current command, and continues with the next.

georgelemon commented 1 year ago

There is no need to tokenize new lines or whitespace. Imagine how many useless tokens there would be. That's why you have the wsno and line fields in each TokenTuple returned by the getToken proc.

So, based on your example, the output will be:

(kind: TK_ECHO, value: "echo", wsno: 0, line: 1, col: 0, pos: 0)
(kind: TK_STRING, value: "Test", wsno: 1, line: 1, col: 5, pos: 5)
(kind: TK_ECHO, value: "echo", wsno: 0, line: 2, col: 0, pos: 0)
(kind: TK_STRING, value: "Test2", wsno: 1, line: 2, col: 5, pos: 5)
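Building on this, one way to split statements at the parser level is to watch the line field and start a new statement whenever it changes — a minimal sketch, assuming the TokenTuple layout, getToken proc, and TK_EOF kind shown in this thread:

```nim
# Sketch: group tokens into per-line statements using the `line` field.
# Assumes `lex` is an initialized toktok Lexer.
var stmts: seq[seq[TokenTuple]] = @[]
var current: seq[TokenTuple] = @[]
var tok = lex.getToken()
while tok.kind != TK_EOF:
    if current.len > 0 and tok.line != current[^1].line:
        stmts.add(current)      # the previous line is a complete statement
        current = @[]
    current.add(tok)
    tok = lex.getToken()
if current.len > 0:
    stmts.add(current)          # flush the final statement
```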
georgelemon commented 1 year ago

Anything else should be implemented at Parser level and AST

JewishLewish commented 1 year ago

> There is no need for tokenizing new lines or whitespaces. […] That's why you have wsno and line field for each TokenTuple returned by getToken proc.

That is smart, but there is still the challenge of telling the code "this is a new line." I could make it check each token's line number and, when the line changes, execute the current statement and continue. 🤔

JewishLewish commented 1 year ago

Edit: I am starting to understand what "wsno" is now; I was confused about where it was getting that value. That is smart!!!

My man you truly have won the internet.

georgelemon commented 1 year ago

Haha. Glad you like this! There are still many things to do.

And this would be a minimal parser for your toktok

import toktok            # provides Lexer, TokenTuple and the TK_* token kinds
import std/strutils      # provides the `%` string-formatting operator used below

type
    Parser* = object
        lex: Lexer
        prev, current, next: TokenTuple
        error: string

proc setError(p: var Parser, msg: string) =
    ## Set parser error
    p.error = "Error ($2:$3): $1" % [msg, $p.current.line, $p.current.pos]

proc walk(p: var Parser, offset = 1) =
    var i = 0
    while offset != i:
        p.prev = p.current
        p.current = p.next
        p.next = p.lex.getToken()
        inc i

var p = Parser(lex: Lexer.init(fileContents = readFile("sample.txt")))
p.current = p.lex.getToken()
p.next = p.lex.getToken()

while p.current.kind != TK_EOF and p.error.len == 0:
    case p.current.kind:
    of TK_LET, TK_VAR, TK_CONST:
        let this = p.current
        if p.next.kind == TK_IDENTIFIER:
            discard # handle var declaration
        else:
            p.setError("Invalid variable declaration, expected identifier")
            break
    of TK_PLUS, TK_MINUS, TK_MULTI, TK_DIV:
        discard # handle math
    else: discard # and so on
    walk(p) # walk to next token

if p.error.len != 0:
    echo p.error
JewishLewish commented 1 year ago

There also seems to be an error with the block comment.

When I put:

*This is a test*

/* This is a test*/

I get:

TK_MULTI
TK_IDENTIFIER
TK_IDENTIFIER
TK_IDENTIFIER
TK_IDENTIFIER
TK_MULTI
TK_DIV
TK_MULTI
TK_IDENTIFIER
TK_IDENTIFIER
TK_IDENTIFIER
TK_IDENTIFIER
TK_MULTI
TK_DIV
TK_EOF

However, the "#" seems to work perfectly.

TK_COMMENT
TK_EOF
georgelemon commented 1 year ago

Yep. That does not work right now. I hope it will be fixed soon.

georgelemon commented 1 year ago

Ok, now this should work. Reinstall your toktok package.

tokens:
    Div       > '/':
        BlockComment ? '*' .. "*/"      # everything starting from /* to */ (tail should be a string)
        InlineComment ? '/' .. EOL      # everything starting from // to EOL (end of line)

This is a work in progress (markdown use case):

tokens:
    H1   > '#' .. EOL:
        H2 ? '#' .. EOL
        H3 ? '#' .. EOL
JewishLewish commented 1 year ago

The reason why I was asking is that I am having trouble designing a truly functioning if statement / while statement. Making one work is easy, but making one work INSIDE another is difficult.

The tokenizer needs some form of indicator of which part of the code is INSIDE another part of the code. E.g.:

If "True" == "True" {
  //Code Here
}

The {} can be used, but in a situation where we put two if statements inside one another, that becomes complicated. That's why I asked if there was a way to strip whitespace or merge lines, similar to: If "Test" == "Test" {//Code Here}

georgelemon commented 1 year ago

This is a generic lexer, that's why you should write this kind of logic at parser level.

By the way, instead of tables, I recommend you implement AST nodes using Nim objects.

I made a functional Parser + AST nodes, based on the current toktok Lexer, that does what you need. You can start from this: https://github.com/openpeep/toktok/blob/main/examples/program.nim

if 2 == 2 { /* something cool */ }

will produce the following AST:

{
  "nodes": [
    {
      "nodeName": "NTCondition",
      "nodeType": 3,
      "ifCond": {
        "nodeName": "NTInfix",
        "nodeType": 6,
        "infixOp": 1,
        "infixLeft": {
          "nodeName": "NTInt",
          "nodeType": 0,
          "intVal": 2
        },
        "infixOpSymbol": "EQ",
        "infixRight": {
          "nodeName": "NTInt",
          "nodeType": 0,
          "intVal": 2
        }
      },
      "ifBody": {
        "nodeName": "NTStmtList",
        "nodeType": 7,
        "stmtList": [
          {
            "nodeName": "NTBlockComment",
            "nodeType": 5,
            "comment": " something cool "
          }
        ]
      },
      "elseBody": null,
      "elifBranch": []
    }
  ]
}

Meanwhile, this can't be valid:

if 1 >= 1 { // this fails}

so it will throw an error:

Error (4:26): EOF reached before closing condition body

Pretty simple
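The nesting problem raised earlier (an if inside an if) is usually solved by making the block parser recursive, so each balanced pair of braces is consumed by one call. A minimal sketch building on the Parser object above, with hypothetical token names (TK_LCURLY, TK_RCURLY, TK_IF) and helper procs (parseCondition, parseStatement) that are not part of toktok itself:

```nim
# Hypothetical sketch: recursive block parsing handles arbitrary nesting,
# because each nested `if` simply recurses one level deeper.
proc parseBlock(p: var Parser) =
    walk(p)                              # consume the opening '{'
    while p.current.kind notin {TK_RCURLY, TK_EOF}:
        if p.current.kind == TK_IF:
            parseCondition(p)            # nested condition: recurse
        else:
            parseStatement(p)
    if p.current.kind == TK_EOF:
        p.setError("EOF reached before closing condition body")
    else:
        walk(p)                          # consume the closing '}'
```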

JewishLewish commented 1 year ago

> This is a generic lexer, that's why you should write this kind of logic at parser level. […] You can start from this: https://github.com/openpeep/toktok/blob/main/examples/program.nim

Oh damn, that's beautiful. I myself am not that big in the field of computer science, but I do love programming and data science.

I started working on a side-project programming language called "Barcelona" that is supposed to make database/statistics/request data much easier and straight to the point. It is an experimental project that mixes Rust's memory-safety tools, Python's simple syntax, and Nim's performance. In addition, it has built-in translators to convert code to Python or Rust, or to make Python files execute Barcelona code so the request process is quicker.

I am going to have a 6-week college break, so I plan to work seriously on Barcelona, but I am new to understanding how a programming language works.

Any advice would be appreciated.

georgelemon commented 1 year ago

Nim is a good start. Also thanks for trying toktok. Hobby projects are awesome.

Maybe you don't have to write a micro language, you have Nim macros! https://nim-lang.org/docs/macros.html

Also, check Computer Programming with the Nim Programming Language

JewishLewish commented 1 year ago

> Nim is a good start. […] Maybe you don't have to write a micro language, you have Nim macros!

Nim is an interesting language; I started learning it a few days ago and it looks pretty good. I love how it's a mixture of Python's simplicity, C's performance, and Lisp's flexibility. The fact that you can compile code to C, C++, and Objective-C (and also JS) is really cool.

The only problem I am having is extracting certain elements from a list. E.g. with Python lists you can do List[3:-1], which gets everything from the 3rd value to the second-to-last value. Nim doesn't seem to have that ability (or I can't find it).

Also, Windows Defender seems to flag it as a virus after version 1.4.0 (no idea why). Cool and unique language to learn.

georgelemon commented 1 year ago

Regarding Windows Defender, that's a false positive: https://github.com/nim-lang/Nim/issues/17820. Report it as a false positive, if possible.

You mean something like this?

var a = @["bread", "cake", "tomorrow"]
echo a[1 .. ^1]     # output @["cake", "tomorrow"]
echo a[2 .. ^1]     # output @["tomorrow"]
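For the exact Python List[3:-1] behaviour: Python's -1 end index is exclusive, so the Nim equivalent is ^2 (the inclusive second-to-last index), and ..< gives a Python-style half-open slice. Plain Nim, no extra packages:

```nim
let a = @[10, 20, 30, 40, 50, 60]
echo a[3 .. ^2]    # like Python a[3:-1] -> @[40, 50]
echo a[1 ..< 3]    # half-open slice, like Python a[1:3] -> @[20, 30]
```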

Bookmark this: String Functions: Nim vs Python. Most of these examples also work with seq and array / openarray.

JewishLewish commented 1 year ago

> Regarding Windows Defender, that's a false positive. […]

Ah, that would make a lot of sense tbh. I recall there was a time when you could trigger people's Windows Defender via Discord just by sending pictures/videos.

> You mean something like this?

Something like that! Thank you!

JewishLewish commented 1 year ago

(screenshot of the error)

I tried to make a separate Nim file to import from the main Nim file, and it seems to give an interesting error.

JewishLewish commented 1 year ago

Update: it appears the error is caused by importing std/strutils from the 2nd file (not the main file).

JewishLewish commented 1 year ago

Also: (screenshot of another error)

Update: this was fixed!

JewishLewish commented 1 year ago

Do you think you can add the token for float numbers?

The token for: 2.0 4 4.0

would be:

(kind: TK_INTEGER, value: "2", wsno: 0, line: 1, col: 0, pos: 0)
(kind: TK_PERIOD, value: "", wsno: 0, line: 1, col: 1, pos: 1)
(kind: TK_INTEGER, value: "44", wsno: 0, line: 1, col: 2, pos: 2)
(kind: TK_PERIOD, value: "", wsno: 0, line: 1, col: 5, pos: 5)
(kind: TK_INTEGER, value: "0", wsno: 0, line: 1, col: 6, pos: 6)
(kind: TK_EOF, value: "", wsno: 0, line: 1, col: 7, pos: 7)

georgelemon commented 1 year ago

Thanks for that! Update your local toktok

JewishLewish commented 1 year ago

Another suggestion I would offer is the idea of multi-threading. Do you think it would be possible to implement it?

JewishLewish commented 1 year ago

(screenshot of the warning)

Not sure why, but it gives this "duplicate" import warning from toktok.

georgelemon commented 1 year ago

Try using toktok in a separate file, say "tokens.nim", where you define your tokens. Then import tokens.nim in your parser.

tokens.nim

import toktok

tokens:
  # here

parser.nim

import ./tokens