reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License

Fails to read UTF8-BOM encoded files #578

Closed JorikMA closed 4 months ago

JorikMA commented 4 months ago

When trying to read some .sql scripts I ran into the following error: Unable to lex characters: 'CREATE' Line 1, Position 1: Found unparsable section: 'CREATE VIEW ...

As I did not see any special characters, I opened the file in a hex editor and found EF BB BF as the first 3 bytes. Apparently this is the Byte Order Mark, or BOM (see: https://stackoverflow.com/questions/44098326/ef-bb-bf-at-the-beginning-of-json-files-created-in-visual-studio)
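For anyone who wants to check a file without a hex editor, here is a minimal sketch using only the Python standard library. The helper name `has_utf8_bom` is made up for illustration and is not part of sqllineage.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF.

    Hypothetical helper for illustration; not part of sqllineage.
    """
    with open(path, "rb") as f:
        # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf"
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Calling `has_utf8_bom("test.sql")` on the two sample files should tell them apart: only the one saved as "UTF-8 with BOM" starts with those three bytes.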

UTF-8 encoded file:

```
PS C:\Users\username> sqllineage -f "C:\Users\username\OneDrive - company\Bureaublad\test.sql" -d tsql
Statements(#): 1
Source Tables:
    <default>.sometable
Target Tables:

```

UTF-8 with BOM encoded file:

```
PS C:\Users\username> sqllineage -f "C:\Users\username\OneDrive - company\Bureaublad\test.sql" -d tsql
Traceback (most recent call last):
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\Scripts\sqllineage.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\sqllineage\cli.py", line 127, in main
    runner.print_table_lineage()
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\sqllineage\runner.py", line 179, in print_table_lineage
    print(str(self))
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\sqllineage\runner.py", line 26, in wrapper
    self._eval()
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\sqllineage\runner.py", line 199, in _eval
    stmt_holder = analyzer.analyze(stmt, session.metadata_provider)
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\sqllineage\core\parser\sqlfluff\analyzer.py", line 46, in analyze
    statement_segments = self._list_specific_statement_segment(sql)
  File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\sqllineage\core\parser\sqlfluff\analyzer.py", line 84, in _list_specific_statement_segment
    raise InvalidSyntaxException(
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL:
SELECT col1, col2 FROM someTable
Unable to lex characters: 'SELECT'
Line 1, Position 1: Found unparsable section: 'SELECT col1, col2 FROM someTable'
```

Sample files: [sqllineage-test.zip](https://github.com/reata/sqllineage/files/14178632/sqllineage-test.zip)

Versions:

- Python 3.10.6
- sqllineage 1.5.1

maoxingda commented 4 months ago

This is not a sqllineage problem, but a sqlfluff problem.

reata commented 4 months ago

Please watch issue #482. We will provide an encoding option so that users can open SQL files saved with a non-default encoding.

Once that feature is delivered, you can read the SQL file with the "utf_8_sig" encoding, which automatically strips the leading BOM (the three bytes EF BB BF). The parser then receives clean SQL text and no longer throws an exception.
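Until then, a possible workaround is to decode the file yourself and hand the resulting text to sqllineage. The sketch below uses only the Python standard library; `read_sql_text` is a made-up helper name, not a sqllineage function. Python's `utf-8-sig` codec drops a leading BOM if one is present and otherwise behaves like plain `utf-8`, so the same call should work for both kinds of file.

```python
from pathlib import Path


def read_sql_text(path, encoding="utf-8-sig"):
    """Read a SQL script, dropping a leading UTF-8 BOM if present.

    Hypothetical helper for illustration; not part of sqllineage.
    """
    # "utf-8-sig" strips the BOM on decode; files without a BOM are
    # decoded exactly as they would be with "utf-8".
    return Path(path).read_text(encoding=encoding)
```

The returned string could then be passed to the Python API instead of using `-f` on the command line, for example `LineageRunner(read_sql_text("test.sql"), dialect="tsql")` with `LineageRunner` imported from `sqllineage.runner`. I have not tested this against the sample files attached above.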