p4lang / p4c

P4_16 reference compiler
https://p4.org/
Apache License 2.0

UTF8 BOM vs P4C #3837

Open apinski-cavium opened 1 year ago

apinski-cavium commented 1 year ago

When the preprocessor is not run, p4test does not accept files that start with a UTF8 BOM. I know the P4 language spec says the source is written in ASCII but ASCII is a subset of UTF8 so I had expected this to work. The only place where you might run into a difference between ASCII and UTF8 is inside string literals, which are already documented as being passed through without any change.

The reason this works with the preprocessor is that both GCC and clang output preprocessed source files without the BOM. So it just works.

apinski@xeond:~/src/p4/octeontxkpu$ ../p4c/build/p4test --nocpp  ut8-bom.p4
ut8-bom.p4(0):syntax error, unexpected UNEXPECTED_TOKEN
�
^
[--Werror=overlimit] error: 1 errors encountered, aborting compilation
apinski@xeond:~/src/p4/octeontxkpu$ !od
od -c ut8-bom.p4
0000000 357 273 277  \n   #   i   n   c   l   u   d   e       <   c   o
0000020   r   e   .   p   4   >  \n  \n   /   /       {       d   g   -
0000040   w   a   r   n   i   n   g       "   m   a   i   n   "       "
0000060   "       {       t   a   r   g   e   t       *   -   *   -   *
0000100       }       0       }  \n
0000107

vlstill commented 1 year ago

> I know the P4 language spec says the source is written in ASCII but ASCII is a subset of UTF8 so I had expected this to work.

Well, the UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF (the UTF-8 encoding of U+FEFF; these are the octal bytes 357 273 277 in the od output), so it is not ASCII. But my personal inclination is that this should still be supported, especially if it works with the preprocessor. At the very least, p4c can strip the BOM.

I'm not sure how useful UTF-8 would be in P4, though (maybe in comments?), since there is only very limited use of strings in P4. Did you use UTF-8 anywhere in the source apart from the BOM?
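
To illustrate the "strip the BOM" suggestion, here is a minimal C++ sketch (C++ being the language p4c is written in). It is not p4c's actual code: `stripUtf8Bom` is a hypothetical helper, and the sketch assumes the whole source file has already been read into a `std::string` before it reaches the lexer.

```cpp
#include <string>

// Hypothetical helper (not part of p4c): return `text` without a leading
// UTF-8 BOM (bytes 0xEF 0xBB 0xBF). Input that does not start with a BOM,
// including input shorter than three bytes, is returned unchanged.
std::string stripUtf8Bom(std::string text) {
    static const std::string bom = "\xEF\xBB\xBF";
    // compare() only looks at the first bom.size() bytes of `text`.
    if (text.compare(0, bom.size(), bom) == 0) {
        text.erase(0, bom.size());
    }
    return text;
}
```

Applied to the ut8-bom.p4 contents shown in the od dump, this would drop the three leading bytes and leave the rest of the file, starting with the newline before `#include <core.p4>`, untouched. Only a BOM at the very start of the input is removed; any later occurrence of the same bytes is kept.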