lydell / js-tokens

Tiny JavaScript tokenizer.

Escaped newlines between multi-line strings cause merging of different strings #38

Closed BenjaminNolan closed 1 year ago

BenjaminNolan commented 1 year ago

We're writing a JS util to parse CLI commands, and whilst testing various inputs I realised that in a multi-line command with escaped newlines, if an escaped newline falls between two multi-line strings that use the same quote character, the tokens are mishandled: the quotes on either side of the escaped newline get treated as part of a single string argument.

This code produces ' --data ' as a single token, whereas I think the escaped newline should act as a token break rather than as the middle of a string.

import jsTokens from "js-tokens";

// Snipped from a multi-line command, e.g. `curl -X POST url \
// --data etc`
const exampleInput = `
--data '{
  \
}' \
--data '{
  \
}'
`;

const tokens = jsTokens(exampleInput);
console.log(Array.from(tokens));

/* Output:
[
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'Punctuator', value: '--' },
  { type: 'IdentifierName', value: 'data' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'StringLiteral', value: "'{", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "' --data '", closed: true },
  { type: 'Punctuator', value: '{' },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "'", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' }
]
*/
import jsTokens from "js-tokens";

const expectedOutputInput = `
--data '{
  \
}'
--data '{
  \
}'
`;

const tokens = jsTokens(expectedOutputInput);
console.log(Array.from(tokens));

/* Expected output:
[
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'Punctuator', value: '--' },
  { type: 'IdentifierName', value: 'data' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'StringLiteral', value: "'{", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "'", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'Punctuator', value: '--' },
  { type: 'IdentifierName', value: 'data' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'StringLiteral', value: "'{", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "'", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' }
]
*/

Any idea whether there's a way to force the tokeniser to handle this, or is it a bug, as I suspect?

lydell commented 1 year ago

Hi!

First, a small piece of advice: always try your reproduction steps yourself. Your code fails with:

const tokens = jsTokens(expectedOutputInput);
      ^

SyntaxError: Identifier 'tokens' has already been declared

No big deal though, I could easily fix it by renaming the second constant.
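
For example, the second block could simply use a different name for its result (just a sketch of the rename):

const tokens2 = jsTokens(expectedOutputInput);
console.log(Array.from(tokens2));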

This piece of code:

const exampleInput = `
--data '{
  \
}' \
--data '{
  \
}'
`;

Actually means:

const exampleInput = `
--data '{
  }' --data '{
  }'
`;

I checked through the tokens for that, and they look as expected.
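
You can verify the line continuation behaviour with a quick check (a minimal sketch, runnable in any modern JS engine):

console.log(`a\
b`);
// Logs "ab": the backslash-newline pair inside a template literal is a
// LineContinuation and contributes nothing to the string's value.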

Also note that your example input is nowhere near valid JavaScript. It looks more like you're trying to parse bash code – a piece of a CLI call with flags. That feels weird to me without knowing more!

I then didn’t look at your second example and your expected tokens, because it felt like they would be wrong now that I’d found some oddities with the first example.

Can you clarify? Thanks!

BenjaminNolan commented 1 year ago

There should've been two separate code blocks there, not one. My bad. :)

So, yeah, this is the minimal example that triggers the error. The examples themselves are somewhat contrived, as we need to handle about a million edge cases; this just happens to be the one I noticed it on, and I reduced what was happening to the minimal example above. The similarity of the content to JSON is a red herring: the data could be any arbitrary text, for example Markdown content or partial JSON structures.

As I mentioned in the preamble, we are indeed tokenising Command Line Interface / bash commands. Part of our interface takes in a cURL command and converts it to an object, similar to what Postman does when you import a cURL command via their import dialog. The actual test command that started all this is below; if you put it into their import dialog, it correctly parses the two data segments as separate strings with a --data flag between them, and combines the two content strings with an & in the body. (NB, they actually handle --data incorrectly when using --get, and they ignore --url-query completely.)

The reason this feels like an unintended result is that the tokeniser handles a \ followed by a newline character outside of a string as an Invalid token with the content \, followed by a LineTerminatorSequence, so I'd expect the same to happen after the first multi-line string is closed. Also, I'm now wondering whether an escape followed by a newline should be collapsed within a string at all, as that changes the content of the string... Thoughts?
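
For reference, here's a minimal sketch of the Invalid-token behaviour described above (the expected tokens reflect my understanding, not verified output):

import jsTokens from "js-tokens";

// A lone backslash outside of a string doesn't match any other token type,
// so per the behaviour described above it should surface as an Invalid
// token followed by a separate LineTerminatorSequence token.
console.log(Array.from(jsTokens("a \\\nb")));

/* Expected (my understanding):
[
  { type: 'IdentifierName', value: 'a' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'Invalid', value: '\\' },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'IdentifierName', value: 'b' }
]
*/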

curl -X POST "https://localhost:9191/echo-chamber" \
--header "Authorization: Bearer TEST_TOKEN" \
-H "Content-Type: application/json" \
-H "Prefer: application/json" \
-H "Via: something" \
-H "Connection: test" \
--url-query name=Merlin \
--url-query breed=dogs \
--url-query age=2 \
--data 'Test1
  Test2  \
Test3' \
--data '{
    "records": [
      {
        "fields": {
          "Address": "1 Twothree St",
          "Name": "St John\'s Wood",
          "Visited": true
        }
      }
    ]
  }'

Please note this test command is not supposed to result in correct handling on the recipient side; it is intended as one of many test cases for parsing the CLI command itself. The actual command is perfectly valid.

lydell commented 1 year ago

Now I’m thoroughly confused and I don’t understand anything at all.

If you can make a small, as-simple-as-possible piece of JS code that I can run and that shows me your problem as if I were 5 years old, that would be great.

lydell commented 1 year ago

Closing because no response.