tagua-vm / parser

Safe, fast and memory efficient PHP parser (lexical and syntactic analysers, and the Abstract Syntax Tree)
http://tagua.io/
119 stars 15 forks source link

Literal: *doc strings can have spaces before ID #33

Closed Hywan closed 8 years ago

Hywan commented 8 years ago

See example in https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#heredoc-string-literals:

$s = <<<    ID
S'o'me "\"t e\txt; \$v = $v"
Some more text
ID;

Currently, spaces before ID are not supported.

nikic commented 8 years ago

Slightly off topic but: https://github.com/tagua-vm/parser/blob/master/source/rules/literals.rs#L217 => This can also be 'B'. It also looks like b/B handling is missing for nowdocs (which also support it).

nikic commented 8 years ago

And both things are not in the spec :(

Hywan commented 8 years ago

Damn… Could you open issues on this repository please?

Hywan commented 8 years ago

And good catch!

nikic commented 8 years ago

I've opened #43

nikic commented 8 years ago

Another inaccuracy in the spec is that <<<'A'\nA\n (with only a single newline between 'A' and A) is a valid nowdoc string (the empty string). Is this properly supported?

Spec fix: https://github.com/php/php-langspec/commit/8e8df911b732cc08ca0ae45f788c918147b9d007

nikic commented 8 years ago

The current nowdoc implementation currently doesn't handle CRLF newlines.

Hywan commented 8 years ago

Quoting https://github.com/tagua-vm/parser/blob/05ff4487d6180da665e7022c91f491382cd194c8/source/rules/literals.rs#L1016-L1024:

    #[test]
    fn case_string_nowdoc_empty() {
        let input  = b"<<<'FOO'\n\nFOO\n";
        let output = Result::Done(&b""[..], Literal::String(Vec::new()));

        assert_eq!(string_nowdoc(input), output);
        assert_eq!(string(input), output);
        assert_eq!(literal(input), output);
}

So yes, it is supported.

But you raised another issue where we can have \r\n instead of \n only? Where are they located exactly? We can have <<<'FOO'\r\nFOO\n which is valid (note there are 2 forms)?

nikic commented 8 years ago

@Hywan The test only tests b"<<<'FOO'\n\nFOO\n", but not b"<<<'FOO'\nFOO\n" (which represent the same string).

Hywan commented 8 years ago

Hmm. Is it too much to ask you to open issues with detailed examples please?

Hywan commented 8 years ago

@nikic About CRLF, the grammar defines a new line as \n, \r or \r\n. I know OS that uses \n (all Unix), \r\n (most Windows) but no one with \r. Why this one?

After some searches, it appears that the following OS define \r alone: Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Oberon, Mac OS up to version 9, MIT Lisp Machine and OS-9. I am pretty sure we can drop them safely since this is not a binary target for the compiler.

Also, there is \n\r on Acorn BBC and RISC OS spooled text output, or 0x9b on Atari 8-bit machines using ATASCII variant of ASCII (155 in decimal). This is very strange forms.

So, is there a good reason to support \r? It adds complexity into the parser, especially in error management.

nikic commented 8 years ago

@Hywan I guess this is just one of those things that made sense 20 years ago, but is pretty pointless now.

Hywan commented 8 years ago

Thanks for your feedback 😄.