nelmio / alice

Expressive fixtures generator
MIT License

Consider a third-party compiler/lexer for the ExpressionLanguage #601

Open theofidry opened 8 years ago

theofidry commented 8 years ago

As mentioned in #600, the current lexer/parser of the Expression Language is completely custom. While it does the job, I can't say I'm very proud of the implementation and it's far from my field of expertise. Relying on a third-party library for that task would make sense, maybe:

And possibly others (I didn't take the time to look into them properly).

The goal of this component is to be able to transform values as described in #377. Maybe a more detailed example is the actual integration test of the Expression Language parser: ParserIntegrationTest

theofidry commented 8 years ago

@Hywan as the maintainer of the HoaCompiler, WDYT of this choice here, is it something the compiler would be good at?

Hywan commented 8 years ago

Hello and thanks for considering Hoa\Compiler 😃!

Analysing a language and compiling it into something else is the essence of Hoa\Compiler, so yes. You can write your own grammar thanks to the PP grammar description language (see an example from the README.md).

The grammar might be minimalist if I am correctly reading your examples. The visitor will be simple too. A simple example you might want to look at is the Hoa\Ruler library. The grammar is very minimalist and you have several compilers (called visitors here because they visit the produced AST), like interpreter to compile from text to in-memory object model, or compiler to compile from in-memory object model to PHP code.

So this is a big yes 😉.


That said, Hoa\Compiler provides mechanisms you will love.

A grammar is used to represent any kind of data. Thus, we can use it to validate data (which is the classical usage), or to… generate data. This was a big part of my PhD thesis about Praspel. Long story short, with a grammar expressed in PP and one of three algorithms, you can generate data that matches the grammar. I am copy-pasting the example from the README.md:

$sampler = new Hoa\Compiler\Llk\Sampler\Coverage(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random())
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
 * Will output:
 *     0 => true
 *     1 => {" )o?bz " : null , " %3W) " : [false, 130    , " 6"   ]  }
 *     2 => [{" ny  " : true } ]
 *     3 => {" Ne;[3 " :[ true , true ] , " th: " : true," C[8} " :   true }
 */

This approach and these algorithms are used to do what we call: Grammar-based Testing. See the research paper here:

Several people (like @jubianchi or @vonglasow) are using this approach to generate test data or to populate a database. They write a grammar, they generate data based on this grammar and boom. The most common example I hear is describing a JSON payload with a grammar and generating data from it.

There are 3 algorithms. They are described in the hack book of Hoa\Compiler.

Considering the goal of your project, these algorithms can be very very… very useful for you.


There is one more thing… To be able to generate data from a grammar, we need to be able to generate data for the tokens. Token values are represented by PCRE. So… you guessed it, we are able to generate data based on a regular expression. See the Hoa\Regex library, it shows one example. I am copy-pasting the most interesting part here:

// 1. Read the grammar.
$grammar  = new Hoa\File\Read('hoa://Library/Regex/Grammar.pp');

// 2. Load the compiler.
$compiler = Hoa\Compiler\Llk\Llk::load($grammar);

// 3. Lex, parse and produce the AST.
$ast      = $compiler->parse('ab(c|d){2,4}e?');

// 4. Set up the sampler.
$generator = new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random());

// 5. To infinity and beyond!
echo $generator->visit($ast);

/**
 * Could output:
 *     abdcde
 */

I don't mean to make some advertisements here, but I really think it can provide really cool features.

theofidry commented 8 years ago

Thanks for the detailed answer @Hywan :)

I hope I'll have time to look into this soon. To be completely transparent this part is not exactly my priority now as I have still quite a lot to do for alice, AliceDataFixtures and HautelookAliceBundle. The priority being stabilising the three libraries and easing the migration.

I would however love to have the time and energy to look into it before the stable release, as it would avoid going stable with the whole Expression Language marked as internal. That said, maybe someone else is ready to tackle this RFC :P

Hywan commented 8 years ago

We can if needed. If you play the role of the PO, draft all the issues etc., I am sure we could find time to help :-).

theofidry commented 8 years ago

Hehe I need to update the doc, but otherwise I think that, for a developer, the best doc is ParserIntegrationTest. Anything about how this result is produced is internal and can be changed completely.

There is definitely a scenario or two missing. I tried to be as exhaustive as possible, but I'm not a machine and the sheer number of combinations cannot all be covered. Still, it gives a good base I would say.

Hywan commented 8 years ago

@theofidry Where is the grammar defined?

theofidry commented 8 years ago

That's the thing: there is no proper grammar system. Basically there's a lexer (which has its own share of tests) which transforms expressions into Tokens like:

yield '[Escaped arrow] surrounded' => [
            'foo \< bar \> baz', // input
            [ // expected
                new Token('foo ', new TokenType(TokenType::STRING_TYPE)),
                new Token('\<', new TokenType(TokenType::ESCAPED_VALUE_TYPE)),
                new Token(' bar ', new TokenType(TokenType::STRING_TYPE)),
                new Token('\>', new TokenType(TokenType::ESCAPED_VALUE_TYPE)),
                new Token(' baz', new TokenType(TokenType::STRING_TYPE)),
            ],
        ];

The parser then parses each value according to the type of its token.

So as of now, it's pretty manual hence the desire to change to something more standard :)
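
To make the lexing step described above more concrete, here is a minimal sketch of a lexer that only knows about escaped arrows. It is not alice's actual implementation: the function name lexEscapedArrows is hypothetical, and tokens are plain [value, type] pairs rather than Token objects.

```php
<?php
// Hypothetical, simplified lexer: NOT alice's real lexer.
// It only distinguishes escaped arrows ("\<", "\>") from plain strings.
function lexEscapedArrows(string $input): array
{
    // Split on "\<" or "\>" and keep the delimiters in the result.
    $parts = preg_split(
        '/(\\\\[<>])/',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($parts as $part) {
        $isEscaped = $part === '\\<' || $part === '\\>';
        // Each token is a [value, type] pair; the type names mirror the test above.
        $tokens[] = [$part, $isEscaped ? 'ESCAPED_VALUE_TYPE' : 'STRING_TYPE'];
    }

    return $tokens;
}

$tokens = lexEscapedArrows('foo \< bar \> baz');
```

A parser can then loop over these pairs and dispatch on the type, which is the manual part a grammar-based tool would replace.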

Hywan commented 8 years ago

I see. I guess the users have a documentation with all the possible syntax?

theofidry commented 8 years ago

Yep, #377, which may be slightly outdated right now, and ParserIntegrationTest. Tests are a big part of the doc here, for better or worse :/

Hywan commented 8 years ago

Great! I don't have time right now but I will try to find some. Maybe some Hoackers could help me. What's your schedule?

theofidry commented 8 years ago

I hope to have finished most of it by the end of the month. Then it will be a few updates or bugfixes here and there and let it live for 2-3 months before a stable release.

theofidry commented 7 years ago

@Hywan I took a look at the Compiler this weekend, and it looks like a good solution to replace the in-house lexer. I still have a few issues with your PP language but I think it's just a matter of getting familiar with it. I'm not sure yet whether I should do it before or after the stable release. A little question though: why are hoa projects not semver?

Hywan commented 7 years ago

@theofidry Funny, I opened your issue this weekend too 😛. I can help to write the grammar (in PP) if you need help.

Hoa libraries are compatible with semver, but here is the answer: https://hoa-project.net/En/Source.html#Rush_Release.

theofidry commented 7 years ago

Cool :) I'll push a POC soon so we can discuss it then :)

Hywan commented 7 years ago

Perfect! Please, ping me.

kgilden commented 5 years ago

@theofidry perhaps there are other out-of-the-box workarounds.

Rather than coming up with a special language (which users would have to spend time learning), what if the project adopts the Expression Language?

# before
Is\Bundle\PlanBundle\Entity\Event:
  event_bare (template):
    title: <sentence(3)>
    show: '@show_*'
    rooms[0]: '@room_*'
    startDateTime: '<dateTimeBetween("-1 month", "+4 month")>'
    endDateTime: '<dateTimeInInterval($startDateTime, "+4 hours")>'
    isDraft: false
    version: '10%? @version_*'
    tags 25%?: ['<randomElement(@tag_{0..3})>']
    __calls:
      - setRevenue (25%?): ['<moneyBetween(10000, 300000)>']
      - setVisitorCount (25%?): ['<numberBetween(100, 500)>']
# after
Is\Bundle\PlanBundle\Entity\Event:
  event_bare (template):
    title: faker.sentence(3)
    show: alice.one('show_*')
    rooms: faker.randomElements(alice.some('room_*'), faker.randomNumber(1, 2))
    startDateTime: faker.dateTimeBetween('-1 month', '+4 months')
    endDateTime: faker.dateTimeInInterval(this.startDateTime, '+4 hours')
    isDraft: false
    version: alice.sometimes(0.1, alice.one('version_*'))
    tags: alice.sometimes(0.25, faker.randomElements(alice.some('tag_*')))
    revenue: alice.sometimes(0.25, myown.moneyBetween(10000, 300000))
    visitorCount: alice.sometimes(0.25, faker.numberBetween(100, 500))

In a nutshell I'd propose these changes. These are just some thoughts that came up as I was thinking about this.

  1. Use the expression language to make complex expressions possible.
  2. In the expression language expose all Alice-related functions under alice. and Faker-related functions under faker.. Currently it's very difficult to know which documentation to consult.
  3. In the expression language let this point to the current fixture.
  4. Keep the syntax for matching related fixtures.
     4.1 To reference a single fixture, use alice.one('potato_*') or alice.one('potato_{0..3}'). If more than one fixture is matched, pick one randomly.
     4.2 To reference multiple fixtures, use alice.some('potato_*'). That would return all fixtures that match the pattern.
  5. Custom functions could either be exposed under custom. or just globally.
  6. For optional values, just use a function rather than syntax, so foo 25%?: potato would become foo: alice.sometimes(0.25, potato) or something similar.

What are your thoughts? Surely, this is a breaking change, but I think that this change would let maintainers focus more of their time on features rather than having to wrestle with the idiosyncrasies of the syntax.
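
As a rough illustration of point 6, an optional value expressed as a function could look like the sketch below. The sometimes() helper is hypothetical (it does not exist in alice), and it takes a callable so the value is only computed when it is actually used.

```php
<?php
// Hypothetical helper sketching the proposed alice.sometimes(); not part of alice.
function sometimes(float $probability, callable $factory)
{
    // mt_rand(1, 100) is uniform over 1..100, so this branch is taken
    // in roughly ($probability * 100)% of calls.
    if (mt_rand(1, 100) <= $probability * 100) {
        return $factory();
    }

    // When the value is skipped, the caller receives null.
    return null;
}

// Roughly 10% of the time this yields 'version_1', otherwise null.
$version = sometimes(0.1, function () {
    return 'version_1';
});
```

Note that a function can only return null when the value is skipped; it cannot leave the property untouched, which matters for keys such as tags 25%?.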

theofidry commented 5 years ago

Hi @kgilden.

It's an interesting proposal indeed. A couple of notes however:

from:

version: '10%? @version_*'
tags 25%?: ['<randomElement(@tag_{0..3})>']

to:

version: alice.sometimes(0.1, alice.one('version_*'))
tags: alice.sometimes(0.25, faker.randomElements(alice.some('tag_*')))

I am not sure this is equivalent. For version the new syntax is indeed correct, but not for tags, since in 25% of cases tags won't be set at all, rather than receiving null.

The same goes for __calls, which is a clearly separate step whose result may be re-used. Unlike hydration, the calls here point to methods, whereas during hydration the hydrator may use the property directly (even if it is private, depending on your config).

I however don't think it invalidates your suggestion.
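
The difference between "not set at all" and "set to null" can be sketched as follows. This is only an illustration under simplified assumptions: resolveOptionalKeys() is a hypothetical function, not alice's denormalizer, and it only understands keys of the form "name N%?".

```php
<?php
// Hypothetical sketch, NOT alice's denormalizer: resolves keys such as
// "tags 25%?" by either keeping the property or dropping it entirely.
function resolveOptionalKeys(array $properties): array
{
    $resolved = [];
    foreach ($properties as $key => $value) {
        $matched = preg_match('/^(?<name>\S+)\s+(?<percent>\d+)%\?$/', (string) $key, $m);
        if ($matched === 1) {
            if (mt_rand(1, 100) > (int) $m['percent']) {
                // Skipped: the property is absent, so no setter would be called.
                continue;
            }
            $key = $m['name'];
        }
        $resolved[$key] = $value;
    }

    return $resolved;
}

$resolved = resolveOptionalKeys([
    'title 0%?' => 'never set',
    'tags 100%?' => ['always set'],
    'isDraft' => false,
]);
// 'title' is missing from $resolved; it is not present with a null value.
```

A hydrator working from $resolved never touches title, whereas a function-based alice.sometimes() would hand it a null to assign.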

I like the idea, but I have mixed feelings since:

kgilden commented 5 years ago

Thanks @theofidry,

Cool that you're considering this. And apologies for not being quite rigorous in my proposal. I suppose the gist of my proposal is to replace the current syntax with the Expression Language.

It is a big BC break since it completely changes the syntax for the users which remained unchanged since 1.x.

Agreed that this would be a big BC break and I hate them as much as any other dev. Maybe it would be possible to keep BC by introducing syntax versions (user specifies on top of the file which version of the syntax they prefer to use, i.e. version: 1).

If we break it that much, I'm wondering if we are not better off going a step further and going for PHP based templates instead of YAML as it would remove any need for expression languages.

Could you show an example of what you have in mind? In my opinion one of the nice things about this library is that fixture generation is terse. Sure, I could use plain Doctrine Fixtures, but the end result tends to be complex and difficult to update. So if PHP templates keeps to the same terseness, I'd be :+1: with that. Anything goes for me that would allow me to sometimes use function nesting without any surprises (such as https://github.com/nelmio/alice/issues/842).


I'd be interested in what other users of this library think of this as well.

theofidry commented 5 years ago

Sure, sorry I didn't do that yesterday, I had to give it a bit more thought & time:

<?php

use Is\Bundle\PlanBundle\Entity\Event;
use Nelmio\Alice\Alice;

return [
  Event::class => [
    'foo1' => [
      'title' => Alice::faker()->sentence(3),
      'show' => Alice::reference('@show_*'),
      'startDateTime' => Alice::faker()->dateTimeBetween('-1 month', '+4 month'),
      'isDraft' => false,
      'version' => Alice::optional(10, Alice::reference('@version_*')),
      '__calls' => [
        'setRevenue (25%?)' => [Alice::faker()->moneyBetween(10000, 300000)],
      ],
    ],
    'foo2' => new Foo(),
  ],
];

There are three immediate advantages here:

kgilden commented 5 years ago

Honestly I'd be cool with both directions. As long as it would be possible to add custom extensions that in turn are dependent on other services (i.e. Symfony DI).

However, I'm a bit worried that perhaps it becomes more verbose and gives too much "power". I like the fact that the current YAML syntax limits developers from writing long complex code and keeps the focus more on relationships between fixtures.

theofidry commented 5 years ago

As long as it would be possible to add custom extensions that in turn are dependent on other services (i.e. Symfony DI)

I don't think this would be too difficult and I agree it's a requirement: HautelookAliceBundle depends on it as well.

However, I'm a bit worried that perhaps it becomes more verbose and gives too much "power"

That's a risk, but I think it's ok. Right now the vast majority of the issues are about a lexing/parsing problem which can only be solved by this PR, and even so, people feel overburdened by this YAML syntax and by having to learn the alice DSL.

Also for the record, in 1.x & 2.x it was also possible to a certain extent (just not as discoverable).

vctls commented 5 years ago

Please excuse the perhaps not so relevant comment, but does this mean nested functions like <some_custom_provider(<numberBetween(1,5)>)> can't actually be parsed? I've tried all sorts of different syntaxes and it either evaluates numberBetween(1,5) as a string, or fails with the following error:

In ExpressionLanguageExceptionFactory.php line 59:
  The value "<numberBetween(1" contains an unclosed function.

All I found was this related issue hautelook/AliceBundle#327.

theofidry commented 5 years ago

They can be, and they are to a certain extent. It however relies on regexes, which is extremely flimsy.
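
To illustrate why pattern-based splitting is fragile with nested calls, here is a small sketch. It assumes nothing about alice's internals: it only shows that a naive comma split yields a fragment like the "<numberBetween(1" seen in the error above, while a scanner that tracks nesting depth does not. The function splitTopLevelArguments is hypothetical.

```php
<?php
// Hypothetical sketch, not alice's code: split an argument list on commas
// that are outside parentheses, brackets and quotes.
// Simplification: escaped quotes inside strings are not handled.
function splitTopLevelArguments(string $arguments): array
{
    $parts = [];
    $current = '';
    $depth = 0;
    $quote = null;

    $length = strlen($arguments);
    for ($i = 0; $i < $length; $i++) {
        $char = $arguments[$i];

        if ($quote !== null) {
            // Inside a quoted string: only the matching quote ends it.
            if ($char === $quote) {
                $quote = null;
            }
        } elseif ($char === '"' || $char === "'") {
            $quote = $char;
        } elseif ($char === '(' || $char === '[') {
            $depth++;
        } elseif ($char === ')' || $char === ']') {
            $depth--;
        } elseif ($char === ',' && $depth === 0) {
            // A top-level comma closes the current argument.
            $parts[] = trim($current);
            $current = '';
            continue;
        }

        $current .= $char;
    }

    if (trim($current) !== '') {
        $parts[] = trim($current);
    }

    return $parts;
}

$naive = explode(',', '<numberBetween(1,5)>, 10');
$aware = splitTopLevelArguments('<numberBetween(1,5)>, 10');
```

Here $naive starts with the broken fragment "<numberBetween(1", while $aware keeps "<numberBetween(1,5)>" intact as the first argument. A grammar-based parser handles this kind of nesting by construction.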

vctls commented 5 years ago

Is there a workaround? Anything with more than one argument seems to break with the same error. I tried escaping the comma and whatnot. Even the example given in the docs breaks:

App\Entity\Dummy:
    dummy:
        functionValue: '<strtolower("BAR")>'
        nestedFunctionValue: '<strtolower(<(implode(" ", ["HELLO", "WORLD", \<foo()>]))>)> \<bar()>'

In TolerantFixtureDenormalizer.php line 68: An error occurred while denormalizing the fixture "dummy" (App\Entity\Dummy): The value "<(implode(__ARG_TOKEN__7215ee9c7d9dc229d2921a40e899ec5f" contains an unclosed function.

Escaped expressions also fail:

'<strtolower(<(implode(["HELLO"]))>)> \<bar()>'

In TolerantFixtureDenormalizer.php line 68: An error occurred while denormalizing the fixture "dummy" (App\Entity\Dummy): Invalid token "\" found.

I'm using Alice 3.5.7, by the way.

theofidry commented 5 years ago

The easiest workaround is:

vctls commented 5 years ago

Ok, so these really don't work. Thank you for the clarification. I wanted to upgrade from Alice 2.3 and AliceBundle 1.4, and I have a lot of fixtures to change. I was thinking of cramming everything into the providers, but I'll just stick to the old versions for now.

theofidry commented 5 years ago

If your fixtures work it's fine, just be aware that barely 10% of alice 2.x is actually tested, so it relies a lot on luck... I think however https://github.com/nelmio/alice/issues/998 is the real solution that will make everyone happy tbh