parse efficiently ordered file paths to build a tree

mathiasverraes commented 4 years ago

Reported on Twitter https://twitter.com/Wasquen/status/1276906414833758208?s=20

Awesome! Good job! I'm working on a small library using Parsica I'd like to open source.

I'm facing some performance issues.

How would you parse efficiently ordered file paths to build a tree?

[
  "/a/b/c/file1",
  "/a/b/c/file2",
  "/a/b/c/file3",
  "/a/b/file4"
]

I'd like to avoid to parse "/a/b/c/" many times.

It's hard to do anything dodgy because I don't have access to remainder/input.

mathiasverraes commented 4 years ago

See the commit above, that shows how to parse the path. But I'm sure you figured that part out yourself.

I'm not sure what you want to achieve. Are you trying to make the parser somehow understand that parts of the path are repeated?

grifx commented 4 years ago

Thank you for your responsiveness!

To provide you with more context. I'm building an open source utility to normalize a flat map using a small DSL. I'm currently using in production a version of this utility I wrote using imperative code. I thought it would be nice to build an AST Tree using the Parsica library that I use to normalize the data.

I already managed to build a single branch of the AST tree by parsing a single key ("$.a.b.c.d").

[
  {
    "id@int": "123",
    "name": "Brian",
    "car.color":"red",
    "car.type": "SUV",
    "children[childId].id": "1",
    "children[childId].name": "Tom",
    "_childId": "1",
  },
  {
    "id@int": "123",
    "name": "Brian",
    "car.color":"red",
    "car.type": "SUV",
    "children[childId].id": "2",
    "children[childId].name": "Harry",
    "_childId": "2",
  }
]

=>

{
  "id": 123,
  "name": "Brian",
  "car": {"color":"red", "type": "SUV"},
  "children" [{"id": "1", "name": "Tom"}, {"id": "2", "name": "Harry"}]
}

I'm not sure what you want to achieve. Are you trying to make the parser somehow understand that parts of the path are repeated?

That's correct, I would like to "skip" the parsing of repeated strings

[
  "/a/b/c/file1",
  " **/a/b/c/** file2",
  " **/a/b/c/** file3",
  " **/a/b/** file4"
]

I would like to introduce a ParsingContext to my parser factory to provide me with cachedParsingSteps:

function parserBoundToContext(ParsingContext $context) {
    $resolveFromCache = fn($input) $this->resolveFromCache($input, $context);

    $parser = (
        // resolved from $context->cachedParsingSteps
        string("/a/b/c/")->map($resolveFromCache)
        ->or(string("/a/b/")->map($resolveFromCache))
        ->or(string("/a/")->map($resolveFromCache))
        ->or(nothing())
    );
    // ...
}

Whenever a folder is parsed, I would like to enrich the cachedParsingSteps with the input (the whole input since the start -- could be generated from the whole input minus the remainder) and its result.

Ideas

Provide both the value and the remainder in Parse::map($value, $remainder). We could use reflection to know if the param $remainder is declared.
Maybe we can create a cache($parser, $context) parser wrapper that would do something like: (startCacheNothing($context)->followedBy($parser)->thenIgnore(endCacheNothing($context)))->map(enrichContextWithValue($context))

I hope all this makes sense to you.

Cheers,

mathiasverraes commented 4 years ago

Some thoughts:

You do have access to input, output, and remainder during the parsing process right now, but only by writing your own combinator. This is the general pattern

<?php
function yourCombinator(Parser $parser, $context) : Parser {
    return Parser::make(function (string $input) use ($parser, $context) : ParseResult {
             $result = $parser->run($input);
             if($result->isSuccess()) {
                 // You now have access to $result->output() and ->remainder(),
                 // as well as the original $input, the original $parser, and your $context
             }
             // Return a ParseResult
        }
    });
}

2.The above could be replaced by something similar to map. I don't want change the definition of map because that would conflict with the common definition of map in FP languages. But it could be something like map2(Parser, fn($output, $remainder) : ParseResult) : Parser

Having side effect when something is parsed (such as updating a cache) could be done with events. emit(Parser, fn($output):void) : Parser. In other words, it's a combinator that returns a Parser that behaves exactly like the original parser, but performs side effects in function.

<?php
$cache = new Cache();
$addToCache = fn($output) => $cache->add($output);
$parser = emit(char('a'), $addToCache); // If we successfully parse 'a', add 'a' to cache.

I've added emit() in the commit above. (It's not in the main branch yet.) I'm not sure about the map2 thing yet, but you could easily make that yourself by adding a map2 to both Parser and ParseResult.

I'm still a bit unclear as to what exactly the point is of what you're trying to do, but let me know if these things help.

grifx commented 4 years ago

I think I found a way to achieve what I initially wanted to do. I'll try to implement it and close this PR by the end of this week.

--

Do you think we should add this utility?

function lazy(callable /*: () => Parser */ $parserFactory): Parser
{
    return Parser::make(function (string $input) use ($parserFactory): ParseResult {
        $parser = $parserFactory();

        return $parser->run($input);
    });
}

$parser = lazy(fn() => string('joris'));

mathiasverraes commented 4 years ago

Good to hear you fixed it. The lazy combinator was suggested by someone on Twitter as well. I'm a bit hesitant to add it, because it forces the user to think about performance. I'm hoping the avoid that. We have some ideas that we want to try, but there are a number of intermediate steps we need first. So I don't want to add performance features before we can measure it and exhaust other options.

grifx commented 4 years ago

I finished implementing the parser using Parsica. The library is now working as expected : ).

Some notes:

Unexpected behaviour when using or() (used under the hood by either) with emit().

either(
emit(
    success(),
    fn() => var_dump("I expect to be triggered")
),
emit(
    success(),
    fn() => var_dump("I do not expect to be triggered")
)
)->try('test');

My use-case: I'm trying to parse from the cache (sometimes successful) "or" using the root parser (always successful). I don't expect both events to be emitted.

return either(
    $context->preflightCacheParser(),
    $root
)->followedBy($rest);

This is what allows me to skip the repeated parts:

[
  "/a/b/c/file1",
  " **/a/b/c/** file2",
  " **/a/b/c/** file3",
  " **/a/b/** file4"
]

As you already know, it works when replacing the current implementation of or with the commented implementation. Perhaps it's another reason, beyond performance, to get rid of the current implementation. "or()" is used by quite a lot of utils that would have to be rewritten, I hope the default behaviour will change.

When using emit(), the $receiver does not have access to: $output, $input and $remainder. My use case requires both $output and $remainder. I wouldn't know how to implement it with the current emit(). Saying that, unlike the or() behaviour, it's very easy to go around the problem by creating a new emit function.
The Persica implementation of the Parser is very slow. This utility is now roughly 10 times slower than its previous version. Of course, we have to take in consideration that it's much easier to maintain and an extra layer of cache would result in similar performance. Unfortunately, in some environments, cross-request caching isn't possible.

    /**
     * @test
     */
    public function it_should_parse_under_100_ms()
    {
        $propertyName = atLeastOne(alphaNumChar());

        $type = emit(
            either(
                eof(),
                char('@')
                    ->followedBy($propertyName)
                    ->thenIgnore(eof()),
            ),
            function () {}
        );

        $map = emit(
            char('.')->followedBy($propertyName),
            function () {}
        );

        $list = emit(
            between(
                char('['),
                char(']'),
                either(
                    char('@')
                        ->followedBy($propertyName)
                        ->map(fn($value) => [
                            'discriminatorName' => $value,
                            'keepKeys'          => true
                        ]),
                    $propertyName
                        ->map(fn($value) => [
                            'discriminatorName' => $value,
                            'keepKeys'          => false
                        ]),
                )
            ),
            function () {}
        );

        $root = emit(
            char('$'),
            function () {}
        );

        $rest = many(any($map, $list))->followedBy($type);

        $parser = either(
            failure(), // $context->preflightCacheParser(),
            $root
        )->followedBy($rest);

        $start = microtime(true);
        for ($i = 0; $i < 500; $i++) {
            $parser->try('$.q.w[@1].e[2]@int');
        }
        $end = microtime(true);

        $this->assertLessThan(0.1, $end - $start);
    }

Thanks again for your work!

mathiasverraes commented 4 years ago

This commit shows that the problem with either is fixed: https://github.com/mathiasverraes/parsica/commit/75761c77dfb5934b9d914f3ce7106bccee06811b

(It's actually fixed as a consequence of something else, we swapped the implementation of or to make sure it only executes the second parser if the first one failed.)

This change is in the main branch and will be in 0.4. There are lots of breaking changes so beware.

mathiasverraes commented 4 years ago

I've also added your other test. Would be possible to share your original parser? That way, we can define the test as a comparison, where the Parsica version must perform at most x% slower than the original one.

mathiasverraes commented 4 years ago

BTW I'm sure you know this, but if xdebug is on, it makes that test about 7x slower on my machine. It's still too slow, of course. I'm hoping to focus on performance after I get v0.4 out the door.

grifx commented 4 years ago

if xdebug is on, it makes that test about 7x slower on my machine. It's still too slow, of course.

Good point, I probably had it on.

--

The original parser isn't open-sourced and does much more than parsing. I'll try to implement another Parser using imperative code by the end of this week 🤞 so we can have a proper comparison.

grifx commented 4 years ago

FYI (without xdebug)

Imperative Current draft: https://gist.github.com/grifx/1efe84852f2e4dd867793d66149d152b Time: 00:00.046, Memory: 6.00 MB Parsica Time: 00:00.672, Memory: 6.00 MB

        $context = new ParsingContext();
        for ($i = 0; $i < 1000; $i++) {
            $context->startParsingKey(0, '$.qwerty.qwerty[@qwerty].qwerty[qwerty]@qwerty');
            $this->imperativeKeyParser->try('$.qwerty.qwerty[@qwerty].qwerty[qwerty]@qwerty', $context);
            $context->endParsingKey();
            $context->clean();
        }

        $context = new ParsingContext();
        for ($i = 0; $i < 1000; $i++) {
            $context->startParsingKey(0, '$.qwerty.qwerty[@qwerty].qwerty[qwerty]@qwerty');
            $this->parsicaKeyParser->try('$.qwerty.qwerty[@qwerty].qwerty[qwerty]@qwerty', $context);
            $context->endParsingKey();
            $context->clean();
        }

I'll try to clean, extract and share the unit test during the week.

Note: They are not behaving exactly the same way. For instance, the imperative parser deals with char escaping: $.www\.gooo\\\.oooogle\.com. I haven't thought about how to do it with Parsica yet.

mathiasverraes commented 4 years ago

The new JSON parser can escape characters in string literals: https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L214

mathiasverraes commented 4 years ago

Should we close this? I'm not sure if there's anything actionable right now.

grifx commented 4 years ago

Sorry for the late reply.

Yes, let's close this issue.

I'll release the library using the imperative parser and open a PR to introduce Parsica.

Thanks for your help!

grifx commented 4 years ago

@mathiasverraes FYI, I cannot close this issue since I did not open it.

parsica-php / parsica

parse efficiently ordered file paths to build a tree #6