hoaproject / File

The Hoa\File library.
https://hoa-project.net/
323 stars 20 forks source link

Add glob support in the finder #22

Closed shulard closed 8 years ago

shulard commented 9 years ago

Fix #18.

With that update if you specify a path like "/emp" or "/{a,b,c}.php", it will be processed nicely.

Example:

<?php
$finder = (new Finder())->in(["/*emp", "/{a,b,c}*.php"]);

Does not use the Iterator\Glob implementation here because it does not have the GLOB_ONLYDIR filter. It can be replaced with a combination of Iterator\CallbackFilter and Iterator\Glob but it seems more complex which is not suitable here...

Hywan commented 9 years ago

@shulard Why GLOB_ONLYDIR is required?

Hywan commented 9 years ago

@shulard How performances are impacted? The fact we are applying glob in all cases makes me unconfortable. Is it possible to detect a glob pattern when using in and doing the appropriated computations here?

shulard commented 9 years ago

I used the GLOB_ONLYDIR because glob list all items by default (files and directories). The path is used to build a Iterator\Recursive\Directory which only handles directories. It's iterator's goal to retrieve all the items that must be listed.

shulard commented 9 years ago

@Hywan maybe we can add a pattern to avoid applying glob on each path. I though that glob default behaviour is to handle that pattern. It always return an array with the matching paths (with dynamic patterns or not). So it's the same as applying a pattern before using it for me :smile:

Also when glob will be able to handle more pattern cases, we can use it directly...

But I don't know about the performance... For me :

shulard commented 8 years ago

Hello,

With my last comment, do you think we really need to apply a glob pattern on each path before calling glob?

Hywan commented 8 years ago

Can you compare the performances between GlobIterator + FilterIterator vs. RegexIterator vs. glob? I reckon RegexIterator may be faster than GlobIterator and a “glob2pcre” algorithm is not very complicated at all.

shulard commented 8 years ago

Hello @Hywan,

Of course, I can compare performance and I'll submit the bench here. Yes the "glob2pcre" algorithm is not very complicated and it allow us to rely only on Iterators...

shulard commented 8 years ago

Hello @Hywan,

I've worked on this PR today and I've commited the benchmark code here: https://github.com/shulard/hoa-file-glob-benchmark

I've used the Hoa\Bench lib for timing check and the result is that GlobIterator+CallBackFilter and glob+GLOB_ONLYDIR take the same amount of time.

I'm not particularly familiar with code benchmarking, if you have any ideas to improve the check I'm open to add it !

The test was run with a maxDepth of 5 on different directories :

I can't find a way to implement the RegexIterator method. If we want to use that, we need to crawl the folder and apply the regex generated so there are some problems :

This is a lot of useless computing because it require to develop the whole glob process...

What are you thinking about that update ?

Thanks Stéphane

Hywan commented 8 years ago

@shulard Nice work! See my comment.

shulard commented 8 years ago

@Hywan,

I've updated and rebased the code... Now the in method contains the glob application.

I've also benched the Finder object with a new test : Try to crawl a path without glob pattern inside The goal of this test is to show glob call overhead. The execution time is similar in all the 3 cases :

I've updated my repo to handle that case.

Hywan commented 8 years ago

I am assigning @Jir4 to this issue.

Jir4 commented 8 years ago

I'll take a look tomorrow

Jir4 commented 8 years ago

The PR is good for me, the code is simplier to understand with a simple glob use than with a GlobIteratoruse and the bench are similar.

Hywan commented 8 years ago

IRC logs can be interesting: https://botbot.me/freenode/hoaproject/2016-02-17/?msg=60227978&page=1.

Jir4 commented 8 years ago

So we have to test if we have to use glob or not. Is there a syntax to detect ?

Hywan commented 8 years ago

https://en.wikipedia.org/wiki/Glob_%28programming%29#Syntax

shulard commented 8 years ago

@Jir4, if you want I can add the syntax detection behaviour in the current code (or you want to work on it ?)...

Jir4 commented 8 years ago

If you can do it i think it's simplier

shulard commented 8 years ago

Of course, I'll work on it quickly!

shulard commented 8 years ago

I've used a basic regex to detect all the components of glob syntax: /\*|\?|\[([^\]]+)\]|\{([^}]+)\}/

I've added the braces because in PHP glob it is possible to use them... Maybe we prefer stick on the wikipedia syntax definition ?

I though that the GlobIterator in PHP is not able to handle the braces but it must be checked.

For the moment the implementation is only using the glob function but detect if it's required or not...

Jir4 commented 8 years ago

Can you run some bench ? It can be usefull to compare with and without the preg_match

shulard commented 8 years ago

Yep of course, I'll submit the results here!

shulard commented 8 years ago

I've updated my bench repo here: https://github.com/shulard/hoa-file-glob-benchmark

Now I check with glob, with Iterator\Glob, with detection+glob, with detection+Iterator\Glob:

glob              ||||||||||                                      1677ms,  22.0%
iterator          ||||||||||                                      1723ms,  22.6%
globnodetect      |||||||||||                                     1800ms,  23.6%
iteratornodetect  ||||||||||                                      1667ms,  21.9%

Execution time is really similar with the given path patterns :

All the objects found 42083 files on my computer. I ran the test multiple times, Iterator\Glob results are all consistant but glob can take about 100ms more on some calls... Maybe it's my computer, maybe it's the implementation...

I also confirmed that braces are not handled by the GlobIterator implementation in PHP (but it's not an official feature of glob...).

Hywan commented 8 years ago

:+1: What are your observations then?

shulard commented 8 years ago

I think that we need to use the GlobIterator which is more consistent. Then the brace support is not really important in our case (I think...).

shulard commented 8 years ago

Hello !

What's your POV here ? Do you think we need to use the GlobIterator too ? I can update the PR code with the final method during this week.

Hywan commented 8 years ago

I think GlobIterator is more consistent, but… why does it have less features? Today Hoa\Iterator\Glob extends GlobIterator from SPL (https://github.com/hoaproject/Iterator/blob/60bdefab8db17717871a11101dedec60572f95b8/Glob.php). You may want to ask on php-src internals why glob and GlobIterator do not provide the same features. Based on these answers, we will see what to do next.

Thoughts?

Hywan commented 8 years ago

And yes, sometimes it can be very annoying to develop inside Hoa because we have to triple-check everything and every details, but this is why the quality is high :-p. Also, it often happens that we find inconsistency in PHP API or between PHP VM (remember Hoa\Iterator\RegularExpression —and Hoa\Iterator\Recursive\RegularExpression— with https://github.com/facebook/hhvm/issues/3909 and https://bugs.php.net/bug.php?id=68128). I wonder if we should communicate about this…

shulard commented 8 years ago

@Hywan, no problem about the triple-check :smile:. This process allow me to learn a lot about quality process and reviews, it's awesome to be part of that !

I'll make an email to php-src internal about the difference betweens the 2 globs.

About the discovered inconsistencies, I think Hoa should communicate about it! It'll help move forward the different projects and also show that Hoa is really making deeper research about it's code.

shulard commented 8 years ago

Thread created on php-general mailing list: http://news.php.net/php.general/325485

shulard commented 8 years ago

I got an answer from PHP ML : http://news.php.net/php.general/325487

It seems that glob is a porting of the linux's glob feature. The GlobIterator is more strict regarding the glob pattern implementation and we saw that there is no overhead when using an Iterator combination.

Because the specific glob features are enabled with flags, the GlobIterator is just equivalent to a glob call with no flags.

shulard commented 8 years ago

Hello ! I've updated the implementation to use the Iterator. Is the ML reply enough to validate that choice ?

Hywan commented 8 years ago

@shulard Good work! Very proud of you. We can go for Hoa\Iterator\Glob, it's fine.

shulard commented 8 years ago

Thanks a lot for the review, I've updated the code with your feedback.

The regex pattern is really simpler :smile:.

Hywan commented 8 years ago

@Jir4 Did you find time for the review?

vonglasow commented 8 years ago

ping @Jir4

Hywan commented 8 years ago

I feel really sorry to ask for changes while you are waiting for so long… I will try to be more serious 😄.

shulard commented 8 years ago

Hello, don't worry update the delay 😄 the only important thing is to move ahead ! I'll review your comments asap...

Hywan commented 8 years ago

Ready for a final review before the merge?

shulard commented 8 years ago

Yep, I'm ready for that final review... I've checked CS and it seems that this repo hasn't been standardized for the moment... Code seems clean (just a warning about the @return void in constructor comments...).

Hywan commented 8 years ago

Thank you!

shulard commented 8 years ago

👍 thank you for your patience and your reviews 😄.

Hywan commented 8 years ago

Are you kidding? You're the patience guy, not me… Anyway, this is good work :-).

shulard commented 8 years ago

Maybe my english is not good enough 😄 I just want to thank you for all the feedback you made on that PR, it was not easy... Delay is not a real issue here ! A nice improvement in this lib 👍