CPAN Pull Request Challenge: Pull Request Preview: Tidying Up (Requesting Comments)

bambams commented 9 years ago

Hello,

As mentioned in the other issue I have been assigned this CPAN module as a pull request challenge assignment for January. I have begun attempting to tidy things up where I can. The work is currently a work in progress in my wip/tidy branch, which I am overwriting regularly (push -f).

A snapshot has been pushed to preview/tidy/1 (https://github.com/bambams/p5-Plack-Session-State-URI/tree/preview/tidy/1) which I will leave intact for your review (it is static). Please review it and let me know if there are any changes you like, changes you hate, and if there's some common ground to be met. I'd really appreciate you helping to steer me in the correct direction so my time isn't wasted and neither is yours. If there are things you want, but others you don't, let me know and I'll try to do the work of rebasing the branch until you're happy with it. Thanks.

s-aska commented 9 years ago

Hello

I was sure that there is no problem in this pull-req :)

I will be happy to merge.

Thanks.

bambams commented 9 years ago

Thanks. I will try to finalize something and submit an official pull request when I'm ready. Or you can feel free to merge anytime you wish and I'll work around the upstream history.

I am also considering replacing the regex <input> injection with a proper HTML parser. In theory, one of the <form> attributes could contain > and break the regex... I just am not familiar with any CPAN HTML parsers and I'm having trouble finding an appropriate one to use.

s-aska commented 9 years ago

Is not recommended because there is influence on Performance Using HTML parser. I will fix if you give me fail to the test case.

bambams commented 9 years ago

Yes, performance was a consideration of mine as well. In my experience, parsing SGML based dialects is faster than you'd imagine. I have implemented Web applications that parse complex, user uploaded HTML data (albeit, with .NET) and the performance hit is negligible. I'd imagine that would be true in this case as well.

My concern is that it isn't possible to parse all possible HTML markup with just a regular expression. It will come down to whether it has to always work or whether performance matters more than working every time.

I will try to put together some test cases that break the regular expression and we can go from there. :)

If you're sure you want to keep the regular expression parsing I was thinking that perhaps we could add a configurable branch on the HTML transformation to choose the desired implementation. Depending on how popular the module is, we can either leave the legacy behavior as default and require a constructor argument to enable a proper HTML parser for users that need or want it; or visa-versa. E.g.,

Plack::Session::State::URI->new(session_key => 'sid', html_parser => 1);
    or
Plack::Session::State::URI->new(session_key => 'sid', html_regex => 1);

That way the user can do what is necessary, and we can choose an appropriate default for the users that don't care. The performance hit will be optional. :) I'll try to figure out the tests and create some nasty HTML samples. :)

s-aska commented 9 years ago

html_parser => 1 It is a good idea!!

I am using HTML::StickyQuery（HTML::Parser） for <a href="...">.

https://github.com/s-aska/p5-Plack-Session-State-URI/blob/master/lib/Plack/Session/State/URI.pm#L46

It might be best to together <form> and <a>.

Performance than now to improve :)

bambams commented 9 years ago

I see that HTML::StickyQuery is meant to add parameters to all <a> tag hyperlinks, and it is based on HTML::Parser. However, it appears that HTML::StickyQuery doesn't provide any hooks into the parsing process so that we could add <form> too. It seems that HTML::FillInForm might do exactly what we need for the form, and it too uses HTML::Parser. We just have to figure out how to use both without having to parse twice. Perhaps it will require sending patches upstream to these two other modules. Studying the code of each, it looks to me like the base HTML::Parser just calls a given method for start tags, end tags, comments, text, etc., and these process the markup data and concatenate to the $self->{output} string.

Firstly, it sounds very inefficient to concatenate onto the output string that many times (it would probably create a ton of temporary strings, whereas a smarter, lower-level interface to Perl strings could concatenate them all at once). That seems like it might be a design flaw of HTML::Parser, but I'm not familiar with the alternatives in Perl (e.g., in .NET you have a StringBuilder class).

Secondly, this makes it difficult to combine both effects without parsing twice. Instead of manipulating the tag text through an API, allowing multiple processes to have their turn, they just assume that they own the entire parser and modify the output string. Perhaps it would work though if we implemented a wrapper that checked the tag name and dispatched to each type of parser's methods... E.g., (this is a big hack just to demonstrate a POSSIBLE idea)

package HTML::StickyParams;

use parent qw/HTML::Parser/;

use HTML::FillInForm;
use HTML::StickyQuery;

sub new {
    my ($class) = @_;

    my $fif = HTML::FillInForm->new( ... );
    my $stq = HTML::StickyQuery->new( ... );
    my $self = bless { %$fif, %$stq }, $class;

    $self->{fif_} = bless $self, ref $fif;
    $self->{stq_} = bless $self, ref $stq;

    return $self;
}

sub fif_ { shift->{fif_} }
sub stq_ { shift->{stq_} }

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;

    my $child = do {
        if($tag_name eq 'form') {
            $self->fif_;
        } elsif ($tag_name eq 'a') {
            $self->stq_;
        }
    } or return;

    $child->start(@_);
}

In theory something like that could work, but it would be a huge hack that could break when the internals of either package changed, and it would also depend on merging the state of HTML::FillInForm and HTML::StickyQuery into our own HTML::Parser subclass and hoping that nothing collides...

A cleaner approach might be to patch both packages, split their methods (e.g., start) into two: one that describes the work to do, and another that actually does it, preferrably through a standard method that translates some kind of object representation into the actual changes. But that's quite a lot of effort for this little bit. It might be easier to just copy both modules and merge them ourselves into a new package that doesn't depend on either. Then we'd control the code and they couldn't break it on us. The downside is duplicated code...

I hope that there's a much easier way that I'm missing...

s-aska commented 9 years ago

I think that it is good that approach.

s-aska / p5-Plack-Session-State-URI

CPAN Pull Request Challenge: Pull Request Preview: Tidying Up (Requesting Comments) #3