taylordotfish / librecaptcha

A free/libre interface for solving reCAPTCHA challenges
GNU General Public License v3.0

dependencies, JavaScript parsing, alternatives #20

Open pabs3 opened 3 years ago

pabs3 commented 3 years ago

I was trying to package esprima-python for Debian when I noticed that it is not well maintained: its tests fail, and the number of failures has been growing with new Python releases. slimit does not seem to be well maintained either; I guess that is why you wanted to switch away from it. There is another translation of esprima to Python called pyjsparser, but unfortunately it is also unmaintained.

Most of the JavaScript parsers I can find come in these categories:

1) not well maintained Python libraries (esprima-python/slimit/pyjsparser)

2) well (or not) maintained but written in JavaScript (esprima/etc)

3) aimed at executing JS in browsers and elsewhere (duktape, v8, mozjs, spidermonkey, etc.): well maintained but written in C/C++, not aimed at exporting parsing details, usually lacking good AST support, and often either without Python bindings or awkward to use outside browsers.

4) aimed at programming language tools and editors (things like antlr, tree-sitter and the Language Server Protocol ecosystem). I don't know much about these; some of them may be well supported. The tree-sitter Python bindings don't seem to be packaged for Debian yet, nor are there many tree-sitter grammars in Debian, although one exists for JS.

I got a bit overwhelmed by all the complexity needed to parse JS from Python, so I started wondering about a different way to do it. I noticed that librecaptcha only extracts strings from the JavaScript AST of one of the reCAPTCHA JS files, and that it only uses strings that start with a certain prefix. Then I looked at the reCAPTCHA JS file that librecaptcha parses and noticed two things:

The strings that are needed should be just as easily extracted using a regex instead of JS parsing and AST walking.

One of the strings is constructed from two constant strings and a function call, so the AST walking method is likely to sometimes get the wrong information unless that case is handled at some point. The regex approach will of course also get the wrong information, and it has no way of doing any better. Improving the AST walking method to handle this seems like it would be very hard, though.
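
To make the regex idea concrete, here is a minimal sketch. It is not librecaptcha's code: the `"Select all"` prefix, the function name `extract_strings`, and the sample input are assumptions for illustration, and the real prefix and file contents may differ. The pattern only knows about quoted literals with backslash escapes, so comments, regex literals and template literals would confuse it.

```python
import re

# Matches one single- or double-quoted JavaScript string literal,
# allowing backslash escapes inside it. This is a deliberately naive
# pattern: it has no notion of comments or template literals.
STRING_RE = re.compile(
    r'"((?:[^"\\\n]|\\.)*)"'   # double-quoted literal
    r"|'((?:[^'\\\n]|\\.)*)'"  # single-quoted literal
)


def extract_strings(js_source, prefix):
    """Return the raw (still escaped) bodies of literals starting with prefix."""
    found = []
    for match in STRING_RE.finditer(js_source):
        # Exactly one of the two groups is set, depending on the quote style.
        body = match.group(1) if match.group(1) is not None else match.group(2)
        if body.startswith(prefix):
            found.append(body)
    return found


# Hypothetical input: one plain literal, one string built by concatenation.
sample = 'a("Select all images with cars"); b("Select all " + f(1) + " squares");'
print(extract_strings(sample, "Select all"))
# prints ['Select all images with cars', 'Select all ']
```

The second result shows the limitation described above: for a string assembled from constants and a function call, only the constant fragment is recovered.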

Any thoughts?

pabs3 commented 3 years ago

Looking more closely at how the extracted strings are used, and at the js-beautify output of the JavaScript that is being parsed, I can see that nothing other than string extraction from parsed JavaScript is really going to work, since the IDs are in separate strings and there can be multiple IDs for each "Select all ..." string.
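
As a rough illustration of what string extraction from a parsed AST looks like, the sketch below walks an ESTree-style tree represented as plain dicts and lists and yields string literals in source order, so that a caller could pair ID strings with the instruction that follows them. The function name `iter_string_literals`, the hand-built tree, and the `ID_1`/`ID_2` placeholders are all hypothetical; this is not how librecaptcha is implemented.

```python
def iter_string_literals(node):
    """Yield the values of string Literal nodes in an ESTree-style tree.

    The tree is assumed to be made of dicts (nodes) and lists (child
    sequences), as produced when an ESTree AST is serialized to JSON.
    """
    if isinstance(node, dict):
        if node.get("type") == "Literal" and isinstance(node.get("value"), str):
            yield node["value"]
        # Recurse into every field; non-container values are ignored below.
        for child in node.values():
            yield from iter_string_literals(child)
    elif isinstance(node, list):
        for item in node:
            yield from iter_string_literals(item)


# Hypothetical AST for something like: ["ID_1", "ID_2", 3, "Select all images with cars"]
ast = {
    "type": "Program",
    "body": [{
        "type": "ExpressionStatement",
        "expression": {
            "type": "ArrayExpression",
            "elements": [
                {"type": "Literal", "value": "ID_1"},
                {"type": "Literal", "value": "ID_2"},
                {"type": "Literal", "value": 3},  # numeric literal, skipped
                {"type": "Literal", "value": "Select all images with cars"},
            ],
        },
    }],
}
print(list(iter_string_literals(ast)))
# prints ['ID_1', 'ID_2', 'Select all images with cars']
```

Because the walk preserves order, several ID strings preceding one instruction string stay adjacent in the output, which a flat regex match over a prefix would not give you.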

pabs3 commented 3 years ago

Looking at the js-beautify output of the JavaScript that is being parsed, I also see some strings that don't match "Select all ..." but look like reCAPTCHA instructions. Some of these don't follow the pattern of having an ID string before them. Also, some of the "Select all ..." strings have an ID in the form of A[1] before them instead of an ID string.

pabs3 commented 3 years ago

Also, some of the ID strings have instructions in the form of d[1] rather than "Select all ..." strings.

pabs3 commented 3 years ago

To solve the issues with ID strings and JS parsing, you could copy all the strings into the librecaptcha codebase and curate them a bit, but they are possibly copyrightable so that wouldn't be a good idea.

pabs3 commented 3 years ago

So the current JS parsing approach is the least worst approach. None of the other JS parser options seem to be suitable at this time, so probably this bug should just be closed?

I am likely to keep using the slimit module with the Debian librecaptcha package, because while both slimit and esprima-python are unmaintained, esprima-python seems to regress more with newer versions of Python.

taylordotfish commented 3 years ago

Yeah, Python JavaScript parsing is not in the best state.

I'm somewhat surprised that esprima-python has so many problems. Do you know what the causes of some of the regressions are? I thought Python usually tries to be at least halfway backwards compatible.

The only other idea I have is to use the original JavaScript esprima (which esprima-python is based on) as a subprocess, but that also doesn't seem like a great solution, as it adds complexity.
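
A minimal sketch of what the subprocess idea could look like, assuming `node` is on the PATH and the JavaScript `esprima` package is installed where Node can find it. The helper name `parse_with_subprocess`, the inline Node script, and the timeout are assumptions for illustration, not an existing librecaptcha interface.

```python
import json
import subprocess

# Node script: read JavaScript from stdin, print the esprima AST as JSON.
NODE_SCRIPT = """
const esprima = require("esprima");
const source = require("fs").readFileSync(0, "utf8");
process.stdout.write(JSON.stringify(esprima.parseScript(source)));
"""


def parse_with_subprocess(js_source, command=("node", "-e", NODE_SCRIPT), timeout=30):
    """Run an external parser and return its JSON output as Python objects.

    `command` must read JavaScript on stdin and write a JSON AST to stdout.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        list(command),
        input=js_source,
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        check=True,
    )
    return json.loads(result.stdout)
```

Keeping the command as a parameter means a distribution could point it at whichever esprima build it ships, at the cost of the extra moving parts mentioned above.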

pabs3 commented 3 years ago

I haven't had the time/inclination to look at the esprima-python issues. From memory, there were some AST output differences.

I note that the JavaScript esprima (and a fork of it by Facebook) are both in Debian already, so a subprocess would be fine by me, although it would indeed add quite a bit of complexity, especially for people installing librecaptcha through PyPI instead of Debian.

-- bye, pabs

https://bonedaddy.net/pabs3/

taylordotfish commented 3 years ago

Well, given that there's already dual support for slimit and esprima-python, I suppose switching that to esprima-python and esprima wouldn't be too bad.

pabs3 commented 3 years ago

Since this is an area where there isn't much maintenance and it is unpredictable which dependencies will be better maintained at any one time and which are available in distros, I'd keep slimit support and probably add several alternatives for JavaScript parsing dependencies.
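
One way to keep several alternatives around is to try each backend in turn and use the first one that imports. The sketch below assumes the import names `esprima`, `slimit` and `pyjsparser` for the three Python libraries mentioned in this thread; the preference order and the function name `pick_backend` are hypothetical.

```python
import importlib

# Candidate parser modules, in an assumed order of preference.
CANDIDATE_BACKENDS = ("esprima", "slimit", "pyjsparser")


def pick_backend(candidates=CANDIDATE_BACKENDS):
    """Return (name, module) for the first candidate that can be imported."""
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed here; try the next alternative
        return name, module
    raise ImportError(
        "no JavaScript parser backend available; tried: " + ", ".join(candidates)
    )
```

Each backend would still need its own small adapter to turn its AST into the strings librecaptcha wants, since the libraries do not share an API.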
