mozilla / fathom

A framework for extracting meaning from web pages
http://mozilla.github.io/fathom/
Mozilla Public License 2.0
1.97k stars 76 forks source link

Adds changes needed to get vectorization to "work" on Windows. #244

Closed danielhertenstein closed 4 years ago

danielhertenstein commented 4 years ago

I do run into an error every time during vectorization though. Regardless of how many samples, threads, delay, I always hit a Error: can't access property "browser", tab is null during vectorization. I don't think I've hit this before. I want to get this over to @erikrose to see if the changes I made to get autovec working on Windows cause the same problems on Mac.

danielhertenstein commented 4 years ago

Interestingly this now fails on Linux while working on Windows. I'll get WSL put together so I can dig into it.

erikrose commented 4 years ago

Yes, it fails on the Mac too. The Ruleset menu on the Vectorizer page is empty. Autovectorization reports Configuring Vectorizer...Error: Couldn't find trainee ID "body" in your rulesets., but I can see it more readily by running fathom-fox.

erikrose commented 4 years ago

Getting rid of shell=True fixes it.

erikrose commented 4 years ago

The breakage is during the call to rollup.

erikrose commented 4 years ago

Here's the problem: when you set shell=True, args is expected to be a string, not an array of args. So, changing the line to this works:

return subprocess.run(' '.join(args), cwd=cwd, shell=True, capture_output=True, check=True)

Of course, that would require some quoting to be safe, and quoting is not my favorite. Is the need for the shell due to the need to interpret a .cmd file?

danielhertenstein commented 4 years ago

I'm looking into the need of shell right now.

erikrose commented 4 years ago

I wonder if you could explicitly say cmd geckodriver.cmd or something.

danielhertenstein commented 4 years ago

One of the three cases requires getting the JS file instead of the .cmd, so I only addressed the other two in order to keep the factored out function simple.