getText() returns text other drivers does not

minkphp / MinkBrowserKitDriver

Symfony2 BrowserKit driver for Mink framework

MIT License


getText() returns text other drivers does not #153

Open alexpott opened 4 years ago

alexpott commented 4 years ago

\Behat\Mink\Driver\BrowserKitDriver::getText() returns text from the head section and also any JSON on the page that is contained in a script tag in the HTML body. \Behat\Mink\Driver\Selenium2Driver::getText(), for example, does not return text from the head section or from script tags in the body section. Given that the Mink documentation states:

getText() will strip tags and unprinted characters out of the response, including newlines. So it’ll basically return the text that the user sees on the page.

I'm not sure if this is a Symfony\DomCrawler issue or not.

For a discussion of the effects of this, see https://www.drupal.org/project/drupal/issues/3175718

jonathanjfshaw commented 4 years ago

DomCrawler simply uses the textContent property of PHP's DOMNode: https://www.php.net/manual/en/class.domnode.php#domnode.props.textcontent, which implements the W3C spec: https://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030226/DOM3-Core.html#core-ID-1312295772

alexpott commented 4 years ago

@jonathanjfshaw yep, and it returns what document.body.textContent returns in the browser console. The point is that this is not what \Behat\Mink\Driver\Selenium2Driver::getText() returns, and it returns content that is not visible.
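
To make the behaviour discussed above concrete, here is a minimal sketch using only PHP's built-in DOM extension (not Mink or DomCrawler themselves). The HTML string and variable names are made up for illustration; the point is only that textContent concatenates all descendant text nodes, including the title and the script body.

```php
<?php
// Hypothetical page: a title in <head> and JSON in a <script> inside <body>.
$html = '<html><head><title>Page title</title></head>'
      . '<body><p>Visible text</p>'
      . '<script type="application/json">{"secret":"value"}</script>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// textContent joins every descendant text node, whether a browser
// would render it or not.
$all = $doc->documentElement->textContent;

// Report whether the non-visible pieces ended up in the extracted text.
var_dump(strpos($all, 'Page title') !== false);
var_dump(strpos($all, '"secret"') !== false);
```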

aik099 commented 4 years ago

I see no issue here.

The Selenium driver talks to a real browser and can ask it to return only the text visible to a user. BrowserKit, being a headless driver, only looks at the HTML tags and parses them as best it can. Stripping all the HTML tags therefore leaves their content in place, which produces the effect you're getting.

@alexpott, I recommend calling the getText method on the BODY NodeElement (a PHP class in Mink) of the document, not on the whole document. This way you won't get any extra stuff (at least I hope so).

The code below (untested) is how I would get the contents of a document:

$body_text = $session->getPage()->find('xpath', '//body')->getText();

alexpott commented 3 years ago

@aik099 body can contain script tags. Adding script tags just before closing the body tag is often advocated for performance reasons.
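
One possible workaround for script tags inside the body, sketched with PHP's built-in DOM extension only: remove script and style elements before reading the body's textContent. The function name visibleBodyText is hypothetical and is not part of Mink or BrowserKit. It also only approximates "what the user sees", since it knows nothing about CSS-hidden elements, noscript, and similar cases.

```php
<?php
// Hypothetical helper, not part of Mink or BrowserKit.
function visibleBodyText(string $html): string
{
    $doc = new DOMDocument();
    // Suppress parser warnings that imperfect real-world markup can trigger.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Copy the matches into an array before removing them from the tree.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $xpath->query('//body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collapse whitespace runs (including newlines) into single spaces.
    return trim((string) preg_replace('/\s+/u', ' ', $body->textContent));
}

$html = '<html><body><p>Hello</p><script>var data = {"a": 1};</script></body></html>';
echo visibleBodyText($html), "\n";
```

The trade-off is that this works on the raw HTML string rather than on a Mink NodeElement, so in a test suite it would have to be fed the page source rather than called on the element itself.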