psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

Strange error only when reading URLs from file #185

Closed Hrxn closed 6 years ago

Hrxn commented 6 years ago

Hey, I've just been experimenting a bit with requests-html. It works as described so far, but I've encountered a strange issue that doesn't make any sense to me.

Before I go on, I have basically zero experience with Python, so please bear that in mind.. 😅

The use case I was testing: scraping info from Instagram, one of those 'modern' websites ("web apps") that serve and render all content via JavaScript, which makes my old-school parsing scripts pretty useless. But that's what html.render() is for, I suppose.

So, the thing is, this code works as expected (test_a.py):

from requests_html import HTMLSession
session = HTMLSession()

output_csv = 'URL' + ';' + 'Account Name' + ';' + 'Displayed Full Name' + ';' + 'Submitted Posts' + ';' + 'Followers'

sel_handle = '#react-root > section > main > div > header > section > div:nth-of-type(1) > h1'
sel_iposts = '#react-root > section > main > div > header > section > ul > li:nth-child(1) > span'
sel_follow = '#react-root > section > main > div > header > section > ul > li:nth-child(2) > span'
sel_dsname = '#react-root > section > main > div > header > section > div:nth-of-type(2) > h1'

URLSET = [
    "https://www.instagram.com/taylorswift/",
    "https://www.instagram.com/justinbieber/",
    "https://www.instagram.com/adele/",
    "https://www.instagram.com/onedirection/",
]

print(output_csv)

for entry in URLSET:
    r = session.get(entry)
    r.html.render()
    handle = r.html.find(sel_handle, first=True).text
    iposts = r.html.find(sel_iposts, first=True).text
    follow = r.html.find(sel_follow, first=True).text
    dsname = r.html.find(sel_dsname, first=True).text
    output = entry + ';' + handle + ';' + dsname + ';' + iposts + ';' + follow
    print(output)

But this code does not work, although it should do exactly the same thing (test_b.py):

from requests_html import HTMLSession
session = HTMLSession()

output_csv = 'URL' + ';' + 'Account Name' + ';' + 'Displayed Full Name' + ';' + 'Submitted Posts' + ';' + 'Followers'

sel_handle = '#react-root > section > main > div > header > section > div:nth-of-type(1) > h1'
sel_iposts = '#react-root > section > main > div > header > section > ul > li:nth-child(1) > span'
sel_follow = '#react-root > section > main > div > header > section > ul > li:nth-child(2) > span'
sel_dsname = '#react-root > section > main > div > header > section > div:nth-of-type(2) > h1'

try:
    url_set = open('example_accs.txt')
except OSError:
    raise SystemExit('[Error] Could not open file "example_accs.txt" ...')

print(output_csv)

for entry in url_set:
    r = session.get(entry)
    r.html.render()
    handle = r.html.find(sel_handle, first=True).text
    iposts = r.html.find(sel_iposts, first=True).text
    follow = r.html.find(sel_follow, first=True).text
    dsname = r.html.find(sel_dsname, first=True).text
    output = entry + ';' + handle + ';' + dsname + ';' + iposts + ';' + follow
    print(output)

The only difference is that in test_a.py I use URLs that are stored in a list inside the source code, while in test_b.py I load those exact same URLs from an external text file.

For completeness, here is the content of example_accs.txt

https://www.instagram.com/taylorswift/
https://www.instagram.com/justinbieber/
https://www.instagram.com/adele/
https://www.instagram.com/onedirection/

Here is the error I get with test_b.py:

PS D:\Test> python test_b.py
URL;Account Name;Displayed Full Name;Submitted Posts;Followers
Traceback (most recent call last):
  File "test_b.py", line 21, in <module>
    handle = r.html.find(sel_handle, first=True).text
AttributeError: 'NoneType' object has no attribute 'text'

I understand that html.find returns None because the selector did not find anything. But why is this only happening in test_b.py?
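
As a side note, the AttributeError itself can be avoided by checking the result before reading .text. The following is only a minimal sketch: text_or_default is a hypothetical helper (not part of requests-html), and it relies only on find(..., first=True) returning None when nothing matches, as the traceback above shows.

```python
def text_or_default(element, default=''):
    """Return element.text, or `default` when the selector matched nothing.

    `element` is whatever r.html.find(selector, first=True) returned.
    This helper is hypothetical and not part of requests-html.
    """
    if element is None:
        return default
    return element.text


# Intended usage inside the loop (not executed here, it needs network access):
#   handle = text_or_default(r.html.find(sel_handle, first=True), default='?')
```

This does not explain why the selector fails, but it keeps the script running so the remaining URLs can still be inspected.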

I tried to debug this a bit, of course, and the only relevant thing I've found so far is this, I think:

Fiddling with some print(r.html) inside the loops, I saw this:

PS D:\Test> python .\test_a.py
URL;Account Name;Displayed Full Name;Submitted Posts;Followers
<HTML url='https://www.instagram.com/taylorswift/'>
https://www.instagram.com/taylorswift/;taylorswift;Taylor Swift;144 posts;108m followers
<HTML url='https://www.instagram.com/justinbieber/'>
https://www.instagram.com/justinbieber/;justinbieber;Justin Bieber;4,345 posts;99.7m followers
<HTML url='https://www.instagram.com/adele/'>
https://www.instagram.com/adele/;adele;Adele;339 posts;32.8m followers
<HTML url='https://www.instagram.com/onedirection/'>
https://www.instagram.com/onedirection/;onedirection;One Direction;726 posts;17.1m followers

vs.

PS D:\Test> python test_b.py
URL;Account Name;Displayed Full Name;Submitted Posts;Followers
<HTML url='https://www.instagram.com/taylorswift/%0A'>
<HTML url='https://www.instagram.com/justinbieber/%0A'>
<HTML url='https://www.instagram.com/adele/%0A'>
<HTML url='https://www.instagram.com/onedirection/%0A'>

There is a difference in the url attribute (%0A), and I don't know where it comes from. It looks like some line-ending shenanigans, but if I do this:

for entry in url_set:
    print(entry)

Everything looks normal... (I found out that Python automatically adds a \n when using print, but this isn't the issue itself.) For example, this:

for entry in url_set:
    print(entry, end='')

... looks just as expected.
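
A plain print() cannot reveal a trailing newline, because the newline is simply printed as a line break. A small sketch of how it can be made visible with repr(), using io.StringIO as a stand-in for the opened file (an assumption, so the snippet runs without example_accs.txt). It also shows that percent-encoding a newline gives %0A, which matches the url attribute above; presumably the URL is percent-encoded somewhere before the request is sent.

```python
import io
from urllib.parse import quote

# Stand-in for open('example_accs.txt'): a file-like object with two lines.
url_set = io.StringIO(
    "https://www.instagram.com/taylorswift/\n"
    "https://www.instagram.com/adele/\n"
)

lines = list(url_set)

for entry in lines:
    # repr() makes the trailing newline visible as \n.
    print(repr(entry))

# Percent-encoding a newline character yields %0A.
encoded_newline = quote("\n")
print(encoded_newline)
```

Iterating over a text file yields each line including its line terminator, which is why every entry still carries a \n here.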

Okay, just the usual background info. Please let me know if any relevant info is missing:

Python: 3.6.5 x64
OS: Windows 10 x64 v1803
Packages: All latest via pip

negresit commented 6 years ago

Try this in test_b.py instead of your url_set line: url_set = open('example_accs.txt', encoding='utf-8')

Hrxn commented 6 years ago

Thanks for your response!

Unfortunately, still the same error.

PS D:\Test> python test_b.py
URL;Account Name;Displayed Full Name;Submitted Posts;Followers
Traceback (most recent call last):
  File "test_b.py", line 21, in <module>
    handle = r.html.find(sel_handle, first=True).text
AttributeError: 'NoneType' object has no attribute 'text'

Hrxn commented 6 years ago

I think I've figured it out...

Here's a gist with a working version: https://gist.github.com/Hrxn/6befd33a59735117e678de4990efd758

What I basically changed is only this:

open('ig_accs.txt', mode='rt', encoding='utf-8', errors='strict', newline=None)

newline=None enables "universal newlines" mode in Python (it is also the default), and:

for line in url_set:
    entry = line.replace('\n', '')
    r = session.get(entry)
    r.html.render()
  [...]

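So it was actually caused by the line termination: each line read from the file still ends with \n, which entry = line.replace('\n', '') now removes. requests-html can't handle a \n in the string representing a URL (it ends up as %0A in the url attribute shown above), but such a string is technically not a correct URL, so I guess this is kind of expected.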
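
A slightly more defensive variant of the same idea is sketched below, under a few assumptions: read_urls and the sample file name are hypothetical, and str.strip() is swapped in for replace('\n', '') so that stray spaces and blank lines are handled as well. The sample file is written with Windows-style \r\n endings to show that universal newlines mode translates them to \n but does not remove the \n from each line.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of `path` with surrounding whitespace removed."""
    with open(path, mode='rt', encoding='utf-8', newline=None) as handle:
        return [line.strip() for line in handle if line.strip()]


# Write a small sample file with Windows-style \r\n line endings.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_accs.txt')
    with open(sample_path, 'wb') as sample:
        sample.write(b"https://www.instagram.com/taylorswift/\r\n"
                     b"\r\n"
                     b"https://www.instagram.com/adele/\r\n")

    # Universal newlines turn \r\n into \n, but the \n stays on each line.
    with open(sample_path, encoding='utf-8', newline=None) as raw:
        raw_lines = list(raw)

    urls = read_urls(sample_path)

print(raw_lines)
print(urls)
```

The returned list can then be used in place of URLSET from test_a.py, e.g. for entry in read_urls('example_accs.txt').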

I don't know why I didn't notice this during testing. Another Python test script I made didn't have this problem either, but I think that could be because I used a regex on the input.