Closed: Hrxn closed this 6 years ago
Try this in test_b.py instead of your url_set line: url_set = open('example_accs.txt', encoding='utf-8')
Thanks for your response!
Unfortunately, still the same error.
PS D:\Test> python test_b.py
URL;Account Name;Displayed Full Name;Submitted Posts;Followers
Traceback (most recent call last):
  File "test_b.py", line 21, in <module>
    handle = r.html.find(sel_handle, first=True).text
AttributeError: 'NoneType' object has no attribute 'text'
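For what it's worth, this AttributeError only says that the selector matched nothing, so find(..., first=True) handed back None. A small guard makes that case explicit instead of crashing. This is only a sketch: text_or_default is a made-up helper name, and FakeElement merely stands in for whatever object the lookup returns.

```python
from typing import Optional


class FakeElement:
    """Stand-in for an element object that exposes a .text attribute (assumption)."""

    def __init__(self, text: str) -> None:
        self.text = text


def text_or_default(element: Optional[object], default: str = "") -> str:
    """Return element.text, or `default` when the lookup returned None."""
    if element is None:
        return default
    return element.text


print(text_or_default(FakeElement("some_account")))
print(text_or_default(None, default="<not found>"))
```

This does not fix the underlying problem, it only turns the crash into a visible placeholder value.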
I think I've figured it out...
Here's a gist with a working version: https://gist.github.com/Hrxn/6befd33a59735117e678de4990efd758
What I basically changed is only this:
open('ig_accs.txt', mode='rt', encoding='utf-8', errors='strict', newline=None)
with newline=None to enable "universal newlines" mode in Python (which should also be the default), and..
for line in url_set:
    entry = line.replace('\n', '')
    r = session.get(entry)
    r.html.render()
    [...]
The key line is entry = line.replace('\n', '')
So it was actually caused by the line termination. requests-html can't handle a \n in the string representing a URL. A URL with a newline in it is not technically a valid URL, so I guess this is kind of expected.
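A minimal sketch of that fix, assuming a text file with one URL per line. The helper name read_urls and the example URLs are made up for illustration. It uses str.strip() rather than replace('\n', '') so that a stray \r or trailing space is removed as well.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file with surrounding whitespace removed."""
    urls = []
    # newline=None (the default) turns \r\n and \r into \n while reading.
    with open(path, mode="rt", encoding="utf-8", newline=None) as handle:
        for line in handle:
            entry = line.strip()  # drops the trailing '\n' that broke the request
            if entry:  # skip blank lines
                urls.append(entry)
    return urls


# Demo with a temporary file using Windows line endings (hypothetical URLs).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"https://www.instagram.com/example_one/\r\n"
              b"https://www.instagram.com/example_two/\r\n")
    tmp_path = tmp.name

print(read_urls(tmp_path))
os.remove(tmp_path)
```

Each entry returned this way can then be passed to session.get(entry) as in the loop above.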
I don't know why I did not notice this during testing. Another Python test script I made did not have this problem either, but I think that could be because I used a regex on the input.
Hey, I've just been experimenting a bit with requests-html, works as described so far, but I've encountered a strange issue that does not really seem to make any sense to me.
Before I go on, I have basically zero experience with Python, so please bear that in mind.. 😅
The use case I was testing is best described like this: scraping info from Instagram, one of those 'modern' web sites ("web apps") that serve and render all content via JavaScript, making my old-school parsing scripts pretty useless, but that's what html.render() is for, I suppose.
So, the thing is, this code here is working as expected:
test_a.py
But this code here does not work, although it should do the exact same thing:
test_b.py
The only difference is that in test_a.py I use URLs that are stored in a list inside the source code, while in test_b.py I load those exact same URLs from an external text file.
For completeness, here is the content of example_accs.txt
Here is the error I get with test_b.py:
I understand that html.find returns None, because the selector did not find anything. But why is this only happening in test_b.py?
I tried to debug a bit, of course, and the only relevant thing I've found so far is this, I think:
Fiddling with some print(r.html) inside the loops, I saw this:
vs.
There is a difference in the url attribute (%0A), and I don't know where that comes from. It looks like some line-ending shenanigans, but if I do this:
Everything looks normal... (I found out that Python automatically adds a \n when using print, but this isn't the issue itself, e.g.)
... looks just as expected.
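For reference, %0A is the percent-encoded form of a line feed (\n), which suggests that each line read from the file still carries its trailing newline. A small sketch of how to make that visible, using a made-up URL: repr() shows the character that print() hides, and urllib.parse.quote shows how it turns into %0A. How requests-html encodes the URL internally is not shown in this thread, so quote is only a stand-in here.

```python
from urllib.parse import quote

line = "https://www.instagram.com/example/\n"  # hypothetical line as read from a file

print(repr(line))              # the trailing \n is visible in the repr
print(quote(line, safe=":/"))  # the newline is percent-encoded as %0A
print(repr(line.strip()))      # stripped version without the newline
```

Printing repr(line) instead of line inside the loop is usually the quickest way to spot this kind of invisible character.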
Okay, just the usual background info. Please let me know if any relevant info is missing: