pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License

test html file does not include \r #85

Closed thatKennedy closed 4 years ago

thatKennedy commented 4 years ago

I was finding some inconsistency between running my tests and running scrapy from the command line.

For some reason, when I ran the spider using the scrapy command, the text from the website always began with \r. But when running the tests, the html file did not include these \r characters.

It was easy to fix this problem - just used lstrip('\r') on the text and now it is all consistent. But it is a little strange that there is this difference.
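A minimal sketch of that kind of workaround, in case it helps anyone else who hits this. The helper name `normalize_newlines` is made up for illustration and is not part of the spider. Note that `lstrip('\r')` only removes carriage returns at the very start of the string, so normalizing every line ending is a slightly broader variant of the same idea:

```python
def normalize_newlines(text: str) -> str:
    # Hypothetical helper: turn Windows (CR LF) and bare CR line endings into LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Text as it might arrive from the live site vs. from the saved test file.
live_text = "\r\n\tPublic Meeting Agendas"
saved_text = "\n\tPublic Meeting Agendas"

# lstrip('\r') handles the leading carriage return only.
assert live_text.lstrip("\r") == saved_text

# Normalizing makes both sources compare equal wherever the \r appears.
assert normalize_newlines(live_text) == normalize_newlines(saved_text)
```

Either approach makes the spider see the same text from the command line and from the tests; which one fits depends on whether \r can show up anywhere other than the start of the string.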

ben-nathanson commented 4 years ago

@thatKennedy What was the website? I'm going to see if I can recreate the issue and probe a solution. I'm assuming pa_utility?

thatKennedy commented 4 years ago

Thanks for checking it out. It was the site for pa_utility: http://www.puc.pa.gov/about_puc/public_meeting_calendar/public_meeting_audio_summaries_.aspx


ben-nathanson commented 4 years ago

I was able to recreate this issue by generating the test file from scratch and then copying over your test script and spider script.

tl;dr This is a result of line-ending differences between Windows and Unix/Linux systems.

I ran scrapy shell on that URL and found that the page uses utf-8 and is served by a Windows ASP.NET server. Windows ends lines with \r\n, whereas macOS and Linux use just \n. I'm assuming you're on Linux or macOS and that when the test file was saved to your machine, the Windows line endings were translated into Unix style. Scrapy doesn't appear to do this, though: when we run the spider with Scrapy, the \r characters are still present. I didn't get a chance to look at it byte by byte, but this seems the likely reason Scrapy and Pytest are telling different stories.
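As a sketch of one way this kind of translation can happen (an assumption on my part, not a confirmed diagnosis of this repo): Python's built-in `open()` in text mode translates \r\n to \n on read by default, while passing `newline=""` keeps the original line endings. The file name below is hypothetical.

```python
import os
import tempfile

# Bytes roughly as a Windows server would send them, with CR LF line endings.
raw = b"<h1>\r\n\tPublic Meeting Agendas</h1>\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pa_utility.html")  # hypothetical test file name
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode (newline=None): \r\n is translated to \n on read.
    with open(path, encoding="utf-8") as f:
        translated = f.read()

    # newline="" turns the translation off, so the \r characters survive.
    with open(path, encoding="utf-8", newline="") as f:
        preserved = f.read()

print(repr(translated))
print(repr(preserved))
```

If the test helper reads the saved html in default text mode, the \r would be dropped there even if the file on disk still has it, which would match what the tests see.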

Here's what it looked like in Scrapy Shell:

<article id="content">\r\n\t\t
    <h1>\r\n\tPublic Meeting Agendas\xa0</h1>\r\n
    <p>\r\n\t<a href="/about_puc/pm_videos.aspx" target="_blank">Public Meeting videos</a> are now available on the website.\xa0Video clips of a public meeting are not considered official transcripts. Official transcripts are available from the court reporting service.</p>\r\n
    <p>\r\n\tTo watch the Public Meetings live, you can watch the <a href="/about_puc/live_streaming_video.aspx" target="blank">streaming video</a> of Public Meeting.</p>\r\n
...

And that same blurb as it appears in the test file:

<article id='content'>
                <h1>
    Public Meeting Agendas&nbsp;</h1>
<p>
    <a href="/about_puc/pm_videos.aspx" target="_blank">Public Meeting videos</a> are now available on the website.&nbsp;Video clips of a public meeting are not considered official transcripts. Official transcripts are available from the court reporting service.</p>
<p>
    To watch the Public Meetings live, you can watch the <a href="/about_puc/live_streaming_video.aspx" target="blank">streaming video</a> of Public Meeting.</p>
...

I'm going to close this since we can't write our own operating system, but I will add this in the new docs.

Thanks!

Ben