niryariv / opentaba-server

BSD 3-Clause "New" or "Revised" License

Scraping does not work because mmi html is invalid #40

Closed florpor closed 10 years ago

florpor commented 10 years ago

I was playing around with the scraping code and couldn't figure out why it's not working. Apparently mmi's IturTabot page currently serves HTML containing a closing tr tag with no matching opening tag. When BeautifulSoup parses the page with lxml as the parser, it reaches that point and from there on just closes whatever tags are still open, leaving the information tables out of the returned object. Then, when we search for a table of class highLines, we take the first result, but there are no results. The error in the heroku logs:

2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: checking gush 30727
2013-12-08T23:08:39.661518+00:00 app[scheduler.7645]: http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true
2013-12-08T23:09:11.482607+00:00 app[scheduler.7645]: HTML new, inserting data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: [2013-12-08 23:09] ERROR: horse: IndexError: list index out of range
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]: Traceback (most recent call last):
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/.heroku/python/lib/python2.7/site-packages/rq/worker.py", line 393, in perform_job
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     rv = job.perform()
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/.heroku/python/lib/python2.7/site-packages/rq/job.py", line 328, in perform
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     self._result = self.func(*self.args, **self.kwargs)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/tools/scrapelib.py", line 139, in scrape_gush
2013-12-08T23:09:11.537331+00:00 app[scheduler.7645]: [2013-12-08 23:09] INFO: worker: *** Listening on high, default, low...
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     data = extract_data(html)
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:   File "/app/tools/scrapelib.py", line 49, in extract_data
2013-12-08T23:09:11.518180+00:00 app[scheduler.7645]:     table = s("table", "highLines")[0]
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: IndexError: list index out of range
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: 
2013-12-08T23:09:11.518482+00:00 app[scheduler.7645]: [2013-12-08 23:09] DEBUG: horse: Invoking exception handler <bound method Worker.move_to_failed_queue of <rq.worker.Worker object at 0x19c7290>>
2013-12-08T23:09:11.518779+00:00 app[scheduler.7645]: [2013-12-08 23:09] WARNING: horse: Moving job to failed queue.

Seems like a solution could be to change the parser BeautifulSoup uses from lxml to html5lib. It seems to work for me so far; still looking into it. Right now, though, no new data is being fetched.
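
A minimal sketch of the parser difference, assuming BeautifulSoup 4 with both the lxml and html5lib backends installed; the markup below is a simplified stand-in for the mmi page, not the real thing:

    from bs4 import BeautifulSoup

    # Simplified stand-in for the broken page: a stray closing </tr> appears
    # before the table the scraper actually needs.
    page = '<html><body></tr><table class="highLines"><tr><td>plan</td></tr></table></body></html>'

    for parser in ("lxml", "html5lib"):
        soup = BeautifulSoup(page, parser)
        tables = soup("table", "highLines")  # same lookup as extract_data
        print("%s: %d table(s)" % (parser, len(tables)))

    # With the lxml backend this can come back empty (hence the IndexError on
    # [0] in the traceback above), while html5lib still recovers the table;
    # the exact behaviour seems to depend on the lxml version in use.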

alonisser commented 10 years ago

A government site with non-compliant HTML? Can't be...

Mor - great work locating the problem. As for the solution:

As mentioned here, html5lib is very slow; maybe there is a way around this. Did you ask in the BeautifulSoup Google group? Another way might be downloading the HTML, fixing the stray </tr> with some regex or text replacement, and then parsing with lxml.
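
A rough sketch of that text-level workaround, assuming the defect is the stray closing </tr> florpor describes (repair_html and the sample string are illustrative, not project code); a plain string scan is used here, but a regex would do the same job:

    from bs4 import BeautifulSoup

    def repair_html(page):
        # Drop a closing </tr> that appears before any <tr> has been opened,
        # then let lxml handle the rest of the page as usual.
        first_open = page.find("<tr")
        first_close = page.find("</tr>")
        if first_close != -1 and (first_open == -1 or first_close < first_open):
            page = page[:first_close] + page[first_close + len("</tr>"):]
        return page

    broken = '<html><body></tr><table class="highLines"><tr><td>plan</td></tr></table></body></html>'
    soup = BeautifulSoup(repair_html(broken), "lxml")
    print("%d table(s) found" % len(soup("table", "highLines")))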


niryariv commented 10 years ago

Are you certain that's the issue? I've noticed that the URLs have been returning the .asp file source instead of the HTML - seems like a MIME issue on their servers.

I see it both on the URLs we use (e.g. http://mmi.gov.il/IturTabot/taba2.asp?Gush=30727&fromTaba1=true ) and when trying to use the MMI site in the browser (i.e. http://mmi.gov.il/IturTabot/taba1.asp ). I assumed that was what's stopping the scraping, but since I only discovered it on Saturday I thought they might fix it in a couple of days... apparently not.

florpor commented 10 years ago

I guess I picked a bad gush from the log, and now I can't get the logs again without my computer... Anyway, I'm sure that some URLs work: http://mmi.gov.il/IturTabot/taba2.asp?Gush=360&fromTaba1=true (it's one in Ashkelon, I think). I'll look into the Jerusalem gushim tomorrow night...

@alonisser - yes, html5lib is much slower than lxml, but it's much more flexible and tolerant of format errors. Considering that it takes 5-10 seconds just to connect to the mmi web server, I think the slowdown shouldn't bother us, and it's better than regexing around this one error, which might appear again in other parts of the page later on and require more fixing.

Still gotta make sure this is really what's happening on prod. Will let you know.

shevron commented 10 years ago

I've seen that too, and I believe it only happens on some gushim pages and indicates a server-side crash. Not sure what we can do about it.

As for lxml vs html5lib, I believe lxml has some kind of "html" mode which should be more tolerant. I'm not sure whether it can be used here or whether it makes any difference (perhaps it is the mode BeautifulSoup uses in the first place). In any case I agree with @florpor that the performance impact will most likely be negligible, since this is not an online process and the parsing time is marginal compared to the time it takes to get the data from the MMI servers.
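
For reference, the lxml "html mode" mentioned above would be lxml.html, which sits on libxml2's recovering HTML parser; a quick way to check how it handles the stray tag (the sample markup is a made-up stand-in, and BeautifulSoup's "lxml" backend already uses the same underlying parser, so this may well behave identically):

    from lxml import html

    broken = '<html><body></tr><table class="highLines"><tr><td>plan</td></tr></table></body></html>'
    doc = html.fromstring(broken)

    # find_class searches by CSS class, like the highLines lookup in scrapelib
    tables = doc.find_class("highLines")
    print("%d table(s) recovered" % len(tables))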


niryariv commented 10 years ago

I agree performance here is a much lower priority than handling the HTML.

@florpor are you certain the issue is caused by the missing <tr>? I get the same error when trying to scrape gush 30027 - which returns the bad .asp I mentioned above - so I assumed it's just because the HTML output didn't have the table the code is looking for.

Did you try downloading the gush 360 HTML and seeing if the code parses it?
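
For a quick local check along those lines - fetch the gush 360 page and push it straight through the extractor. This assumes tools/scrapelib.py is importable from the repo root, that extract_data takes the raw HTML string and returns a list of plans (as the traceback suggests), and that requests is available for the ad-hoc download:

    import requests
    from tools.scrapelib import extract_data

    url = "http://mmi.gov.il/IturTabot/taba2.asp?Gush=360&fromTaba1=true"
    resp = requests.get(url, timeout=60)
    print("HTTP %d, %d bytes" % (resp.status_code, len(resp.content)))

    try:
        plans = extract_data(resp.text)
        print("parsed %d plan(s)" % len(plans))
    except IndexError:
        # Same failure mode as on production: no highLines table was found.
        print("no highLines table in the parsed document")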

alonisser commented 10 years ago

I think our main problem is that we are doing exploratory debugging instead of writing proper granular unit tests for the parser. I guess that if we had written those, with granular cases for the types of malformed HTML, MIME issues, etc., we would already have the answer to what goes wrong and could solve it or write a try/except around it. Not blaming anyone (as you know, testing is part of my responsibility in this project); I just think we won't know for sure without this.
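
A sketch of the kind of granular parser tests being described here, with one saved HTML fixture per failure mode; the fixture file names and the expected results are assumptions for illustration, not existing project code:

    import unittest
    from tools.scrapelib import extract_data

    class ExtractDataTestCase(unittest.TestCase):
        def _fixture(self, name):
            with open("tests/fixtures/%s" % name) as f:
                return f.read()

        def test_well_formed_gush_page(self):
            plans = extract_data(self._fixture("gush_360.html"))
            self.assertTrue(len(plans) > 0)

        def test_stray_closing_tr(self):
            # The malformed markup behind this issue should still yield plans
            # (or fail with a clear error), not an IndexError.
            plans = extract_data(self._fixture("gush_stray_tr.html"))
            self.assertTrue(len(plans) > 0)

        def test_asp_error_page(self):
            # mmi sometimes returns raw .asp source instead of HTML; whether
            # this should give an empty list or an explicit error is a design
            # decision - asserted as empty here just for the sketch.
            self.assertEqual(extract_data(self._fixture("asp_error.html")), [])

    if __name__ == "__main__":
        unittest.main()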


niryariv commented 10 years ago

Completely agree - we should have a full test suite for the parser, and have it run once a day (or however often we'll be parsing). Another great task for the Hackathon, which with the snow we're having now I'm pretty sure I won't be attending ;)


florpor commented 10 years ago

Sorry for the downtime... So yeah, it's kind of hard to prove that this is actually what happens on our production, because the gushim I got from the logs all either have duplicate HTML (not updated since the last scrape), which is checked before the parsing, or give the index error because the site returns an error page (we got raw asp code - nice job, mmi!). I tried about 30 of them before I gave up.

On my system I have a slightly different lxml version than the one in requirements.txt (mine is 3.2.4 as opposed to 3.2.3), but I just ran the (slightly modified) code against a gush that is a duplicate according to the heroku logs (already scraped - number 30649), and I do get the error with lxml and not with html5lib. Agreed about the tests - we could really start at the hackathon.

alonisser commented 10 years ago

OK. @florpor - can you compile some URLs, with and without problems, that we can download to build the specific test cases? We don't need the full website, just specific cases of malformed HTML.

florpor commented 10 years ago

Apparently it only happens on my system... the reason is still unknown. @alonisser started writing some parse tests and already merged them. I think this bug can be closed.

alonisser commented 10 years ago

We also found out that the mmi site is crashing on every gush that has more than 10 plans; Mor opened an "issue" with them. We still need to find out whether there are plans that do appear on mmi and don't crash the site, but still don't show up in our scraper results.