google-code-export / smuto

Automatically exported from code.google.com/p/smuto

Broken movie descriptions on filmweb cause errors in scraper #3

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Scrape "Niedokonczone Zycie"
2. Scrape "Inland Empire"

What is the expected output? What do you see instead?
The full movie description is expected, but it is cut off too early. For example,
for "Niedokonczone Zycie" it contains only the phrase "Stary ranczer, Einar Gilkyson (".

What version of the product are you using? On what operating system?
svn 17

Please provide any additional information below.
The truncated description is caused by a malformed movie description on filmweb:
the text contains HTML-encoded tags, which the scraper does not remove because
the tag markers are encoded as character entities.

My solution is to add the 'fixchars="1"' option to each expression used for
selecting the description. This way the 'bad tags' are decoded first and then
removed by the next regex expression.
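
To illustrate the idea outside the scraper, here is a minimal Python sketch of the same two-step order: decode the entities first, then strip the tags. It is not the scraper's code or the attached patch; the function name and the sample string are made up for illustration, and the tag pattern is deliberately simple.

```python
import html
import re

# Simple pattern for anything that looks like an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def clean_description(raw: str) -> str:
    """Decode HTML entities, then strip tags (hypothetical helper)."""
    # Step 1: turn "&lt;b&gt;" into "<b>" so the tag becomes visible.
    decoded = html.unescape(raw)
    # Step 2: remove every tag, including the ones just decoded.
    without_tags = TAG_RE.sub("", decoded)
    # Collapse leftover whitespace runs into single spaces.
    return re.sub(r"\s+", " ", without_tags).strip()


# Made-up sample text, not taken from filmweb.
sample = "Old rancher (&lt;a href=&quot;/person&quot;&gt;Some Actor&lt;/a&gt;) lives alone."
print(clean_description(sample))
```

If the tags were stripped before decoding, the encoded markers would not match the tag pattern and would stay in the text, which is why the order of the two steps matters.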

Patch attached.

Original issue reported on code.google.com by andrzej....@gmail.com on 19 Oct 2010 at 8:11


GoogleCodeExporter commented 9 years ago
I applied your suggestions with small modifications. Thanks a lot.

Original comment by smuto.pr...@gmail.com on 27 Oct 2010 at 2:20