janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

Exclude Script tags #7

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Everything seems OK, except that the extractor often includes JavaScript code 
from many sites, including the one used in the demo code. I think this 
can be prevented by removing <script> tags during the SAX parsing stage.

Google Analytics tracker code is extracted as content on many websites.
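
To illustrate the suggestion above, here is a minimal sketch of dropping <script> content at the parsing stage. It is not boilerpipe's code: it uses Python's standard event-driven `html.parser` as a stand-in for a SAX-style parser, and the names `ScriptSkippingExtractor` and `extract_text` are hypothetical, chosen only for this example. The sample page and its tracker snippet are made up as well.

```python
from html.parser import HTMLParser


class ScriptSkippingExtractor(HTMLParser):
    """Collects text nodes and ignores everything inside <script> elements.

    Hypothetical helper for illustration; not part of boilerpipe.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._script_depth = 0  # greater than 0 while inside a <script> element
        self._chunks = []       # text fragments found outside <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._script_depth > 0:
            self._script_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not part of a script body.
        stripped = data.strip()
        if self._script_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of `html`, with script bodies removed."""
    parser = ScriptSkippingExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    # Made-up page with an analytics-style snippet between two paragraphs.
    page = (
        "<html><body><p>Real article text.</p>"
        "<script type='text/javascript'>var _gaq = _gaq || [];</script>"
        "<p>More text.</p></body></html>"
    )
    print(extract_text(page))
```

The depth counter is the whole idea: character events are discarded while the parser is inside a <script> element, so tracker code never reaches the later boilerplate-detection steps.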

You could improve this by using Readable's algorithm: http://readable-app.appspot.com/

Original issue reported on code.google.com by ahmetalp...@gmail.com on 23 Aug 2010 at 7:31

GoogleCodeExporter commented 9 years ago
This should actually already be excluded. 
Can you please give an example page?

Thanks,
Christian

Original comment by ckkohl79 on 14 Oct 2010 at 8:35

GoogleCodeExporter commented 9 years ago

Original comment by ckkohl79 on 14 Oct 2010 at 8:45

GoogleCodeExporter commented 9 years ago
Works for me, and there has been no response from the initial submitter. Closing as "Invalid".

Original comment by ckkohl79 on 2 Nov 2010 at 3:38