janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

ImageExtractor doesn't detect alternative images for Object plugins #36

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
When using the new ImageExtractor, <img/> tags placed as alternative content 
inside <object/> tags (commonly used by Flash-based video players) are not 
detected.

It's quite common practice to embed a video player like this:

<object type="application/x-shockwave-flash">
    <param name="movie" value='my.swf'/>
    <param name="quality" value="high"/>
    <param name="allowScriptAccess" value="always"/>
    <param name="allowFullScreen" value="true"/>
    <param name="wmode" value="opaque"/>
    <img src='1328528982826.jpg' alt='yes an alt' title='and a title'/>
    <p>some alternative content</p>
</object>

What is the expected output? What do you see instead?
These images should be detected as well.

To detect these images, it may be enough to comment out the following line in ImageExtractor.java:
//TAG_ACTIONS.put("OBJECT", TA_IGNORABLE_ELEMENT);
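
To illustrate why that single line matters, here is a minimal, self-contained sketch of the mechanism the report implies: when a tag is treated as ignorable, everything nested inside it is skipped, including fallback <img/> tags. This is not boilerpipe's code. The class ObjectFallbackDemo and the method imageSources are hypothetical names, and a simple regex scan stands in for the real HTML parsing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: NOT boilerpipe's ImageExtractor. All names here are hypothetical.
public class ObjectFallbackDemo {
    // group 1 = "/" for a closing tag, group 2 = tag name, group 3 = raw attribute text
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");
    // group 2 = value of a quoted src attribute
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*(['\"])(.*?)\\1");

    /** Collects img src values, skipping everything nested inside an ignorable tag. */
    public static List<String> imageSources(String html, Set<String> ignorableTags) {
        List<String> found = new ArrayList<>();
        int ignoreDepth = 0; // > 0 while we are inside at least one ignorable element
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            boolean closing = !tag.group(1).isEmpty();
            String name = tag.group(2).toUpperCase(Locale.ROOT);
            String attrs = tag.group(3);
            boolean selfClosing = attrs.trim().endsWith("/");

            if (ignorableTags.contains(name)) {
                if (closing) {
                    ignoreDepth = Math.max(0, ignoreDepth - 1);
                } else if (!selfClosing) {
                    ignoreDepth++;
                }
                continue;
            }
            if (!closing && name.equals("IMG") && ignoreDepth == 0) {
                Matcher src = SRC.matcher(attrs);
                if (src.find()) {
                    found.add(src.group(2));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<object type=\"application/x-shockwave-flash\">"
                + "<param name=\"movie\" value='my.swf'/>"
                + "<img src='1328528982826.jpg' alt='yes an alt' title='and a title'/>"
                + "<p>some alternative content</p>"
                + "</object>";

        // OBJECT treated as ignorable: the fallback image is skipped.
        System.out.println(imageSources(html, Collections.singleton("OBJECT"))); // prints []
        // OBJECT no longer ignorable: the fallback image is found.
        System.out.println(imageSources(html, Collections.<String>emptySet())); // prints [1328528982826.jpg]
    }
}
```

The second call corresponds to commenting out the OBJECT entry: with no ignorable action registered for the tag, the nested image is visited like any other.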

Original issue reported on code.google.com by xavi.beu...@gmail.com on 6 Feb 2012 at 3:28

GoogleCodeExporter commented 9 years ago
Thanks! Fixed in r166.

Original comment by ckkohl79 on 21 Mar 2012 at 9:12