indix / web-auto-extractor

Automatically extracts structured information from webpages
MIT License
108 stars, 30 forks

when parsing "a" or "link" tag, check existence of href attribute #8

Closed andypang closed 8 years ago

andypang commented 8 years ago

I ran into some HTML that used a 'content' attribute on a 'link' tag rather than 'href'. This caused parsing to bail, because the code assumes 'href' exists whenever the tag is 'a' or 'link'.

A simple check that 'href' exists seems like the right fix. When a 'content' attribute is used instead, the tag's content is parsed correctly. If the URL information is missing altogether, we return null.
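As a rough illustration of the check described above, here is a minimal Python sketch. It is not the project's code: the function name `get_prop_value`, the attribute dictionary, and the order in which 'href' and 'content' are tried are all assumptions made for the example.

```python
from typing import Mapping, Optional

# Tags whose value is normally taken from the href attribute.
URL_TAGS = {"a", "link"}


def get_prop_value(tag_name: str, attrs: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: pick the value of a tag from its attributes.

    Assumed precedence: href (for 'a'/'link' only), then content, else None.
    """
    # Only read href when it is actually present, instead of assuming
    # every 'a' or 'link' tag carries one.
    if tag_name in URL_TAGS and "href" in attrs:
        return attrs["href"].strip()
    # Fall back to the content attribute when one is given.
    if "content" in attrs:
        return attrs["content"].strip()
    # No URL information at all: return None (null in the original wording).
    return None
```

With this shape, a 'link' tag that has only a 'content' attribute yields that content, and an 'a' or 'link' tag with neither attribute yields `None` instead of raising an error.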

Test updated to include this case.

andypang commented 8 years ago

@Vasanth-Indix The test already contains three other 'link' tags that have an 'href' attribute, so the link-with-href case is covered. Thanks.

andypang commented 8 years ago

@Vasanth-Indix Let me know if you still need additional test cases.

Vasanth-Indix commented 8 years ago

Not required, @andypang. It looks good. I'm merging.