gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.38k stars 1.77k forks

handleOnXML tries to parse `.xlsx` files #790

Open theseanything opened 1 year ago

theseanything commented 1 year ago

The handleOnXML function attempts to parse responses with the content type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. This is because the function looks for any mention of xml in the content type. The result is a parse error when xmlquery.Parse() is called (for example: `encoding/xml.SyntaxError {Msg: "illegal character code U+0003", Line: 1}`).

XLSX files are packaged as a zip archive, so they can't be parsed directly as XML.

It would be ideal not to try to parse these files, possibly by being more explicit about which content types we consider to be XML.
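A minimal sketch of what a more explicit check could look like, using only the Go standard library. The helper name isXMLContentType is hypothetical and is not part of colly. The rule it applies (accept text/xml, application/xml, and any media type ending in +xml) is an assumption about what should count as XML, not the library's current behaviour.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isXMLContentType is a hypothetical helper (not part of colly).
// It reports whether a Content-Type header value names an XML media type,
// instead of matching any occurrence of the substring "xml".
func isXMLContentType(contentType string) bool {
	// ParseMediaType strips parameters such as "; charset=utf-8"
	// and returns the media type in lower case.
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	switch mediaType {
	case "text/xml", "application/xml":
		return true
	}
	// Structured syntax suffix, e.g. application/rss+xml.
	return strings.HasSuffix(mediaType, "+xml")
}

func main() {
	for _, ct := range []string{
		"text/xml; charset=utf-8",
		"application/rss+xml",
		"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	} {
		fmt.Println(ct, isXMLContentType(ct))
	}
}
```

With this rule the spreadsheet content type is rejected because it neither equals an XML media type nor ends in +xml, even though it contains the substring xml.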

theseanything commented 1 year ago

This doesn't only affect xlsx, but also docx, pptx, and similar document types.

theseanything commented 1 year ago

To add to this, it would be nice to be able to have more granularity over which XML is parsed. For example, we use an OnXML handler to follow links in an XML sitemap, but our site contains many SVGs (image/svg+xml) and RDFs (application/rdf+xml), which are also parsed unnecessarily.
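As an illustration of that kind of granularity, here is a rough sketch using only the Go standard library. It parses a body as a sitemap only when the content type is on an explicit allowlist, so SVG and RDF responses are skipped without being parsed. The names sitemapTypes and sitemapLinks are hypothetical, the allowlist contents are an assumption, and none of this is colly's API. Something like it could be called from a generic response callback as a workaround.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"mime"
	"strings"
)

// sitemapTypes is a hypothetical allowlist of media types worth parsing.
var sitemapTypes = map[string]bool{
	"text/xml":        true,
	"application/xml": true,
}

// urlSet mirrors the parts of a sitemap document needed to follow links.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// sitemapLinks is a hypothetical helper (not part of colly). It returns the
// <loc> values of a sitemap, or nil without parsing anything when the
// content type is not on the allowlist.
func sitemapLinks(contentType string, body []byte) ([]string, error) {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil || !sitemapTypes[mediaType] {
		return nil, nil // skipped: e.g. image/svg+xml, application/rdf+xml
	}
	var set urlSet
	if err := xml.Unmarshal(body, &set); err != nil {
		return nil, err
	}
	links := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		links = append(links, strings.TrimSpace(u.Loc))
	}
	return links, nil
}

func main() {
	body := []byte(`<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
</urlset>`)
	links, err := sitemapLinks("application/xml; charset=utf-8", body)
	fmt.Println(links, err)
}
```

The trade-off in this sketch is that the decision is made per handler rather than globally, so a crawler that does want to inspect SVGs could use a different allowlist for that handler.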