In transforms, a source's URL is currently fetched without specifying a user-agent header. This small PR adds .userAgent("Mozilla") to the line that fetches the URL's Document through the Jsoup connection. I hardcoded the value because the codebase does the same elsewhere. A possible improvement is to let the user-agent be specified in the config as part of the transform.
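As a rough illustration of the change, here is a minimal sketch rather than the project's actual code. The class name, the helper methods, and the "userAgent" config key are all hypothetical; only Jsoup.connect, Connection.userAgent, and Connection.get are real Jsoup calls. It also shows one way the config-based improvement mentioned above might look, falling back to the hardcoded "Mozilla" when nothing is configured.

```java
import java.io.IOException;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TransformFetchSketch {
    // Default mirrors the value hardcoded in this PR.
    static final String DEFAULT_USER_AGENT = "Mozilla";

    // Hypothetical: "userAgent" is an assumed config key, not one the project defines.
    static String resolveUserAgent(Map<String, String> transformConfig) {
        String configured = transformConfig.get("userAgent");
        return (configured == null || configured.isBlank()) ? DEFAULT_USER_AGENT : configured;
    }

    // Builds the connection with the User-Agent header set; nothing is sent yet.
    static Connection connect(String url, String userAgent) {
        return Jsoup.connect(url).userAgent(userAgent);
    }

    // Sends the GET request and parses the response body into a Document.
    static Document fetch(String url, Map<String, String> transformConfig) throws IOException {
        return connect(url, resolveUserAgent(transformConfig)).get();
    }
}
```

With an empty config map, fetch behaves like the hardcoded version in this PR; a transform that sets the assumed "userAgent" key would override it.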
Purpose
When fetching sources in transforms, some servers may block the request (e.g. 403 Forbidden) because the user-agent header is missing. To fix this, the user-agent is set to "Mozilla" on the Jsoup connection before the website is fetched.
This makes roundabout loading work for sources that block requests with a missing user-agent header.*
*Assuming they accept "Mozilla" as a valid user-agent header. The source I'm using does.
Relevant Issue(s)
N/A (not sure if I should have created an issue first)