mathiasbynens / dotfiles

:wrench: .files, including ~/.macos — sensible hacker defaults for macOS
https://mths.be/dotfiles
MIT License
30.14k stars 8.74k forks

Consider changing curlrc user agent #542

Open samskiter opened 9 years ago

samskiter commented 9 years ago

Spent two days hunting down why curl'ing a file from SourceForge was downloading an HTML page rather than the file... the reason? This:

# Disguise as IE 9 on Windows 7.
user-agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

This causes sourceforge to return the 'your download will begin shortly...' page rather than the file itself.

Tatsh commented 9 years ago

In that case use \curl -q (skip aliases and do not use any configuration file). You may want to create an alias for your use case with SourceForge.
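To make the suggestion concrete, here is a minimal sketch of both ideas. The leading backslash bypasses any shell alias named `curl`, and `-q` tells curl to ignore `~/.curlrc` (it must come first on the command line). The URL and the alias name `sfcurl` are illustrative, not from the thread:

```shell
# Bypass any `curl` shell alias (\) and skip ~/.curlrc (-q).
# Note: -q only works if it is the first command-line argument.
\curl -q -L -O "https://downloads.sourceforge.net/project/example/example.tar.gz"

# A hypothetical alias for this use case (name is made up):
alias sfcurl='\curl -q -L -O'
```

With the alias in place, `sfcurl <url>` downloads with curl's stock user agent regardless of what the config file says.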

samskiter commented 9 years ago

What is the user-agent there for currently?

Obviously, once I found the setting, I was able to remedy it quickly. The issue is more that it took me two days to work out what was going wrong.

jrahmy commented 9 years ago

I wouldn't be surprised if there are plenty of sites that block the traditional cURL user agent, since it's commonly used for page scraping and for writing bots/scripts.

If cURL were ever behaving weirdly for me, the configuration would be the first thing I would check.

Tatsh commented 9 years ago

I was going to say the same thing. There are plenty of sites that block cURL, so a would-be hacker who does not know how to change the user agent will simply be blocked. Even if a site does not block this way now, its operators might look at their logs later and block at some point in the future. wget is also often blocked.

To me, having a real browser as the default user agent is normal. I have not experienced many issues. When I do, I either change the user agent or use curl -q to attempt with no configuration.

I use a different user agent than this one because too many sites serve strange content to IE users compared to Chrome/Safari/Firefox. Regardless, in most cases it is better to receive the browser content than possible 'cURL content', or worse, to get blocked.

In your use case with SourceForge, I am not sure what page you are referring to. If you can, use wget without a browser agent for downloading files. The syntax is much nicer.

This is my ~/.curlrc:

verbose
user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"

You may want to have verbose set to on always to ensure you know fully what you are sending.

My ~/.wgetrc is similar (although Wget has some nice features for file downloading that cURL really does not):

# Use the server-provided last modification date, if available
timestamping = on

# Do not go up in the directory structure when downloading recursively
no_parent = on

# Wait 60 seconds before timing out. This applies to all timeouts: DNS, connect
# and read. (The default read timeout is 15 minutes!)
timeout = 60

# Retry a few times when a download fails, but don’t overdo it. (The default is
# 20!)
tries = 3

# Retry even when the connection was refused
retry_connrefused = on

# Use the last component of a redirection URL for the local file name
trust_server_names = on

# Follow FTP links from HTML documents by default
follow_ftp = on

# Add a `.html` extension to `text/html` or `application/xhtml+xml` files that
# lack one, or a `.css` extension to `text/css` files that lack one
adjust_extension = on

# Use UTF-8 as the default system encoding
local_encoding = UTF-8
remote_encoding = UTF-8

# Ignore `robots.txt` and `<meta name=robots content=nofollow>`
robots = off

# Print the HTTP and FTP server responses
server_response = on

# Disguise as Chrome
user_agent = Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36
samskiter commented 9 years ago

Fair enough. Worth noting that this issue was also further hidden by a build system that was using curl under the hood to grab files, so I was a fair few layers away from finding this.

Andersos commented 9 years ago

Looks like we will keep the current user-agent. If you agree @samskiter maybe we can close this?

samskiter commented 9 years ago

Hmm, if the consensus is not to change it, then fine. But I guess it must be a non-standard setup to do this, or the build system (cerbero) I was using wouldn't have expected curl to work OOTB. At first I wasn't even aware curl was being used, so it was a bit of a digging exercise to find it.

samskiter commented 8 years ago

Just hit this issue again while trying to build GStreamer using their cerbero build system after a few months. The symptom was a tar.xz file that wouldn't extract (because it was actually an HTML file). Luckily I remembered this issue, but it still took me a little while to dig it up the second time.

jrahmy commented 8 years ago

Well, it is "non-standard" as it's not the default cURL option. Maybe just change it in your local dotfiles if it's causing a lot of problems. I haven't heard anyone else raise this issue here so far, but perhaps a notice about possible problems would be nice.

SimonSchick commented 8 years ago

Ran into a similar problem with the user-agent today, luckily I knew what I had to look for as I customized my fork a little.

It might be worth just aliasing curl, maybe as curlb or curl-browser.
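A rough sketch of that idea: leave `~/.curlrc` (and thus plain `curl`) with the stock user agent, and make the browser disguise opt-in via a dedicated alias. The alias name and URLs here are illustrative; `-A` is curl's standard flag for setting the user agent:

```shell
# Hypothetical opt-in alias: plain `curl` keeps its default user agent,
# while `curl-browser` disguises itself as Chrome via -A/--user-agent.
alias curl-browser='curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"'

# Default behaviour, no disguise:
curl -L -O "https://example.com/file.tar.xz"

# Browser user agent only when explicitly asked for:
curl-browser -L -O "https://example.com/file.tar.xz"
```

This inverts the repo's current setup, so tools that shell out to `curl` (like cerbero) see the default user agent and are unaffected.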

samskiter commented 8 years ago

Also had to remove:

# When following a redirect, automatically set the previous URL as referer
referer = ";auto"

for running curl -L http://download.sourceforge.net/libpng/libpng-1.6.18.tar.xz

jrahmy commented 8 years ago

@samskiter As per the first reply, just use \curl -q when dealing with SourceForge links directly. Cerbero should probably be doing this internally itself. It might be worth filing a bug report with them.