oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives
MIT License
406 stars 42 forks

Problems with pushing mementos into Internet Archive #43

Open shawnmjones opened 4 years ago

shawnmjones commented 4 years ago

I noticed this when I was using ArchiveNow this morning.

# archivenow www.foxnews.com
Error (The Internet Archive): 445 Client Error:  for url: https://web.archive.org/save/www.foxnews.com

If I add a user agent to the arguments to the requests.get on line 15 of archivenow/archivenow/handlers/ia_handler.py then it works.

https://github.com/oduwsdl/archivenow/blob/cafcbddca7717dba70bffc1982fabffcdbbd912f/archivenow/handlers/ia_handler.py#L15

I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/archivenow.py.

As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/archivenow.py.

https://github.com/oduwsdl/archivenow/blob/cafcbddca7717dba70bffc1982fabffcdbbd912f/archivenow/archivenow.py#L129-L168

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by assigning them as a dictionary to the headers parameter. This dictionary would have to be re-submitted through the code on line 154 to the function executed via multithreading.
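A minimal sketch of how such a headers parameter might be merged with defaults and forwarded. The names `DEFAULT_HEADERS`, `merge_headers`, `stub_handler`, and `push_with_headers` are hypothetical, the default user-agent value is a placeholder, and the handler is a stand-in that makes no network request. The sketch uses `None` rather than `{}` as the default, since a mutable default argument is shared between calls in Python.

```python
# Hypothetical default; the handler's real user-agent string is not shown in this thread.
DEFAULT_HEADERS = {"User-Agent": "archivenow-sketch/0.0"}


def merge_headers(overrides=None):
    """Return a copy of DEFAULT_HEADERS with caller overrides applied."""
    merged = dict(DEFAULT_HEADERS)
    for name, value in (overrides or {}).items():
        # Header names are case-insensitive, so remove any default that the
        # caller overrides under a different capitalisation.
        for existing in [k for k in merged if k.lower() == name.lower()]:
            del merged[existing]
        merged[name] = value
    return merged


def stub_handler(uri, p_args, headers):
    """Stand-in for a handler; a real one would pass headers to requests.get."""
    return {"uri": uri, "headers": headers}


def push_with_headers(uri, arc_id, p_args=None, headers=None):
    """Sketch of the proposed push(): forward merged headers to the handler."""
    # arc_id would select the handler; this sketch has only one stand-in handler.
    p_args = {} if p_args is None else p_args
    return stub_handler(uri, p_args, merge_headers(headers))
```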

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.

maturban commented 4 years ago

Thanks for providing details about the problem.

Do you have any suggestion for how the user can provide headers? For example:

archivenow http://www.example.com --header='{"User-Agent": "Mozilla/5.0 (Windows NT 6.1)", "Accept-Charset": "utf-8"}'

maturban commented 4 years ago

The user-agent is hard-coded in the Internet Archive handler (i.e., archivenow/archivenow/handlers/ia_handler.py) for now.

machawk1 commented 4 years ago

@maturban MemGator has some logic that lets users specify the user-agent on the command line. I think simply accepting a string through a semantically named CLI flag (e.g., MemGator's --agent/-a) would make specifying this value more straightforward for users.

@ibnesayeed might have an opinion on this as well.

shawnmjones commented 4 years ago

Here are my suggestions after thinking about it this morning.

For command-line users

For the command line utility, something like this should suffice:

archivenow http://www.example.com --user-agent "mytool/1.0"

For comparison, wget has -U, --user-agent=AGENT and curl has -A, --user-agent <name>.

We don't have a use case for allowing command-line users to change all request headers, just user-agent.
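As a rough illustration of the command-line suggestion, here is how a dedicated flag could be declared with argparse. This is a sketch only: `build_parser` is a hypothetical helper, the positional name `URI` is assumed, and the real ArchiveNow parser defines other arguments that are omitted here.

```python
import argparse


def build_parser():
    """Build a stripped-down parser with only the proposed --user-agent flag."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("URI", help="web resource to push into an archive")
    # argparse stores the value of --user-agent under the attribute user_agent.
    parser.add_argument(
        "--user-agent",
        default=None,
        help="User-Agent string to send with archive requests",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(
    ["http://www.example.com", "--user-agent", "mytool/1.0"]
)
```

Leaving the default as `None` lets the handler fall back to its own user-agent when the flag is not given.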

For programmers (me, other WS-DL folks, and the world)

Programmers, on the other hand, may need to modify request headers. This is why I was suggesting that we alter def push in archivenow/archivenow.py to be something more like:

def push(URI, arc_id, p_args={}, headers={}):

and have the headers dictionary propagate to the appropriate handler and the requests.get call that it makes.

I have an even better idea.

Because ArchiveNow employs the requests library, you could allow the programmer to set up a session object and send the session object as an argument.

If no session object is specified, the argument can default to a new one. Like this:

def push(URI, arc_id, p_args={}, session=requests.Session()):

This way, the programmer can set up the session once in their own code and just pass it. They may have changed the session object to include caching, timeouts, user-agents, request headers, etc., and ArchiveNow does not need to care what changes were made. It just calls session.get when the time comes.

You can even re-use this session object solution when changing the user-agent string while adding the user-agent argument for command-line users.
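A minimal sketch of the session idea follows. `push_with_session` and `SAVE_ENDPOINT` are hypothetical names, and the URL prefix is taken from the error message earlier in this issue. One caveat with writing `session=requests.Session()` directly in the signature is that Python evaluates default arguments once, at function definition, so every call without an explicit session would share the same object. The sketch defaults to `None` and creates the session per call instead.

```python
# URL prefix taken from the error message quoted in this issue.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def push_with_session(uri, arc_id, p_args=None, session=None):
    """Sketch: issue the save request through a caller-supplied session."""
    if session is None:
        # Third-party dependency; imported here so that it is only needed
        # when the caller does not pass a session of their own.
        import requests

        session = requests.Session()
    # arc_id and p_args mirror the proposed signature but are unused in this sketch.
    return session.get(SAVE_ENDPOINT + uri)
```

A caller could then build `session = requests.Session()`, call `session.headers.update({"User-Agent": "mytool/1.0"})`, and pass `session=session`; the function itself never needs to know which settings were changed.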

ibnesayeed commented 4 years ago

MemGator CLI's user agent works as follows:

ibnesayeed commented 4 years ago

I would not suggest supplying Python dictionaries from the CLI, as CLIs should be language independent.

If you want to allow specifying generic request headers from the CLI (apart from a dedicated flag for the UA), you can use the append action (which will allow repetition of the same argument) to a CLI parameter like --header which accepts a value of the form "header-name: value" (this is how it is done in cURL).
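The repeatable flag described above could look roughly like this with argparse. It is a sketch under assumptions: `parse_header`, `build_parser`, and `headers_from_args` are hypothetical helpers, and the splitting rule (first colon separates name from value) is chosen for illustration, not copied from cURL or ArchiveNow.

```python
import argparse


def parse_header(text):
    """Split 'header-name: value' at the first colon into a (name, value) pair."""
    name, sep, value = text.partition(":")
    if not sep or not name.strip():
        raise argparse.ArgumentTypeError(
            f"expected 'header-name: value', got {text!r}"
        )
    return name.strip(), value.strip()


def build_parser():
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("URI")
    # action="append" collects every occurrence of --header into a list.
    parser.add_argument(
        "--header",
        action="append",
        type=parse_header,
        metavar="HEADER",
        help="request header of the form 'header-name: value'; may be repeated",
    )
    return parser


def headers_from_args(args):
    """Turn the collected (name, value) pairs into a dict; later flags win."""
    return dict(args.header or [])


args = build_parser().parse_args([
    "http://www.example.com",
    "--header", "User-Agent: mytool/1.0",
    "--header", "Accept-Charset: utf-8",
])
```

Partitioning at the first colon keeps values that themselves contain colons, such as URLs, intact.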

As far as the internal API is concerned, I would certainly suggest taking @shawnmjones' advice on supporting a custom session object. In addition to that, I would suggest you use wildcard keyword arguments (those that start with **) so that the API signature does not change each time you add support for one more feature; this also allows forwarding arguments to internal function calls. Depending on the situation, you may introduce some sort of convention in the argument names to group them automatically (e.g., the arguments received through the wildcard **kwargs may carry certain prefixes so that they are treated one way or the other).
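One possible reading of the prefix convention, sketched with made-up prefixes: `split_kwargs`, `push_with_kwargs`, and the `session_` and `handler_` prefixes are all hypothetical and only illustrate how wildcard keyword arguments could be grouped before being forwarded.

```python
def split_kwargs(kwargs, prefixes=("session_", "handler_")):
    """Group keyword arguments by prefix, stripping the prefix from each key."""
    groups = {prefix: {} for prefix in prefixes}
    rest = {}
    for key, value in kwargs.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                groups[prefix][key[len(prefix):]] = value
                break
        else:
            # No prefix matched, so keep the argument under its original name.
            rest[key] = value
    return groups, rest


def push_with_kwargs(uri, arc_id, **kwargs):
    """Sketch: the signature stays fixed while new options arrive via **kwargs."""
    groups, rest = split_kwargs(kwargs)
    # A real implementation would forward each group to the matching internal call.
    return {
        "uri": uri,
        "arc_id": arc_id,
        "session": groups["session_"],
        "handler": groups["handler_"],
        "other": rest,
    }
```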