Open shawnmjones opened 4 years ago
Thanks for providing details about the problem.
Do you have any suggestion for how the user can provide headers? For example:
archivenow http://www.example.com --header='{"User-Agent": "Mozilla/5.0 (Windows NT 6.1)", "Accept-Charset": "utf-8"}'
The user-agent is hard-coded in the Internet Archive handler (i.e., archivenow/archivenow/handlers/ia_handler.py) for now.
@maturban MemGator has some logic that lets users specify the user-agent on the command line. I think simply accepting a string through a semantically named CLI flag (e.g., MemGator's --agent/-a) would make specifying this value more straightforward for users.
@ibnesayeed might have an opinion on this as well.
Here are my suggestions after thinking about it this morning.
For the command line utility, something like this should suffice:
archivenow http://www.example.com --user-agent "mytool/1.0"
For comparison, wget has -U, --user-agent=AGENT and curl has -A, --user-agent <name>.
We don't have a use case for allowing command-line users to change all request headers, just user-agent.
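A minimal sketch of what such a flag could look like with argparse. The parser layout and the DEFAULT_USER_AGENT value are assumptions made for illustration, not ArchiveNow's actual CLI code:

```python
import argparse

# Hypothetical default; the value hard-coded in ia_handler.py may differ.
DEFAULT_USER_AGENT = "archivenow-sketch/0.0"


def build_parser():
    """Build a parser with a dedicated --user-agent option."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("uri", help="URI to submit to the archive")
    # Same long option name that wget and curl use.
    parser.add_argument(
        "--user-agent",
        default=DEFAULT_USER_AGENT,
        help="value to send in the User-Agent request header",
    )
    return parser


args = build_parser().parse_args(
    ["http://www.example.com", "--user-agent", "mytool/1.0"]
)
print(args.user_agent)  # prints: mytool/1.0
```

A distinct name such as --user-agent also avoids colliding with the existing --agent argument, which the original report says is used for choosing the WARC tool.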
Programmers, on the other hand, may need to modify request headers. This is why I was suggesting that we alter def push in archivenow/archivenow.py to be something more like:
def push(URI, arc_id, p_args={}, headers={}):
and have the headers dictionary propagate to the appropriate handler and the requests.get call that it makes.
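A rough sketch of how a headers dictionary could reach the request. The names merge_headers, ia_push, and DEFAULT_HEADERS are hypothetical, and the save URL is an assumption about what the handler calls. The sketch uses None as the default instead of {} so that no dictionary is shared between calls:

```python
# Hypothetical default; stands in for whatever the handler hard-codes today.
DEFAULT_HEADERS = {"User-Agent": "archivenow-sketch/0.0"}


def merge_headers(headers=None):
    """Return a new dict: the defaults overlaid with caller-supplied headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return merged


def ia_push(uri, headers=None):
    """Hypothetical handler entry point that forwards headers to requests.get."""
    # Third-party dependency, imported here so merge_headers stays stdlib-only.
    import requests

    # Assumed save endpoint, used only for illustration.
    return requests.get(
        "https://web.archive.org/save/" + uri,
        headers=merge_headers(headers),
        timeout=30,
    )


print(merge_headers({"User-Agent": "mytool/1.0"}))
```

With this shape, push would only need to pass its headers argument through to whichever handler it selects.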
I have an even better idea.
Because ArchiveNow employs the requests library, you could allow the programmer to set up a session object and send the session object as an argument.
If no session object is specified, the argument can default to a new one. Like this:
def push(URI, arc_id, p_args={}, session=requests.Session()):
This way, the programmer can set up the session once in their own code and just pass it. They may have changed the session object to include caching, timeouts, user-agents, request headers, etc., and ArchiveNow does not need to care what changes were made. It just calls session.get when the time comes.
You can even re-use this session object solution when changing the user-agent string while adding the user-agent argument for command-line users.
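One way the session idea might look in practice is sketched below. This is an illustration under assumptions, not ArchiveNow's code: the body of push is reduced to a single request and the save URL is a placeholder. Note one deliberate difference from the signature above: a default written as session=requests.Session() is evaluated once, when the function is defined, so every call without an explicit session would share that one object. The sketch defaults to None and creates a fresh session per call instead.

```python
def push(uri, arc_id, p_args=None, session=None):
    """Hypothetical variant of push that sends its request through a session.

    arc_id and p_args are kept only to mirror the proposed signature;
    this sketch does not use them.
    """
    if session is None:
        # Third-party dependency, needed only when the caller passes no session.
        import requests

        session = requests.Session()
    # Assumed save endpoint, used only for illustration.
    return session.get("https://web.archive.org/save/" + uri)
```

A caller who wants a custom user agent would create a requests.Session, call session.headers.update({"User-Agent": "mytool/1.0"}) on it, and pass it as session=. Anything with a compatible get method works, which also makes the function easy to exercise with a stub.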
MemGator CLI's user agent works as follows:
By default, it sends a User-Agent header in the MemGator/{Version} <{CONTACT}> format, where the value of {Version} is the version of the MG binary used and the default value of {CONTACT} is set to the URI of the MG repo.
The contact can be changed with the --contact CLI parameter. This value will be placed in the default UA template.
The whole user agent can be replaced with the --agent CLI parameter.
There is also a --spoof flag, which will cause MG to use a random browser UA in each request. There are currently only three spoofing agents defined in the repo, but they can be expanded to have more to choose from.
I would not suggest supplying Python dictionaries from the CLI, as CLIs should be language independent.
If you want to allow specifying generic request headers from the CLI (apart from a dedicated flag for the UA), you can use the append action (which allows repetition of the same argument) on a CLI parameter like --header, which accepts a value of the form "header-name: value" (this is how it is done in cURL).
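The append approach can be sketched with argparse as follows. The parse_header helper and the parser layout are hypothetical and only show the mechanism:

```python
import argparse


def parse_header(text):
    """Split a 'Name: value' string into a (name, value) pair."""
    # partition splits on the first colon only, so values may contain colons.
    name, sep, value = text.partition(":")
    if not sep or not name.strip():
        raise argparse.ArgumentTypeError(
            "expected 'Header-Name: value', got %r" % text
        )
    return name.strip(), value.strip()


parser = argparse.ArgumentParser(prog="archivenow")
parser.add_argument("uri")
# action="append" collects every occurrence of --header into one list.
parser.add_argument(
    "--header",
    action="append",
    type=parse_header,
    default=[],
    help="request header in 'Name: value' form; may be repeated",
)

args = parser.parse_args([
    "http://www.example.com",
    "--header", "User-Agent: mytool/1.0",
    "--header", "Accept-Charset: utf-8",
])
headers = dict(args.header)
print(headers)
```

The resulting dictionary is in the shape that the requests library accepts for its headers argument, so the CLI stays language independent while the Python side still gets a plain dict.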
As far as the internal API is concerned, I would certainly suggest taking @shawnmjones' advice on supporting a custom session object. In addition to that, I would suggest you use wildcard keyword arguments (the ones that start with **) so that the API signature does not change each time you add support for one more feature; it also allows forwarding arguments to internal function calls. Depending on the situation, you may introduce some sort of convention in the argument names to group them automatically (e.g., the arguments received through the wildcard **kwargs may have certain prefixes to be treated one way or the other).
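A small sketch of the prefix convention, assuming a made-up "request_" prefix for options meant for the HTTP call. The split_kwargs helper and the option names are hypothetical:

```python
def split_kwargs(kwargs, prefix):
    """Return (matching, rest); keys in matching have the prefix stripped."""
    matching = {
        key[len(prefix):]: value
        for key, value in kwargs.items()
        if key.startswith(prefix)
    }
    rest = {
        key: value
        for key, value in kwargs.items()
        if not key.startswith(prefix)
    }
    return matching, rest


def push(uri, arc_id, **kwargs):
    """Hypothetical signature: new options arrive through **kwargs."""
    request_options, handler_options = split_kwargs(kwargs, "request_")
    # A real implementation would forward request_options to the HTTP call
    # and handler_options to the selected handler; here they are returned.
    return request_options, handler_options


request_options, handler_options = push(
    "http://www.example.com",
    "ia",
    request_headers={"User-Agent": "mytool/1.0"},
    request_timeout=30,
    example_option="value",
)
print(request_options)
print(handler_options)
```

Because the prefix is stripped, request_options could be forwarded directly as keyword arguments to a call such as session.get(url, **request_options).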
I noticed this when I was using ArchiveNow this morning.
If I add a user agent to the arguments to the requests.get on line 15 of archivenow/archivenow/handlers/ia_handler.py then it works.
https://github.com/oduwsdl/archivenow/blob/cafcbddca7717dba70bffc1982fabffcdbbd912f/archivenow/handlers/ia_handler.py#L15
I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/archivenow.py.
As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/archivenow.py.
https://github.com/oduwsdl/archivenow/blob/cafcbddca7717dba70bffc1982fabffcdbbd912f/archivenow/archivenow.py#L129-L168
For example, push could take a headers parameter, where the user can override any of the request headers by assigning them as a dictionary. This dictionary would have to be passed along through the code on line 154 to the function executed via multithreading.
I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.