This project is an App Engine application for verifying the status of URLs. The application provides a simple API for submitting lists of URLs to be checked and querying their progress.
There are several ways to use this application for checking URLs:
One of the easiest ways to use the link checker for your own purposes, beyond the AdWords Scripts solution, is via Apps Script. A sample script already provides the means to deploy and authenticate with the application, as well as examples of how to interact with the API.
Make a copy of the template spreadsheet for the AdWords Scripts solution by clicking File > Make a copy.
Only the Cloud Setup and App Engine Performance sheets are needed; the other sheets are not relevant outside of AdWords Scripts.
Follow the instructions on the Cloud Setup sheet. Note that if you are intending to interact with other APIs in your Apps Script solution, for example the DoubleClick Search API, then extra scopes should be added in Step 2 on that sheet.
Once all steps on the Cloud Setup sheet are complete, locate the example Apps Script application: still within the spreadsheet, click Tools > Script editor and open the Example.gs script.
Here you will see, within the main function, calls to listOperations, createOperation, getOperation and deleteOperation respectively. These are preceded by the necessary setup to get authentication up and running with the application.
Using these examples, combined with the API reference, it is possible to quickly develop a custom script to work with the link checker application.
Tune the settings for the application, such as the number of checks to be performed in parallel, by configuring the App Engine Performance sheet.
If you are using a custom client outside of Apps Script, you may wish first to use the spreadsheet described in the above sections to ease the deployment of the application. This is highly recommended because:
It is possible to deploy the application yourself by other means, using the App Engine Admin API, and the App Engine API to configure the Task Queue and cron. If you require this, then the best option is to examine the source code of the spreadsheet.
In order to interact with the linkchecker API, it is necessary to obtain the shared key which is required with all API calls. This approach was chosen for its simplicity when working within AdWords Scripts.
The shared key is stored in Google Datastore, where it can be accessed both by the App Engine application and by any client using the Datastore API.
The shared key should then be set in the HTTP Authorization header, e.g.:
Authorization: <your_shared_key>
Apps Script: As shown in the CloudSetup.gs file, in the getSharedKey_ function.
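Java: Using the Datastore client library:

    import com.google.cloud.datastore.Datastore;
    import com.google.cloud.datastore.DatastoreOptions;
    import com.google.cloud.datastore.Entity;
    import com.google.cloud.datastore.Key;

    // Reads the shared key from the "SharedKey" entity in Datastore.
    public String getSharedKey() {
      String projectId = "<your_project_id>";
      Datastore datastore = DatastoreOptions
          .newBuilder()
          .setProjectId(projectId)
          .build()
          .getService();
      String kind = "SharedKey";
      String name = "key";
      Key taskKey = datastore.newKeyFactory().setKind(kind).newKey(name);
      Entity retrieved = datastore.get(taskKey);
      return retrieved.getString("key");
    }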
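Python: Using the Datastore client library:

    from google.cloud import datastore

    def get_shared_key():
        # Reads the shared key from the "SharedKey" entity in Datastore.
        datastore_client = datastore.Client(project="<your_project_id>")
        kind = "SharedKey"
        name = "key"
        task_key = datastore_client.key(kind, name)
        entry = datastore_client.get(task_key)
        return entry["key"]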
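As a rough illustration of how the shared key is then used, the following Python sketch attaches it to a request using only the standard library. The helper name `build_authorized_request`, the project ID and the key value are placeholders chosen for this example and are not part of the application; the request is built but never sent.

```python
import urllib.request


def build_authorized_request(url, shared_key, method="GET", body=None):
    """Build (but do not send) a request carrying the shared key.

    Hypothetical helper: the name and signature are illustrative only.
    """
    request = urllib.request.Request(url, data=body, method=method)
    # The shared key is sent as the raw value of the Authorization header.
    request.add_header("Authorization", shared_key)
    return request


# Placeholder project ID and key, for illustration only.
request = build_authorized_request(
    "https://example-project.appspot.com/_ah/api/batchLinkChecker/v1/settings",
    "example-shared-key",
)
print(request.get_method(), request.full_url)
```

The request could then be passed to `urllib.request.urlopen`, or the same header could be set with whichever HTTP client the custom script already uses.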
The application provides an API with the following methods. All operational methods are relative to the account base URL of:
https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/account/<account-id>/
where:
- project-id is the project ID taken from the Google Cloud console.
- account-id is a variable provided to allow a single instance of the App Engine application to be used by multiple sources. It can be any numeric value.

Method | HTTP request | Description |
---|---|---|
Add | POST [account_base_url]/batchOperation | Submits a batch of URLs to be processed. |
List | GET [account_base_url]/batchOperation | Retrieves a list of current batches and their status. |
Get | GET [account_base_url]/batchOperation/[id] | Retrieves results for a specified batch operation. |
Delete | DELETE [account_base_url]/batchOperation/[id] | Deletes results for a specific operation. |
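The URL patterns in the table above can be assembled with a small helper. This is only a sketch: the function name `batch_operation_url` and the example project and account values are assumptions made for illustration, and the job ID is percent-encoded as a precaution rather than because the API is known to require it.

```python
from urllib.parse import quote


def batch_operation_url(project_id, account_id, batch_id=None):
    """Return the batchOperation URL for an account, optionally for one job.

    Hypothetical helper; the URL layout follows the patterns listed above.
    """
    url = "https://{}.appspot.com/_ah/api/batchLinkChecker/v1/account/{}/batchOperation".format(
        project_id, account_id
    )
    if batch_id is not None:
        # Get and Delete address a single job by appending its ID.
        url += "/" + quote(str(batch_id), safe="")
    return url


# Placeholder values, for illustration only.
print(batch_operation_url("example-project", 12345))
print(batch_operation_url("example-project", 12345, "job-1"))
```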
Furthermore, the API provides methods for retrieving and modifying settings for the linkchecker. All methods are relative to the application base URL of:
https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1
Method | HTTP request | Description |
---|---|---|
Get settings | GET [app_base_url]/settings | Retrieve user-modifiable settings. |
Update settings | PUT [app_base_url]/settings | Update user-modifiable settings. |
POST https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/account/<account-id>/batchOperation
The shared key must be provided in the Authorization header.
The request body should be in JSON format.
Property | Value | Required | Description |
---|---|---|---|
urls[] | list | Yes | A list of URL strings for checking, with a maximum of 15000. |
failureMatchTexts[] | list | No | A list of strings, e.g. "Out of Office", that also constitute a failure. |
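Because a single request accepts at most 15000 URLs, a client with a longer list has to split it across several requests. The sketch below shows one possible way to build the JSON bodies; the function name `build_add_bodies` and the example URLs are assumptions for illustration, and each resulting body would be submitted as a separate job.

```python
import json

MAX_URLS_PER_BATCH = 15000  # limit stated in the table above


def build_add_bodies(urls, failure_match_texts=None):
    """Split urls into chunks and return one JSON request body per chunk.

    Hypothetical helper; property names follow the request table above.
    """
    bodies = []
    for start in range(0, len(urls), MAX_URLS_PER_BATCH):
        body = {"urls": urls[start:start + MAX_URLS_PER_BATCH]}
        if failure_match_texts:
            # Optional property: extra strings that also count as a failure.
            body["failureMatchTexts"] = list(failure_match_texts)
        bodies.append(json.dumps(body))
    return bodies


# Made-up URLs: 15001 entries need two requests.
example_urls = ["https://example.com/page{}".format(i) for i in range(15001)]
bodies = build_add_bodies(example_urls, ["Out of Office"])
print(len(bodies))
```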
{
"items": [
string
]
}
Property | Value | Description |
---|---|---|
items[] | list | A list with one entry, the ID of the job. |
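Since the job ID is wrapped in a one-element list, a client typically unwraps it before storing it for later Get or Delete calls. A minimal sketch, assuming the response text has already been read from the HTTP response; the function name `extract_batch_id` and the sample ID are made up for this example.

```python
import json


def extract_batch_id(response_text):
    """Return the single job ID from an Add response body.

    Hypothetical helper; raises ValueError if the shape is unexpected.
    """
    items = json.loads(response_text).get("items", [])
    if len(items) != 1:
        raise ValueError("expected exactly one job ID, got {}".format(len(items)))
    return items[0]


# Made-up response body in the documented shape.
batch_id = extract_batch_id('{"items": ["example-job-id"]}')
print(batch_id)
```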
GET https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/account/<account-id>/batchOperation
The shared key must be provided in the Authorization header.
The request body should be empty.
{
"items": [
BatchOperation
]
}
where BatchOperation is the following structure:
{
"createdDate": datetime,
"batchId": string,
"status": string
}
Property | Value | Description |
---|---|---|
createdDate | datetime | The date and time of job creation (RFC 3339). |
batchId | string | The ID of the job. |
status | string | Valid responses are COMPLETE or PROCESSING. |
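A common use of the List response is to find which jobs have finished. The following sketch filters the items by status; the function name `batch_ids_by_status` and the sample data are assumptions for illustration only.

```python
import json


def batch_ids_by_status(response_text, status):
    """Return the batchId of every listed job whose status matches.

    Hypothetical helper; field names follow the BatchOperation structure above.
    """
    items = json.loads(response_text).get("items", [])
    return [item["batchId"] for item in items if item.get("status") == status]


# Made-up List response in the documented shape.
sample_list_response = json.dumps({
    "items": [
        {"createdDate": "2017-01-01T10:00:00Z", "batchId": "job-a", "status": "COMPLETE"},
        {"createdDate": "2017-01-01T10:05:00Z", "batchId": "job-b", "status": "PROCESSING"},
    ]
})
completed = batch_ids_by_status(sample_list_response, "COMPLETE")
print(completed)
```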
GET https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/account/<account-id>/batchOperation/<id>
The shared key must be provided in the Authorization header.

Parameter | Value | Description |
---|---|---|
id | string | The ID of the job to retrieve results for. |

The request body should be empty.
{
"errors": [
BatchOperationError
],
"status": string,
"batchId": string,
"checkedUrlCount": integer
}
Property | Value | Required | Description |
---|---|---|---|
errors[] | BatchOperationError | No | If errors were encountered, they are present as a list of BatchOperationError objects. |
batchId | string | Yes | The ID of the job. |
status | string | Yes | Valid responses are COMPLETE or PROCESSING. |
checkedUrlCount | integer | Yes | If the job is complete, contains the total number of URLs checked; otherwise zero. |
where BatchOperationError is the following structure:
{
"url": string,
"message": string
}
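To show how a client might consume a Get response, the sketch below returns nothing while the job is still processing and otherwise condenses the result into a small summary. The function name `summarize_batch`, the summary layout and the sample error message are all made up for this example; the actual wording of error messages is not specified here.

```python
import json


def summarize_batch(response_text):
    """Summarize a Get response, or return None if the job is not complete.

    Hypothetical helper; field names follow the structures described above.
    """
    result = json.loads(response_text)
    if result.get("status") != "COMPLETE":
        return None
    # "errors" is optional and absent when every URL passed.
    errors = result.get("errors", [])
    return {
        "batchId": result["batchId"],
        "checked": result["checkedUrlCount"],
        "failed": len(errors),
        "failures": {error["url"]: error["message"] for error in errors},
    }


# Made-up Get response in the documented shape.
sample_get_response = json.dumps({
    "errors": [{"url": "https://example.com/missing", "message": "example failure"}],
    "status": "COMPLETE",
    "batchId": "job-a",
    "checkedUrlCount": 3,
})
summary = summarize_batch(sample_get_response)
print(summary)
```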
DELETE https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/account/<account-id>/batchOperation/<id>
The shared key must be provided in the Authorization header.

Parameter | Value | Description |
---|---|---|
id | string | The ID of the job to delete. |

The request body should be empty.
The response is empty.
GET https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/settings
The shared key must be provided in the Authorization header.
{
"rateInChecksPerMinute": integer,
"userAgentString": string
}
Property | Value | Description |
---|---|---|
rateInChecksPerMinute | integer | The number of URLs to check per minute per parallel worker. |
userAgentString | string | The User-Agent to use with each request. |
PUT https://<project-id>.appspot.com/_ah/api/batchLinkChecker/v1/settings
The shared key must be provided in the Authorization header.
{
"rateInChecksPerMinute": integer,
"userAgentString": string
}
Property | Value | Required | Description |
---|---|---|---|
rateInChecksPerMinute | integer | No | The number of URLs to check per minute per parallel worker. |
userAgentString | string | No | The User-Agent to use with each request. |
The response is the new settings, if updated, as per the Get Settings request.
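Because both properties are optional, a client can send only the setting it wants to change. The following sketch builds such a PUT request with the standard library without sending it. The function name `build_settings_update`, the placeholder project ID and key, and the choice to send a `Content-Type: application/json` header are assumptions made for this example.

```python
import json
import urllib.request


def build_settings_update(project_id, shared_key,
                          rate_in_checks_per_minute=None,
                          user_agent_string=None):
    """Build (but do not send) a PUT request updating the given settings.

    Hypothetical helper; only the settings that are not None are included.
    """
    settings = {}
    if rate_in_checks_per_minute is not None:
        settings["rateInChecksPerMinute"] = int(rate_in_checks_per_minute)
    if user_agent_string is not None:
        settings["userAgentString"] = user_agent_string
    url = "https://{}.appspot.com/_ah/api/batchLinkChecker/v1/settings".format(project_id)
    request = urllib.request.Request(
        url, data=json.dumps(settings).encode("utf-8"), method="PUT"
    )
    request.add_header("Authorization", shared_key)
    # Assumption: the body is declared as JSON via the Content-Type header.
    request.add_header("Content-Type", "application/json")
    return request


# Placeholder values, for illustration only.
settings_request = build_settings_update(
    "example-project", "example-shared-key", rate_in_checks_per_minute=60
)
print(settings_request.get_method(), settings_request.data.decode("utf-8"))
```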
You will need Maven to build this application.
To build the application, having cloned the GitHub repository:
mvn package
If too many parallel tasks are enabled on the App Engine application, or the rate of checking per task is set too high, requests may be blocked when many URLs belong to the same domain, because the high volume of traffic resembles a Denial of Service attack.
There are two settings that are relevant in controlling performance:
Number of parallel tasks: URLs are checked using tasks within an App Engine Task Queue. The number of parallel tasks can be configured using the App Engine API. The App Engine Performance sheet in the template spreadsheet utilizes this API to allow the user to set the number of tasks.
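If you wish to modify these settings within a custom client, then the format of the API request can be seen in the updateTaskQueue function within CloudSetup.gs.
Rate of checks per task: The rateInChecksPerMinute setting, available through the settings methods of the API, sets the number of URLs checked per minute by each parallel worker.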
Using the two in conjunction allows an appropriate rate of URL checking to be achieved.
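As a rough guide to how the two settings combine, the overall checking rate is approximately the number of parallel tasks multiplied by the per-task rate. The sketch below turns that into an estimated duration for a batch. It is an idealized calculation that ignores task start-up time, slow responses and retries, and the function name `estimated_minutes` is made up for this example.

```python
import math


def estimated_minutes(url_count, parallel_tasks, rate_in_checks_per_minute):
    """Estimate the minutes needed to check url_count URLs.

    Idealized: assumes every task sustains its configured rate throughout.
    """
    if parallel_tasks <= 0 or rate_in_checks_per_minute <= 0:
        raise ValueError("both settings must be positive")
    overall_rate = parallel_tasks * rate_in_checks_per_minute  # URLs per minute
    return math.ceil(url_count / overall_rate)


# Example: 15000 URLs, 5 parallel tasks, 60 checks per minute per task.
minutes = estimated_minutes(15000, 5, 60)
print(minutes)
```

Working backwards from an acceptable request rate for the most heavily represented domain gives a starting point for both values.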