ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0

Reject 'bare' Twitter URLs #645

Open anjackson opened 4 years ago

anjackson commented 4 years ago

Many Twitter URLs have been added with no trailing slash, which scopes in all of Twitter. We do not want this, so we need to refuse to accept them. The downstream python-w3act code has been modified to drop them.

The W3ACT editor should also reject this kind of URL. We should also consider doing the same for other platforms (Facebook, others?).

NOTE that this scoping issue applies to the URL path, so any query parameters should be dropped before determining the scope. For example, we see https://www.twitter.com/name?lang=en in seeds.
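
To make the check concrete, here is a minimal Python sketch of parsing a seed and testing whether it is 'bare'. It assumes that the scope is the URL prefix up to the last slash in the path, which is inferred from the description above rather than taken from the W3ACT code. The names PLATFORM_HOSTS, scope_prefix and is_bare_platform_url are hypothetical, as is the host list.

```python
from urllib.parse import urlsplit

# Hypothetical host list; the ticket only names Twitter explicitly.
PLATFORM_HOSTS = {"twitter.com", "www.twitter.com"}


def scope_prefix(url):
    """Return the prefix a seed would scope in.

    Assumption: scope is the URL up to and including the last '/' in
    the path. Query string and fragment are ignored.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    prefix_path = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc.lower()}{prefix_path}"


def is_bare_platform_url(url):
    """True if the seed is on a listed platform and scopes in the whole site."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in PLATFORM_HOSTS:
        return False
    site_root = f"{parts.scheme}://{parts.netloc.lower()}/"
    return scope_prefix(url) == site_root
```

With this sketch, https://www.twitter.com/name?lang=en is flagged because its scope collapses to the site root, while https://twitter.com/name/ is not. The site root itself is also flagged.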

nicolabingham commented 4 years ago

@anjackson, yes please.

anjackson commented 4 years ago

I've updated the ticket description to reflect the fact that regular expressions are a poor way to deal with this problem, and the URL should be properly parsed to determine the final scope.

anjackson commented 3 years ago

Noting that as per Twitter docs, usernames match [0-9a-zA-Z_]+, and if the Twitter URL is in the basic form https://twitter.com/username then W3ACT could just add in the trailing slash automatically.
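
A rough Python sketch of that automatic fix, assuming the username pattern quoted above and a hypothetical list of Twitter hosts. The function name add_trailing_slash is made up for illustration, and dropping the query string during the rewrite is a choice made here, not something W3ACT is known to do.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Path consisting of exactly one username segment, with no trailing slash.
USERNAME_PATH = re.compile(r"/[0-9a-zA-Z_]+")

# Hypothetical host list for illustration.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com"}


def add_trailing_slash(url):
    """Rewrite https://twitter.com/username to https://twitter.com/username/.

    Any other URL is returned unchanged. The query string and fragment
    are dropped when a rewrite happens (an assumption of this sketch).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in TWITTER_HOSTS and USERNAME_PATH.fullmatch(parts.path):
        return urlunsplit((parts.scheme, parts.netloc, parts.path + "/", "", ""))
    return url
```

URLs that already end in a slash, or that point deeper than the profile (for example a single status), do not match the pattern and pass through untouched.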