Open anjackson opened 4 years ago
@anjackson, yes please.
I've updated the ticket description to reflect the fact that regular expressions are a poor way to deal with this problem, and the URL should be properly parsed to determine the final scope.
Noting that as per Twitter docs, usernames match [0-9a-zA-Z_]+
, and if the Twitter URL is in the basic form https://twitter.com/username
then W3ACT could just add in the trailing slash automatically.
Many Twitter URLs have been added with no trailing slash, which scopes in all of Twitter. We do not want this, so need to refuse to accept them. The downstream
python-w3act
code has been modified to drop them.The W3ACT editor should also reject this kind of URL. We should also consider doing the same for other platforms (Facebook, others?).
NOTE that this scoping issue applies to the URL path, and any query parameters should be dropped before determining the scope. e.g. we see
https://www.twitter.com/name?lang=en
in seeds.