Closed mtxadmin closed 3 years ago
I also said:
Anyway, as said I still need more than just one case to be an argument for such filter -- the last thing I want is to add technical debt to uBO for little tangible benefits overall.
I still do not see cases being brought up to justify the feature.
utm_* stuff (more like privacy and tracking)
Plus Aliexpress stm marketing parameters, and some others.
Yes, of course, there are some specialized URL-redirecting extensions that can truncate URL parameters. But they all lack a subscription feature. If you want to block those trash parameters on a new device, you have to install a separate extension and then manually enter all the parameters. And when, after some time, you discover a new trash parameter and want to get rid of it, you have to add it on EVERY device, including those belonging to parents, relatives and friends. It is hell to explain to inexperienced users which buttons to click and which values to add, instead of adding all that to a neatly auto-updated filter list.
There are at least two cases in which the implementation of $querystrip can be used to eliminate pesky SSAI.
How $querystrip will fix this:
The video players on the below pages request files (xhr/iframe) with some ad-related parameters. If these parameters are removed, the m3u8 no longer has ads baked in server-side.
https://abcotvs.com/index.html
Example URLs:
https://abc30.com/7285501/
https://6abc.com/7255376/
https://abc7.com/7282142/
Example from https://6abc.com/7255376/:
JSON requested:
https://content.uplynk.com/api/v3/preplay/external/10b98e7c615f43a98b180d51797e74aa/102320-wpvi-fyi4u-halloween-CC-vid.json?platform=web&ad.tag=fyi-holidays%2Challoween%2Cfyi-philly%2Cfyi-events%2Cphiladelphia%2Ccommunity&ad.vast3=1&ad.v=2&ad.tfcd=0&ad.is_lat=0&ad.npa=0&ad.correlator=49683&ad=wpvi_vod&ad.adUnit=%2Fwpvi%2F6abc.com%2Fweb%2Fcommunity&ad.pp=otv-web-desktop&ad.vid=otv-7275653&ad.description_url=https%3A%2F%2F6abc.com%2F7255376%2F&ad.sz=640x480&ad.ppid=&ad.vpi=1&accountID=10b98e7c615f43a98b180d51797e74aa&externalID=102320-wpvi-fyi4u-halloween-CC-vid&euid=_000_0_001_SF&ad.cust_params=accesslevel%3D0%26beacTyp%3Dssai%26isAuth%3D0%26aff%3Dwpvi%26lang%3Den%26ait%3Dssai%26chan%3Dmisc%26objid%3D7255376%26isAutoplay%3D1%26isDnt%3D0%26isMute%3D0%26pgtyp%3Dpost%26vdm%3Dvod%26fb_token%3D%26plt%3Ddesktop%26stp%3Dvdms%26swid%3D%26dtci_yxbk%3D%26var%3D16x9%26vps%3D640x360%26noad%3D0%26refDomain%3Dhttps%253A%252F%252F6abc.com%252F7255376%252F%26unid%3D
Response: https://gist.github.com/llacb47/b779691ae61947f3d553ab8a8c2e8b3b
If you open the playURL from that JSON in any video player, it has SSAI.
Possible querystrip filter: ||content.uplynk.com/api/v3/preplay/external/$domain=6abc.com,querystrip=/^ad\./
If the hypothetical resulting JSON is requested: https://content.uplynk.com/api/v3/preplay/external/10b98e7c615f43a98b180d51797e74aa/102320-wpvi-fyi4u-halloween-CC-vid.json?platform=web&accountID=10b98e7c615f43a98b180d51797e74aa&externalID=102320-wpvi-fyi4u-halloween-CC-vid&euid=_000_0_001_SF
Response:
{"prefix": "https://content-ause2.uplynk.com", "ads": {"breaks": [], "breakOffsets": [], "placeholderOffsets": []}, "videoView": [], "sid": "88183284ff9643bdbcd677b38be390fe", "playURL": "https://content-ause2.uplynk.com/api/v3/preplay2/712cb2a3041149c79afa3964b00021ca/447659e71fac930d104a56b5f820da5a/12UMPd9E2kAQYHf9eRchJk6JhLspdMc7aUmFVvNrTz4L.m3u8?pbs=88183284ff9643bdbcd677b38be390fe"}
Opening this playURL in any video player shows that there is no preroll.
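For illustration only, the URL transformation that the proposed querystrip=/^ad\./ filter describes can be approximated in plain JavaScript. This is a sketch of the intended effect, not uBO code: stripParams is a hypothetical helper name, and the request URL is shortened from the example above.

```javascript
// Hypothetical helper: drop every query parameter whose name matches `re`.
function stripParams(href, re) {
    const url = new URL(href);
    // Snapshot the names first so that deleting does not disturb iteration.
    const names = Array.from(url.searchParams.keys());
    for (const name of names) {
        if (re.test(name)) {
            url.searchParams.delete(name);
        }
    }
    return url.href;
}

// Shortened version of the preplay request shown above.
const before = 'https://content.uplynk.com/api/v3/preplay/external/10b98e7c615f43a98b180d51797e74aa/102320-wpvi-fyi4u-halloween-CC-vid.json?platform=web&ad.vast3=1&ad.v=2&ad.sz=640x480&accountID=10b98e7c615f43a98b180d51797e74aa&euid=_000_0_001_SF';
const after = stripParams(before, /^ad\./);
console.log(after);
```

Note that URLSearchParams re-serializes the whole query after a deletion, so the encoding of the remaining parameters may change (for example %20 becomes +).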
URLs:
https://www.nbc.com/late-night-with-seth-meyers/video/michael-keaton-haim/4249694
https://www.nbc.com/saturday-night-live/video/october-17-issa-rae/4246222
https://www.nbc.com/dateline/video/a-promise-to-helene/4247160
Example from https://www.nbc.com/dateline/video/a-promise-to-helene/4247160:
Page requests an iframe at this URL (probably NA geolocked)
https://player.theplatform.com/p/jujdhC/xkaAQrhkr9IU/select/media/guid/2410887629/4247160?mute=false&autoPlay=true&playbackStartPosition=0&policy=147097231&mParticleId=6082210794167379875&params=mode%3Don-demand%26uuid%3Db2372c28-9c5d-46a9-b293-866a484b2ec5%26did%3Dedb63874-27fd-1287-1a61-c829619aa2f2%26rdid%3Dedb63874-27fd-1287-1a61-c829619aa2f2%26userAgent%3DMozilla%252F5.0%2520%2528Windows%2520NT%252010.0%253B%2520rv%253A78.0%2529%2520Gecko%252F20100101%2520Firefox%252F78.0%26am_crmid%3D6082210794167379875%26am_playerv%3Dnull%26am_sdkv%3Dnull%26am_appv%3Dnull%26am_buildv%3Dnull%26am_stitcherv%3Dpoc%26uoo%3D0%26am_cpsv%3D4.0.0-2%26fw_ae%3D%26metr%3D1023%26csid%3Dnbc_tveverywhere_vod_hub%26am_extmp%3Ddefault%26am_abvrtd%3D0%26am_abtestid%3D0%26nw%3D169843%26_fw_did%3Db2372c28-9c5d-46a9-b293-866a484b2ec5%26prof%3Dnbcu_web_svp_js_https%26afid%3D136164654%26sfid%3D1676939%26policy%3D147097231%26fallbackSiteSectionId%3D9244655%26siteSectionId%3Doneapp_desktop_computer_web_ondemand%26manifest%3Dm3u%26switch%3DHLSOriginSecure%26_fw_vcid2%3D169843%3A6082210794167379875%26_fw_h_referer%3Dwww.nbc.com%26schema%3D2.0&episodetitle=A%20Promise%20to%20Helene&nbcuProfile=false&brand=NBC&show=Dateline&MVPDid=undefined#playerurl=https%3A//www.nbc.com/dateline/video/a-promise-to-helene/4247160
There are prerolls and midrolls inserted server-side. You can visit the iframe by itself to see them as well as the url provided on the NBC site.
Possible $querystrip filter: ||player.theplatform.com/p/*/select/media/guid/$subdocument,domain=nbc.com,querystrip='params'
Resulting URL:
https://player.theplatform.com/p/jujdhC/xkaAQrhkr9IU/select/media/guid/2410887629/4247160?mute=false&autoPlay=true&playbackStartPosition=0&policy=147097231&mParticleId=6082210794167379875&episodetitle=A%20Promise%20to%20Helene&nbcuProfile=false&brand=NBC&show=Dateline&MVPDid=undefined#playerurl=https%3A//www.nbc.com/dateline/video/a-promise-to-helene/4247160
No more prerolls or midrolls.
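As a rough sketch of what removing a single named parameter amounts to, the following works on the raw string so that the remaining parameters and the #playerurl fragment stay byte-for-byte unchanged. removeParam is a hypothetical helper, the value of the params parameter is shortened here, and names are compared undecoded; none of this is meant to describe how uBO does it internally.

```javascript
// Hypothetical helper: remove parameters named exactly `name`, leaving the
// remaining parameters and the fragment untouched.
function removeParam(href, name) {
    const hashPos = href.indexOf('#');
    const fragment = hashPos === -1 ? '' : href.slice(hashPos);
    const head = hashPos === -1 ? href : href.slice(0, hashPos);
    const queryPos = head.indexOf('?');
    if (queryPos === -1) { return href; }
    const base = head.slice(0, queryPos);
    const kept = head
        .slice(queryPos + 1)
        .split('&')
        // The name is whatever precedes the first '=' (compared undecoded).
        .filter(pair => pair.split('=')[0] !== name);
    const query = kept.length !== 0 ? '?' + kept.join('&') : '';
    return base + query + fragment;
}

// Shortened version of the iframe URL above.
const iframeURL = 'https://player.theplatform.com/p/jujdhC/xkaAQrhkr9IU/select/media/guid/2410887629/4247160?mute=false&params=mode%3Don-demand%26manifest%3Dm3u&episodetitle=A%20Promise%20to%20Helene#playerurl=https%3A//www.nbc.com/dateline/video/a-promise-to-helene/4247160';
const cleanedIframeURL = removeParam(iframeURL, 'params');
console.log(cleanedIframeURL);
```

Working on the string avoids the re-encoding that URLSearchParams would apply to episodetitle (%20 would become +).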
I think this is more than little tangible benefits, what do you think @gorhill?
@llacb47 Is there an extension to rewrite URLs which you can use to confirm that removing the query parameters does really remove the ads?
Yes, you can use https://einaregilsson.com/redirector/ to test. I just verified that it works as expected.
You can import the rules I used to test: https://gist.githubusercontent.com/llacb47/23a446ac1cc7763a4574a672420626fb/raw/555966752881f67859205c2ab4c579a61d8ad523/redirect-rules.json
I just found that removing parameters also gets rid of SSAI on Discovery Networks sites as well.
Test with these URLs:
https://go.discovery.com/tv-shows/growing-belushi/full-episodes/a-mission-from-god
https://www.tlc.com/tv-shows/90-day-fiance-the-other-way/full-episodes/ready-or-not
https://watch.hgtv.com/tv-shows/fixer-to-fabulous/full-episodes/dave-jennys-pick-dreary-home-gets-bright-update
https://www.sciencechannel.com/tv-shows/unearthed/full-episodes/secrets-of-the-seven-wonders
https://www.ahctv.com/tv-shows/manhunt-kill-or-capture/full-episodes/whitey-bulger-boston-mob-king
and this rule for redirector: https://gist.githubusercontent.com/llacb47/e0c78ffbb44b203da6c975796e7d6608/raw/d4ed40b979d13d6de4d130fe586cfc98383460b2/discovery-redirector.json
Any update?
It's not trivial to implement with as little impact as possible on performance, so this will have to take the time it takes to implement it.
It's in the latest dev build. See commit message for usage -- I do not want to provide details in release notes yet, I prefer filter list authors to experiment with usage to find out if fine tuning is necessary.
Usage example - ||reddit.com^$queryprune=utm_, ||youtube.com^$queryprune=fbclid|gclid
Avoiding queryprune filters from being visited at all is best. I do hope filter authors will be as careful as possible when crafting queryprune filters as I am careful at minimizing all overhead in the code -- otherwise all the coding efforts are going to waste. So typically the query parameter of interest will be part of the filter pattern:
||reddit.com^*utm_$queryprune=|utm_
||youtube.com^*fbclid$queryprune=fbclid
||youtube.com^*gclid$queryprune=gclid
This way uBO will scan the query parameters only when the URL is found to match the targeted query parameters. Mind performance when crafting filters. Your proposed filters force uBO to scan every URL matching reddit.com and youtube.com.
Additionally, prepending queryprune values with | when the match is of the "starts with" kind also helps.
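The two pieces of advice above (put the parameter text in the filter pattern, and anchor the value with | for "starts with" matches) can be pictured with a small sketch. Everything here is an assumption made for illustration: makeNameMatcher and pruneIfHinted are hypothetical names, the unanchored case is modelled as a plain substring match on the parameter name, and uBO's real engine is organized very differently.

```javascript
// Assumption: '|utm_' means "parameter name starts with utm_", while a value
// without the leading '|' matches anywhere in the name. Matching is case-sensitive.
function makeNameMatcher(value) {
    if (value.startsWith('|')) {
        const prefix = value.slice(1);
        return name => name.startsWith(prefix);
    }
    return name => name.includes(value);
}

// `hint` plays the role of the parameter text placed in the filter pattern.
function pruneIfHinted(href, hint, value) {
    // Cheap substring test: URLs without the hint never reach the parser.
    if (href.includes(hint) === false) { return href; }
    const url = new URL(href);
    const matches = makeNameMatcher(value);
    for (const name of Array.from(url.searchParams.keys())) {
        if (matches(name)) { url.searchParams.delete(name); }
    }
    return url.href;
}

const link = 'https://www.reddit.com/r/random/?utm_source=share&notutm_x=1&sort=new';
console.log(pruneIfHinted(link, 'utm_', '|utm_')); // keeps notutm_x and sort
console.log(pruneIfHinted(link, 'utm_', 'utm_'));  // also drops notutm_x
```

The early return is the point of the pattern advice: the relatively expensive query parsing only runs for URLs that already contain the text of interest.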
This comparison table from the ClearURLs wiki might be useful as well.
To be clear, the purpose of queryprune is not to replace URL cleaners, so it shouldn't be compared to these -- its purpose is only to remove query parameters, not to rewrite URLs. At most, uBO's queryprune seems to match what Neat URL does, nothing more.
@gorhill And if a parameter contains the destination url, is it possible to keep that parameter and remove the url with no parameter?
You mean to just remove the parameter value while keeping the parameter name? It's not possible the way I implemented it. Any real use cases demonstrating that an empty parameter value would be useful?
This for example
https://ouo.io/s/BulJXu78?s=http://hackstore.link/alg2i
The destination url is http://hackstore.link/alg2i
I remember there were some warez sites, which force you to go through a shortener, where you have to click on an ad to continue, but I can't find an example link.
This for example
So you want https://ouo.io/s/BulJXu78?s=http://hackstore.link/alg2i to become https://ouo.io/s/BulJXu78?s= ?
https://ouo.io/s/BulJXu78?s=http://hackstore.link/alg2i
to become -> http://hackstore.link/alg2i
Well in that case it's not a queryprune-related issue, it's in the realm of a URL cleaner, beyond uBO's purpose (well, for the time being, I can imagine in some future a queryjump sort of option, but this feels like feature creep which would bring further requests and so on).
The motivation for the current queryprune was definitely that this solves the video ads issues reported by @llacb47 above. (confirmation that this works would be welcome)
One issue with a queryjump sort of filter option is that a filter list author could cause uBO to be redirected to a URL which was not meant to be visited -- URLs encoded as query values should not be automatically interpreted as URLs meant to be visited. Because of this, this is something best left to a dedicated extension, to not end up with a case of rewrite= option in uBO.
I can imagine in some future a queryjump sort of option, but this feels like feature creep which would bring further requests and so on).
It is not necessary. I would rather first have a way to defuse the mechanism that some sites use to force you to click on an ad to continue: uBO blocks all those ads, so you are simply stuck. But this is another topic.
Does it make sense to use the separator placeholder (^) to match the boundary of the parameter in the filter matching part?
And how about using = in the filter or option to mark the boundary on the other side?
! .com/?utm_anything=
! .com/?notutm_u=&utm_anything=
||reddit.com^*^utm_*=$queryprune=|utm_
! &fbclid=
||youtube.com^*^fbclid=$queryprune=|fbclid=
WARNING! Using ...^*^.. is not optimal - https://github.com/uBlockOrigin/uBlock-issues/issues/760#issuecomment-720406961
Filters:
||example.com^*^ga_*=$queryprune=|ga_
||example.com^*^utm_*=$queryprune=|utm_
Address:
http://example.com/?ga_asdf=2&utm_asdf=1
Result:
The page isn’t redirecting properly
Firefox has detected that the server is redirecting the request for this address in a way that will never complete.
1.30.9b3
WARNING! Using ...^*^.. is not optimal - https://github.com/uBlockOrigin/uBlock-issues/issues/760#issuecomment-720406961
More than one wildcard causes pattern matching to fall back to a regex-based filter. Best to limit occurrences of * and ^.
to mark boundary on the other side?
I don't understand what you have in mind.
To match the query parameter name as exactly as possible, it may be necessary to create two filters, ...*?utm_* and ...*&utm_*, to match the start of the name; the end of the name is marked by the equal sign character, as in ...utm_campaign=.
match start of the name and for the end of the name it will be equal sign character
The first worry is to avoid * and ^ as much as possible so as to not cause the pattern matching to require an actual regex instance. The second worry is to be sure that the filter is not visited at all (use 3p, 1p, script, etc.), and that if it is visited, the pattern matching is not going to cause a pointless visit to the query-pruning code. With that in mind, these:
||example.com^*^ga_*=$queryprune=|ga_
||example.com^*^utm_*=$queryprune=|utm_
Would be better as:
||example.com^*ga_$queryprune=|ga_
||example.com^*utm_$queryprune=|utm_
The first form requires the pattern matching to be done with FilterPatternGeneric, a regex-based filter -- and possibly an inefficient one given there are two wildcards per filter.
The second form is internally optimized into FilterPatternRightEx (because of ^*), a non-regex-based filter. And with the second form, really, what are the chances of a matching URL containing an instance of ga_/utm_ and not having ga_/utm_ query parameters to prune? Unlikely, so no point in worrying needlessly.
However, I may implement this dev cycle an old idea proposed a long time ago to let the filter author explicitly declare the token to use for a given filter, rather than let the filtering engine pick one. In that case, the filters could be written as:
||example.com^*ga_$queryprune=|ga_,token=ga
||example.com^*utm_$queryprune=|utm_,token=utm
The built-in tokenizer would discard ga or utm as a token in the above filters because of the preceding *, while a filter author knows that what precedes those two segments is never a token-class character.
Also, the guidelines above apply to any sort of filter, not specifically to queryprune -- it's just that with modifier filters (such as csp=, queryprune=) all hits need to be collated, while with non-modifier filters only the first hit is returned.
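The difference between "first hit" and "all hits collated" can be shown with a toy model. This is only a sketch under loud assumptions: makeFilter, firstHit and allHits are hypothetical names, substring checks stand in for real pattern matching, and nothing here reflects uBO's actual data structures.

```javascript
// Toy filter record; `needles` stands in for real pattern matching and
// `visits` counts how many times the filter was evaluated.
function makeFilter(raw, needles) {
    const filter = { raw, visits: 0 };
    filter.test = url => {
        filter.visits += 1;
        return needles.every(needle => url.includes(needle));
    };
    return filter;
}

// Non-modifier lookup: the first hit settles the outcome, so stop there.
function firstHit(filters, url) {
    return filters.find(filter => filter.test(url));
}

// Modifier lookup: every hit contributes, so all candidates are evaluated.
function allHits(filters, url) {
    return filters.filter(filter => filter.test(url));
}

const pageURL = 'http://example.com/?ga_asdf=2&utm_asdf=1';
const makeCandidates = () => [
    makeFilter('||example.com^*ga_$queryprune=|ga_', ['example.com', 'ga_']),
    makeFilter('||example.com^*utm_$queryprune=|utm_', ['example.com', 'utm_']),
];

const single = makeCandidates();
firstHit(single, pageURL);
console.log(single.map(f => f.visits)); // [1, 0]

const collated = makeCandidates();
console.log(allHits(collated, pageURL).length, collated.map(f => f.visits)); // 2 [1, 1]
```

In the collated case every candidate is evaluated, which is why a modifier filter that is visited needlessly costs more than an ordinary one.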
||www.reddit.com/*utm_$doc,1p,queryprune=|utm_
How's that one? As per the logger, it removes all the utm_ queries in three tries instead of one, not sure what changed.
it removes all the utm_ queries in three tries instead of one
I could probably answer properly if there was a URL provided for me to visit and reproduce.
Just navigate to https://www.reddit.com/r/random/ with the logger open.
I could not reproduce with 1.30.9b4, so probably this was the same issue as https://github.com/uBlockOrigin/uBlock-issues/issues/760#issuecomment-720140135.
Actually, I don't see three tries but then I don't see all the utm_ parameters being removed...
Like that ?
Turns out that despite URLSearchParams looking a whole lot like Map objects, they don't behave like Map objects when iterating through them. I will replace the last dev build to fix the failure to remove query parameters.
Was working fine yesterday though...
Was working fine yesterday though...
Why this comment? b3 was not using URLSearchParams() -- it's all in the commit history.
Why this comment?
Because I was wondering what changed since yesterday as today no matter what pattern I tried, it always took three tries, so I was about to give up on this but decided to ask here about the pattern believing something was wrong with that.
You fell into the same trap I did? https://github.com/Smile4ever/Neat-URL/commit/c21d7e159e90ff7e0f6f07058f67b7be52a54101
queryprune matching is case-sensitive. (Just a note)
Turns out that despite URLSearchParams looking a whole lot like Map objects, they don't behave like Map object when iterating through them. I will replace the last dev build to fix the failure to remove query parameters.
Funny that I just ran across this accidentally. I am working on a new extension for stripping tracking parameters and ran into what I assume is the same issue with the iteration behavior.
let testURL = new URL('https://www.facebook.com/user/test/?__cft__[0]=cft0value&size=sizevalue&__tn__=tnvalue&product=productvalue&__cft__[1]=cft1value&fbclid=fbclidvalue&utm_source=ihatetrackers&color=colorvalue&newuser');
console.log(testURL.search);
// ?__cft__[0]=cft0value&size=sizevalue&__tn__=tnvalue&product=productvalue&__cft__[1]=cft1value&fbclid=fbclidvalue&utm_source=ihatetrackers&color=colorvalue&newuser
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ?size=sizevalue&product=productvalue&fbclid=fbclidvalue&color=colorvalue
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ?product=productvalue&color=colorvalue
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ?color=colorvalue
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ""
Yet forEach does clearly iterate over them...
let rslts=[]; testURL.searchParams.forEach((v,k)=>rslts.push(k)); console.log(rslts);
// ["__cft__[0]", "size", "__tn__", "product", "__cft__[1]", "fbclid", "utm_source", "color", "newuser"]
Interestingly, collecting searchParams.keys() into an array and then calling .delete() on each key works perfectly. Haven't dug into why.
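A minimal sketch of that workaround, using the same test URL: the names are copied into an array before anything is deleted, so the deletions cannot shift entries under a live iteration. The list of tracking prefixes in trackerRe is an assumption chosen for this example.

```javascript
const testURL = new URL('https://www.facebook.com/user/test/?__cft__[0]=cft0value&size=sizevalue&__tn__=tnvalue&product=productvalue&__cft__[1]=cft1value&fbclid=fbclidvalue&utm_source=ihatetrackers&color=colorvalue&newuser');

// Assumed set of tracking-parameter prefixes, for illustration only.
const trackerRe = /^(?:__cft__|__tn__|fbclid|utm_)/;

// Snapshot the names first: the array is unaffected by later deletions.
const names = Array.from(testURL.searchParams.keys());
for (const name of names) {
    if (trackerRe.test(name)) {
        testURL.searchParams.delete(name);
    }
}

console.log(testURL.search);
// ?size=sizevalue&product=productvalue&color=colorvalue&newuser=
```

The trailing = after newuser appears because URLSearchParams re-serializes every remaining pair as name=value.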
All of the above I'm sure you already figured out... Just chiming in since you were concerned about performance.
I actually benchmarked a ton of different routes and surprisingly found that a series of chained replace statements (WITH regex even (not just string replaces)) was 220% faster than anything I could come up with using searchParams. (latest stable chrome on W10).
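For readers wondering what a "chained replace" approach might look like, here is one possible sketch; it is not the commenter's code, and no performance claim is implied. stripWithRegex is a hypothetical name and the list of parameters (utm_*, fbclid, gclid) is an assumption. The query is isolated first so that the path and fragment are never touched.

```javascript
// Hypothetical helper: strip tracking parameters with chained regex replaces
// on the raw query string instead of going through URLSearchParams.
function stripWithRegex(href) {
    const hashPos = href.indexOf('#');
    const fragment = hashPos === -1 ? '' : href.slice(hashPos);
    const head = hashPos === -1 ? href : href.slice(0, hashPos);
    const queryPos = head.indexOf('?');
    if (queryPos === -1) { return href; }
    const base = head.slice(0, queryPos);
    // A leading '&' is prepended so every parameter starts with '&'.
    const cleaned = ('&' + head.slice(queryPos + 1))
        // utm_* parameters, with or without a value.
        .replace(/&utm_[^=&]*(?:=[^&]*)?(?=&|$)/g, '')
        // Click identifiers; the lookahead prevents matching e.g. "fbclidx".
        .replace(/&(?:fbclid|gclid)(?:=[^&]*)?(?=&|$)/g, '')
        // Remove the prepended '&' if a parameter still follows it.
        .replace(/^&/, '');
    return base + (cleaned !== '' ? '?' + cleaned : '') + fragment;
}

console.log(stripWithRegex('https://example.com/page?utm_source=a&id=7&fbclid=xyz&utm_medium=b#top'));
// https://example.com/page?id=7#top
```

Unlike the URLSearchParams route, the surviving parameters are not re-encoded, since they are never parsed.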
This will definitely be a great addition to uBO!
Added as an experiment: https://github.com/easylist/ruadlist/commit/43e1157bf9c3eb68c007208e5df3114c88259182
And I found that this rule works in different ways in Google Chrome and Firefox.
Pay attention to the contents of the address bar.
[GIF: rambler in Chrome] [GIF: rambler in Firefox]
If you open any material, and then click on the Rambler logo, to return to the main page, then when using Google Chrome, the ending will appear in the address bar: ?utm_source=weekend_media&utm_campaign=self_promo&utm_medium=logo&utm_content=head
In addition, this ending sometimes appears after the transition to separate materials. When using Firefox - the address bar will always contain only the address.
@dimisa-RUAdList If you use the logger, you will see no network request of type document matching rambler.ru?utm_, so what you see is just the page being dynamically updated and the address in the URL bar being changed; there is no page load -- so no network request for document. I did see network requests matching rambler.ru?utm_, but these were xhr, which your filter doesn't match.
https://github.com/easylist/ruadlist/commit/74f6fd30707d34344aa6d35403b5a267ae60b221
Nothing changed. In Google Chrome, the ending still appears, in Firefox, only the address itself is still in the address bar.
Is the document request occurring? If not, then it's because of service-worker.
In Google Chrome, the ending still appears
As said, use uBO's logger: you won't see network requests for document. When the logger shows a network request for document, it is cleaned by queryprune. When in doubt, you can also use the browser's dev tools to see that no network requests with utm_ stuff are being made.
Hmm, I do see uBO's logger cleaning up utm_ parameters with ||rambler.ru^*utm_$queryprune=|utm_, but the browser dev tools still show document network requests with utm_ parameters.
Yeah ok sorry @dimisa-RUAdList, you are right, there is an issue, the querypruning does not occur for tabless requests.
Super! Now in Google Chrome everything is fine!
||example.com^*^utm_*=$queryprune=|utm_
I am really second-guessing making this option available to inexperienced people, the issue here shows that the person went directly with the filter above, ignoring everything that followed in the discussion to avoid that kind of filters...
Description
In https://github.com/uBlockOrigin/uBlock-issues/issues/46 about $rewrite parameter author said:
Will the $querystrip parameter be realized?