Implement `$queryprune` parameter

mtxadmin commented 5 years ago

Prerequisites

[x] I verified that this is not a filter issue
- Filter issues MUST be reported at filter issue tracker
[x] This is not a support issue or a question
- Support issues and questions are handled at /r/uBlockOrigin
[x] I performed a cursory search of the issue tracker to avoid opening a duplicate issue
- Your issue may already be reported.
I tried to reproduce the issue when...
- [x] uBlock Origin is the only extension
- [x] uBlock Origin with default lists/settings
- [x] using a new, unmodified browser profile
[x] I am running the latest version of uBlock Origin
[x] I checked the documentation to understand that the issue I report is not a normal behavior

Description

In https://github.com/uBlockOrigin/uBlock-issues/issues/46 about $rewrite parameter author said:

I won't be implementing this filter option, I see too many issues with it. I am however open to implementing a different filter option with a similar purpose, but which would not suffer from the issues I see with how rewrite has been designed. ... I see a better way to implement a similar option but with a more focused purpose: to remove specific query parameters from a URL:

||content.uplynk.com/ext/.m3u8?$querystrip=

Where the querystrip option would mean: "remove all query parameters matching the given lists of tokens or pattern".

Will the $querystrip parameter be realized?


gorhill commented 5 years ago

I also said:

Anyway, as said I still need more than just one case to be an argument for such filter -- the last thing I want is to add technical debt to uBO for little tangible benefits overall.

I still do not see cases being brought up to justify the feature.

mtxadmin commented 5 years ago

utm_* stuff (more like privacy and tracking)

Plus Aliexpress stm marketing parameters, and some others.

Yes, of course, there are some specialized URL-redirecting extensions that can strip URL parameters. But they all lack a subscription feature. If you want to block those trash parameters on a new device, you have to install a separate extension and then manually enter all the parameters. And later, when you discover a new trash parameter and want to get rid of it, you have to add it on EVERY device, even those belonging to parents, relatives and friends. It is hell to explain to inexperienced users which buttons to click and which values to add, instead of simply adding all that to a neatly auto-updated filter list.

liamengland1 commented 4 years ago

There are at least two cases in which the implementation of $querystrip can be used to eliminate pesky SSAI.

How $querystrip will fix this: The video players on the below pages request files (xhr/iframe) with some ad-related parameters. If these parameters are removed, the m3u8 no longer has ads baked in server-side.

Case 1: videos on ABC-owned TV station websites (`https://abcotvs.com/index.html`)

Example URLs:

https://abc30.com/7285501/
https://6abc.com/7255376/
https://abc7.com/7282142/

Example from https://6abc.com/7255376/:

JSON requested:

https://content.uplynk.com/api/v3/preplay/external/10b98e7c615f43a98b180d51797e74aa/102320-wpvi-fyi4u-halloween-CC-vid.json?platform=web&ad.tag=fyi-holidays%2Challoween%2Cfyi-philly%2Cfyi-events%2Cphiladelphia%2Ccommunity&ad.vast3=1&ad.v=2&ad.tfcd=0&ad.is_lat=0&ad.npa=0&ad.correlator=49683&ad=wpvi_vod&ad.adUnit=%2Fwpvi%2F6abc.com%2Fweb%2Fcommunity&ad.pp=otv-web-desktop&ad.vid=otv-7275653&ad.description_url=https%3A%2F%2F6abc.com%2F7255376%2F&ad.sz=640x480&ad.ppid=&ad.vpi=1&accountID=10b98e7c615f43a98b180d51797e74aa&externalID=102320-wpvi-fyi4u-halloween-CC-vid&euid=_000_0_001_SF&ad.cust_params=accesslevel%3D0%26beacTyp%3Dssai%26isAuth%3D0%26aff%3Dwpvi%26lang%3Den%26ait%3Dssai%26chan%3Dmisc%26objid%3D7255376%26isAutoplay%3D1%26isDnt%3D0%26isMute%3D0%26pgtyp%3Dpost%26vdm%3Dvod%26fb_token%3D%26plt%3Ddesktop%26stp%3Dvdms%26swid%3D%26dtci_yxbk%3D%26var%3D16x9%26vps%3D640x360%26noad%3D0%26refDomain%3Dhttps%253A%252F%252F6abc.com%252F7255376%252F%26unid%3D

Response: https://gist.github.com/llacb47/b779691ae61947f3d553ab8a8c2e8b3b If you open the playURL from that JSON in any video player, it has SSAI.

Possible querystrip filter: ||content.uplynk.com/api/v3/preplay/external/$domain=6abc.com,querystrip=/^ad\./

If the hypothetical resulting JSON is requested: https://content.uplynk.com/api/v3/preplay/external/10b98e7c615f43a98b180d51797e74aa/102320-wpvi-fyi4u-halloween-CC-vid.json?platform=web&accountID=10b98e7c615f43a98b180d51797e74aa&externalID=102320-wpvi-fyi4u-halloween-CC-vid&euid=_000_0_001_SF

Response:

{"prefix": "https://content-ause2.uplynk.com", "ads": {"breaks": [], "breakOffsets": [], "placeholderOffsets": []}, "videoView": [], "sid": "88183284ff9643bdbcd677b38be390fe", "playURL": "https://content-ause2.uplynk.com/api/v3/preplay2/712cb2a3041149c79afa3964b00021ca/447659e71fac930d104a56b5f820da5a/12UMPd9E2kAQYHf9eRchJk6JhLspdMc7aUmFVvNrTz4L.m3u8?pbs=88183284ff9643bdbcd677b38be390fe"}

By opening this playURL in any video player, it can be seen that there is no preroll.
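As a rough illustration of what the proposed filter would do to the request URL, here is a minimal sketch in JavaScript. The helper name `pruneQuery` is hypothetical and this is not uBO's implementation; it assumes the pattern is tested against each raw parameter name, and it uses a shortened version of the request URL above.

```javascript
// Hypothetical helper for illustration; not uBO's implementation.
// Removes every query parameter whose raw name satisfies `shouldPrune`.
function pruneQuery(href, shouldPrune) {
    const url = new URL(href);
    if (url.search === '') { return href; }
    // Work on the raw query string so that kept parameters are not re-encoded.
    const kept = url.search.slice(1).split('&').filter(pair => {
        const name = pair.split('=', 1)[0];
        return shouldPrune(name) === false;
    });
    url.search = kept.join('&');
    return url.href;
}

// Shortened version of the Case 1 request (assumption: fewer parameters).
const preplay = 'https://content.uplynk.com/api/v3/preplay/external/' +
    '10b98e7c615f43a98b180d51797e74aa/102320-wpvi-fyi4u-halloween-CC-vid.json' +
    '?platform=web&ad.vast3=1&ad.v=2&ad.tfcd=0' +
    '&accountID=10b98e7c615f43a98b180d51797e74aa&euid=_000_0_001_SF';

// Only platform, accountID and euid should remain in the query.
const prunedPreplay = pruneQuery(preplay, name => /^ad\./.test(name));
console.log(prunedPreplay);
```

Under this assumption, `/^ad\./` would leave a bare `ad=` parameter in place, since that name has no trailing dot; a pattern such as `/^ad(\.|$)/` would be needed to remove it as well.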

Case 2: NBC videos

URLs:

https://www.nbc.com/late-night-with-seth-meyers/video/michael-keaton-haim/4249694
https://www.nbc.com/saturday-night-live/video/october-17-issa-rae/4246222
https://www.nbc.com/dateline/video/a-promise-to-helene/4247160

Example from https://www.nbc.com/dateline/video/a-promise-to-helene/4247160:

Page requests an iframe at this URL (probably NA geolocked)

https://player.theplatform.com/p/jujdhC/xkaAQrhkr9IU/select/media/guid/2410887629/4247160?mute=false&autoPlay=true&playbackStartPosition=0&policy=147097231&mParticleId=6082210794167379875&params=mode%3Don-demand%26uuid%3Db2372c28-9c5d-46a9-b293-866a484b2ec5%26did%3Dedb63874-27fd-1287-1a61-c829619aa2f2%26rdid%3Dedb63874-27fd-1287-1a61-c829619aa2f2%26userAgent%3DMozilla%252F5.0%2520%2528Windows%2520NT%252010.0%253B%2520rv%253A78.0%2529%2520Gecko%252F20100101%2520Firefox%252F78.0%26am_crmid%3D6082210794167379875%26am_playerv%3Dnull%26am_sdkv%3Dnull%26am_appv%3Dnull%26am_buildv%3Dnull%26am_stitcherv%3Dpoc%26uoo%3D0%26am_cpsv%3D4.0.0-2%26fw_ae%3D%26metr%3D1023%26csid%3Dnbc_tveverywhere_vod_hub%26am_extmp%3Ddefault%26am_abvrtd%3D0%26am_abtestid%3D0%26nw%3D169843%26_fw_did%3Db2372c28-9c5d-46a9-b293-866a484b2ec5%26prof%3Dnbcu_web_svp_js_https%26afid%3D136164654%26sfid%3D1676939%26policy%3D147097231%26fallbackSiteSectionId%3D9244655%26siteSectionId%3Doneapp_desktop_computer_web_ondemand%26manifest%3Dm3u%26switch%3DHLSOriginSecure%26_fw_vcid2%3D169843%3A6082210794167379875%26_fw_h_referer%3Dwww.nbc.com%26schema%3D2.0&episodetitle=A%20Promise%20to%20Helene&nbcuProfile=false&brand=NBC&show=Dateline&MVPDid=undefined#playerurl=https%3A//www.nbc.com/dateline/video/a-promise-to-helene/4247160

There are prerolls and midrolls inserted server-side. You can visit the iframe by itself to see them as well as the url provided on the NBC site.

Possible $querystrip filter: ||player.theplatform.com/p/*/select/media/guid/$subdocument,domain=nbc.com,querystrip='params'

Resulting URL:

https://player.theplatform.com/p/jujdhC/xkaAQrhkr9IU/select/media/guid/2410887629/4247160?mute=false&autoPlay=true&playbackStartPosition=0&policy=147097231&mParticleId=6082210794167379875&episodetitle=A%20Promise%20to%20Helene&nbcuProfile=false&brand=NBC&show=Dateline&MVPDid=undefined#playerurl=https%3A//www.nbc.com/dateline/video/a-promise-to-helene/4247160

No more prerolls or midrolls.

I think this is more than little tangible benefits, what do you think @gorhill ?

gorhill commented 4 years ago

@llacb47 Is there an extension to rewrite URLs which you can use to confirm that removing the query parameters does really remove the ads?

mapx- commented 4 years ago

maybe https://addons.mozilla.org/en-US/firefox/addon/requestcontrol/

https://github.com/tumpio/requestcontrol/blob/master/_locales/en/manual.wiki#redirect-using-pattern-capturing

liamengland1 commented 4 years ago

Yes, you can use https://einaregilsson.com/redirector/ to test. I just verified that it works as expected.

You can import the rules I used to test: https://gist.githubusercontent.com/llacb47/23a446ac1cc7763a4574a672420626fb/raw/555966752881f67859205c2ab4c579a61d8ad523/redirect-rules.json

I just found that removing parameters also gets rid of SSAI on Discovery Networks sites as well.

Test with these URLs:

https://go.discovery.com/tv-shows/growing-belushi/full-episodes/a-mission-from-god
https://www.tlc.com/tv-shows/90-day-fiance-the-other-way/full-episodes/ready-or-not
https://watch.hgtv.com/tv-shows/fixer-to-fabulous/full-episodes/dave-jennys-pick-dreary-home-gets-bright-update
https://www.sciencechannel.com/tv-shows/unearthed/full-episodes/secrets-of-the-seven-wonders
https://www.ahctv.com/tv-shows/manhunt-kill-or-capture/full-episodes/whitey-bulger-boston-mob-king

and this rule for redirector: https://gist.githubusercontent.com/llacb47/e0c78ffbb44b203da6c975796e7d6608/raw/d4ed40b979d13d6de4d130fe586cfc98383460b2/discovery-redirector.json

liamengland1 commented 3 years ago

Any update?

gorhill commented 3 years ago

It's not trivial to implement with as little impact as possible on performance, so this will have to take the time it takes to implement it.

gorhill commented 3 years ago

It's in the latest dev build. See commit message for usage -- I do not want to provide details in release notes yet, I prefer filter list authors to experiment with usage to find out if fine tuning is necessary.

uBlock-user commented 3 years ago

Usage example - ||reddit.com^$queryprune=utm_, ||youtube.com^$queryprune=fbclid|gclid

gorhill commented 3 years ago

Avoiding queryprune being visited at all is best. I do hope filter authors will be as careful as possible when crafting queryprune filters as I am careful at minimizing all overhead in the code -- otherwise all the coding efforts are going to waste. So typically the query parameter of interest will be part of the filter pattern:

||reddit.com^*utm_$queryprune=|utm_
||youtube.com^*fbclid$queryprune=fbclid
||youtube.com^*gclid$queryprune=gclid

This way uBO will scan the query parameters only when the URL is found to match the targeted query parameters. Mind performance when crafting filters. Your proposed filters force uBO to scan every URL matching reddit.com and youtube.com.

Additionally, prepending queryprune values with | when the match is of the "starts with" kind also helps.
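The idea of keeping the parameter name in the filter pattern can be pictured as a cheap substring test guarding the more expensive query parsing. The sketch below is only an analogy in JavaScript: `pruneIfNeeded` and `parseCount` are hypothetical names, not uBO internals, and it assumes a leading `|` means "the parameter name starts with this text".

```javascript
// Illustrative only: the names below are hypothetical, not uBO internals.
let parseCount = 0; // counts how often the expensive path runs

function pruneIfNeeded(href, needle, prefix) {
    // Cheap substring test first: most URLs never reach the parsing step.
    if (href.includes(needle) === false) { return href; }
    parseCount += 1;
    const url = new URL(href);
    // Copy the names before deleting, so iteration is not disturbed.
    const names = Array.from(url.searchParams.keys());
    for (const name of names) {
        if (name.startsWith(prefix)) { url.searchParams.delete(name); }
    }
    return url.href;
}

// No "utm_" anywhere in the URL: returned as-is, nothing is parsed.
console.log(pruneIfNeeded('https://www.reddit.com/r/random/', 'utm_', 'utm_'));
// "utm_" present: the query is parsed and the matching parameter removed.
console.log(pruneIfNeeded('https://www.reddit.com/r/random/?utm_source=share&sort=new', 'utm_', 'utm_'));
```

A filter such as `||reddit.com^$queryprune=utm_` corresponds to dropping the first test, so every request to the site takes the slow path.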

curiosityseeker commented 3 years ago

Many other parameters are shown on the Neat-URL site here and here.

majonezzz commented 3 years ago

This comparison table from ClearURLs Wiki might be useful as well.

gorhill commented 3 years ago

To be clear, the purpose of queryprune is not to replace URL cleaners, so it shouldn't be compared to these -- its purpose is only to remove query parameters, not to rewrite URLs. At most, uBO's queryprune seems to match what Neat URL does, nothing more.

lain566 commented 3 years ago

@gorhill And if a parameter contains the destination url, is it possible to keep that parameter and remove the url with no parameter?

gorhill commented 3 years ago

You mean to just remove the parameter value while keeping the parameter name? It's not possible the way I implemented it. Any real use cases demonstrating that an empty parameter value would be useful?

lain566 commented 3 years ago

This for example https://ouo.io/s/BulJXu78?s=http://hackstore.link/alg2i

The destination url is http://hackstore.link/alg2i

I remember there were some warez sites, which force you to go through a shortener, where you have to click on an ad to continue, but I can't find an example link.

gorhill commented 3 years ago

This for example

So you want https://ouo.io/s/BulJXu78?s=http://hackstore.link/alg2i to become https://ouo.io/s/BulJXu78?s=?

lain566 commented 3 years ago

https://ouo.io/s/BulJXu78?s=http://hackstore.link/alg2i to become -> http://hackstore.link/alg2i

gorhill commented 3 years ago

Well in that case it's not a queryprune-related issue, it's in the realm of a URL cleaner, beyond uBO's purpose (well, for the time being, I can imagine in some future a queryjump sort of option, but this feels like feature creep which would bring further requests and so on).

The motivation for the current queryprune was definitely that this solves the video ads issues reported by @llacb47 above. (confirmation that this works would be welcome)

gorhill commented 3 years ago

One issue with a queryjump sort of filter option is that a filter list author could cause uBO to redirect to a URL which was not meant to be visited -- URLs encoded as query values should not be automatically interpreted as URLs meant to be visited. Because of this, this is something best left to a dedicated extension, so as not to end up with a case of the rewrite= option in uBO.

lain566 commented 3 years ago

I can imagine in some future a queryjump sort of option, but this feels like feature creep which would bring further requests and so on).

It is not necessary. I would rather have a way to defuse the mechanism some sites use to force you to click on an ad to continue: uBO blocks all those ads, so you are simply stuck. But this is another topic.

gwarser commented 3 years ago

Does it make sense to use the separator placeholder (^) to match the boundary of the parameter in the filter matching part?

And how about using = in filter or option to mark boundary on the other side?

! .com/?utm_anything=
! .com/?notutm_u=&utm_anything=

||reddit.com^*^utm_*=$queryprune=|utm_

! &fbclid=

||youtube.com^*^fbclid=$queryprune=|fbclid=

WARNING! Using ...^*^.. is not optimal - https://github.com/uBlockOrigin/uBlock-issues/issues/760#issuecomment-720406961

gwarser commented 3 years ago

Filters:

||example.com^*^ga_*=$queryprune=|ga_
||example.com^*^utm_*=$queryprune=|utm_

Address:

http://example.com/?ga_asdf=2&utm_asdf=1

Result:

The page isn’t redirecting properly

Firefox has detected that the server is redirecting the request for this address in a way that will never complete.

1.30.9b3

WARNING! Using ...^*^.. is not optimal - https://github.com/uBlockOrigin/uBlock-issues/issues/760#issuecomment-720406961

gorhill commented 3 years ago

More than one wildcard causes pattern matching to fall back to a regex-based filter. Best to limit occurrences of * and ^.

gorhill commented 3 years ago

to mark boundary on the other side?

I don't understand what you have in mind.

gwarser commented 3 years ago

To match the query parameter name as exactly as possible, it may be necessary to create two filters, ...*?utm_* and ...*&utm_*, to match the start of the name; for the end of the name, it would be the equal sign character - ...utm_campaign=.
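A minimal sketch of the boundary idea, in JavaScript: a plain substring test also matches `notutm_u`, while anchoring at `?` or `&` (start of the name) and at `=` (end of the name) does not. `hasParam` is a hypothetical helper for illustration, not filter syntax.

```javascript
// Hypothetical helper: does the query contain a parameter starting with `text`?
// Pass 'utm_' for a prefix match, or 'fbclid=' to also pin the end of the name.
function hasParam(href, text) {
    const noHash = href.split('#', 1)[0]; // ignore the fragment
    const start = noHash.indexOf('?');
    if (start === -1) { return false; }
    // Each piece begins right after '?' or '&', i.e. at the start of a name.
    return noHash.slice(start + 1).split('&').some(pair => pair.startsWith(text));
}

const loose = 'http://example.com/?notutm_u=1';
console.log(loose.includes('utm_'));   // substring test: matches inside "notutm_u"
console.log(hasParam(loose, 'utm_'));  // boundary-aware test: no such parameter
console.log(hasParam('http://example.com/?notutm_u=&utm_anything=', 'utm_'));
```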

gorhill commented 3 years ago

match start of the name and for the end of the name it will be equal sign character

The first worry is to avoid * and ^ as much as possible so as to not cause the pattern matching to require an actual regex instance. The second worry is to be sure that the filter is not visited at all (use 3p, 1p, script, etc), and that if it is visited then that the pattern matching is not going to cause a pointless visit to the query-pruning code. With that in mind, these:

||example.com^*^ga_*=$queryprune=|ga_
||example.com^*^utm_*=$queryprune=|utm_

Would be better as:

||example.com^*ga_$queryprune=|ga_
||example.com^*utm_$queryprune=|utm_

The first form requires the pattern matching to be done with FilterPatternGeneric, a regex-based filter -- and possibly an inefficient one given there are two wildcards per filter.

The second form is internally optimized into FilterPatternRightEx (because ^*), a non-regex-based filter. And with the second form, really what are the chances of a matching URL containing an instance of ga_/utm_ and not having ga_/utm_ query parameters to prune? Unlikely, so no point to worry needlessly.

However, I may implement this dev cycle an old idea proposed a long time ago to let the filter author explicitly declare the token to use for a given filter, rather than let the filtering engine pick one. In that case, the filters could be written as:

||example.com^*ga_$queryprune=|ga_,token=ga
||example.com^*utm_$queryprune=|utm_,token=utm

The built-in tokenizer would discard ga or utm as a token in the above filters because of the preceding *, while a filter author knows that what precedes those two segments is never a token class character.

Also, the guidelines above apply to any sort of filter, not specifically to queryprune -- it's just that with modifier filters (such as csp=, queryprune=) all hits need to be collated, while with non-modifier filters only the first hit is returned.
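The difference between the two kinds of lookup can be sketched as follows. This is a toy model in JavaScript with hypothetical filter objects and plain substring matching, not uBO's filtering engine: a lookup for non-modifier filters can stop at the first hit, whereas modifier filters have to be collected in full because each one may contribute its own change.

```javascript
// Toy filter list (hypothetical shape; real filters are compiled differently).
const toyFilters = [
    { pattern: 'ga_', queryprune: '|ga_' },
    { pattern: 'utm_', queryprune: '|utm_' },
    { pattern: 'example.com', queryprune: '|fbclid' }, // loose pattern
];

// Non-modifier lookup: the first matching filter settles the outcome.
function firstHit(href, filters) {
    return filters.find(f => href.includes(f.pattern));
}

// Modifier lookup: every matching filter must be visited and collated.
function allHits(href, filters) {
    return filters.filter(f => href.includes(f.pattern));
}

const address = 'http://example.com/?ga_asdf=2&utm_asdf=1';
console.log(firstHit(address, toyFilters).queryprune);
console.log(allHits(address, toyFilters).map(f => f.queryprune));
```

In this toy model the third filter is collected for every URL on the site even though there is no fbclid to prune, which is the kind of pointless visit the guidelines try to avoid.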

uBlock-user commented 3 years ago

||www.reddit.com/*utm_$doc,1p,queryprune=|utm_

How's that one ? As per the logger, it removes all the utm_ queries in three tries instead of one, not sure what changed.

gorhill commented 3 years ago

it removes all the utm_ queries in three tries instead of one

I could probably answer properly if there was a URL provided for me to visit and reproduce.

uBlock-user commented 3 years ago

Just navigate to https://www.reddit.com/r/random/ with the logger open.

gorhill commented 3 years ago

I could not reproduce with 1.30.9b4, so probably this was the same issue as https://github.com/uBlockOrigin/uBlock-issues/issues/760#issuecomment-720140135.

gorhill commented 3 years ago

Actually, I don't see three tries but then I don't see all the utm_ parameters being removed...

uBlock-user commented 3 years ago

[Screenshot: Capture]

Like that ?

gorhill commented 3 years ago

Turns out that despite URLSearchParams looking a whole lot like Map objects, they don't behave like Map objects when iterating through them. I will replace the last dev build to fix the failure to remove query parameters.

uBlock-user commented 3 years ago

Was working fine yesterday though...

gorhill commented 3 years ago

Was working fine yesterday though...

Why this comment? b3 was not using URLSearchParams() -- it's all in the commit history.

uBlock-user commented 3 years ago

Why this comment?

Because I was wondering what changed since yesterday as today no matter what pattern I tried, it always took three tries, so I was about to give up on this but decided to ask here about the pattern believing something was wrong with that.

gwarser commented 3 years ago

You fell into the same trap I did? https://github.com/Smile4ever/Neat-URL/commit/c21d7e159e90ff7e0f6f07058f67b7be52a54101

gwarser commented 3 years ago

queryprune matching is case-sensitive. (Just a note)

minig0d commented 3 years ago

Turns out that despite URLSearchParams looking a whole lot like Map objects, they don't behave like Map objects when iterating through them. I will replace the last dev build to fix the failure to remove query parameters.

Funny that I just ran across this accidentally. I am working on a new extension for stripping tracking parameters and ran into what I assume is the same issue with the iteration behavior.

let testURL = new URL('https://www.facebook.com/user/test/?__cft__[0]=cft0value&size=sizevalue&__tn__=tnvalue&product=productvalue&__cft__[1]=cft1value&fbclid=fbclidvalue&utm_source=ihatetrackers&color=colorvalue&newuser');
console.log(testURL.search); 
// ?__cft__[0]=cft0value&size=sizevalue&__tn__=tnvalue&product=productvalue&__cft__[1]=cft1value&fbclid=fbclidvalue&utm_source=ihatetrackers&color=colorvalue&newuser
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ?size=sizevalue&product=productvalue&fbclid=fbclidvalue&color=colorvalue
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ?product=productvalue&color=colorvalue
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ?color=colorvalue
testURL.searchParams.forEach((v,k)=>{testURL.searchParams.delete(k);});
console.log(testURL.search);
// ""

Yet forEach does clearly iterate over them...

let rslts=[]; testURL.searchParams.forEach((v,k)=>rslts.push(k)); console.log(rslts);
// ["__cft__[0]", "size", "__tn__", "product", "__cft__[1]", "fbclid", "utm_source", "color", "newuser"]

Interestingly, searchParams.keys() into an array and calling .delete() on all the keys, works perfectly. Haven't dug into why.
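One likely explanation, offered as a sketch rather than a diagnosis of either extension: iterating a `URLSearchParams` object walks a live list, so deleting an entry during `forEach` shifts the remaining entries and some get skipped. Taking a snapshot of the names first avoids that. The function names below are made up for illustration.

```javascript
// Deleting while iterating: entries shift under the iterator (the trap described above).
function deleteWhileIterating(href) {
    const url = new URL(href);
    url.searchParams.forEach((value, name) => { url.searchParams.delete(name); });
    return url.search;
}

// Snapshot the names first, then delete: every parameter is visited.
function deleteFromSnapshot(href) {
    const url = new URL(href);
    const names = Array.from(url.searchParams.keys());
    for (const name of names) { url.searchParams.delete(name); }
    return url.search;
}

const sample = 'https://www.facebook.com/user/test/' +
    '?size=sizevalue&__tn__=tnvalue&fbclid=fbclidvalue&color=colorvalue';
console.log(deleteWhileIterating(sample)); // some parameters are left behind
console.log(deleteFromSnapshot(sample));   // empty string
```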

All of the above I'm sure you already figured out... Just chiming in since you were concerned about performance.

I actually benchmarked a ton of different routes and, surprisingly, found that a series of chained replace statements (even with regexes, not just string replaces) was 220% faster than anything I could come up with using searchParams (latest stable Chrome on Windows 10).

This will definitely be a great addition to uBO!

dimisa-RUAdList commented 3 years ago

Added as an experiment: https://github.com/easylist/ruadlist/commit/43e1157bf9c3eb68c007208e5df3114c88259182

And I found that this rule works in different ways in Google Chrome and Firefox.

Pay attention to the contents of the address bar.

[Two animated screenshots: Rambler in Chrome, Rambler in Firefox]

If you open any article and then click on the Rambler logo to return to the main page, Google Chrome shows this ending in the address bar: ?utm_source=weekend_media&utm_campaign=self_promo&utm_medium=logo&utm_content=head In addition, this ending sometimes appears after navigating to individual articles. When using Firefox, the address bar always contains only the address.

Config

Firefox 82.0.3
Google Chrome 86.0.4240.193
uBlock Origin v1.30.9b13
default + [RU AdList](https://subscribe.adblockplus.org/?location=https://easylist-downloads.adblockplus.org/advblock%2Bcssfixes.txt&title=RU%20AdList%20for%20uBlock%20Origin), [Counters](https://subscribe.adblockplus.org/?location=https://easylist-downloads.adblockplus.org/cntblock.txt&title=RU%20AdList:%20Counters)

gorhill commented 3 years ago

@dimisa-RUAdList If you use the logger, you will see no network request for document matching rambler.ru?utm_, so what you see is just the page being dynamically updated along with the URL in the address bar; there is no page load -- so no network request for document. I did see network requests matching rambler.ru?utm_, but these were xhr, which your filter doesn't match.

dimisa-RUAdList commented 3 years ago

https://github.com/easylist/ruadlist/commit/74f6fd30707d34344aa6d35403b5a267ae60b221

Nothing changed. In Google Chrome, the ending still appears, in Firefox, only the address itself is still in the address bar.

uBlock-user commented 3 years ago

Is the document request occurring? If not, then it's because of a service worker.

gorhill commented 3 years ago

In Google Chrome, the ending still appears

As said, use uBO's logger, you won't see network requests for document. When the logger shows a network request for document, it is cleaned by queryprune. When in doubt, you can also use the browser's dev tools to see that no network requests with utm_ stuff are being made.

gorhill commented 3 years ago

Hmm I do see uBO's logger cleaning up utm_ parameters with ||rambler.ru^*utm_$queryprune=|utm_, but the browser dev tools still show document network requests with utm_ parameters.

gorhill commented 3 years ago

Yeah ok sorry @dimisa-RUAdList, you are right, there is an issue, the querypruning does not occur for tabless requests.

dimisa-RUAdList commented 3 years ago

Super! Now in Google Chrome everything is fine!

gorhill commented 3 years ago

||example.com^*^utm_*=$queryprune=|utm_

I am really second-guessing making this option available to inexperienced people; the issue here shows that the person went directly with the filter above, ignoring everything that followed in the discussion to avoid that kind of filter...
