openwpm / OpenWPM

A web privacy measurement framework
https://openwpm.readthedocs.io

Duplicate request_id #989

Closed nrllh closed 1 year ago

nrllh commented 2 years ago

I noticed in my dataset that the same request_id was assigned to different requests (although this is rare). As a result, the request_id in the callstacks table cannot be mapped unambiguously to a single request.

It is particularly important that I find the right request for each call stack. Based on the timestamp, I could take the first request (after the last request in the callstack), but I'm not sure that is a reliable solution. Do you have an idea how I can work around the problem?

Here is an example I have in my dataset:

site_id | subpage_id | url | top_level_url | method | referrer | headers | is_XHR | is_third_party_channel | is_third_party_to_top_window | resource_type | time_stamp | is_websocket | body | etld | content_hash | is_tracker | is_background_req | in_scope | window_id | tab_id | frame_id | parent_frame_id | frame_ancestors | request_id | triggering_origin | loading_origin | loading_href | req_call_stack | post_body | post_body_raw | url_scope | global_uniq_id
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
47 | 0 | https://contextual.media.net/cksync.php?cs=1&type=vzn&ovsid={{APID}}&redirect=https%3A%2F%2Fpixel.advertising.com%2Fups%2F58222%2Fsync%3F_origin%3D1%26uid%3D%24UID | https://www.msn.com/de-de/ | GET | https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp | [["Host","contextual.media.net"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,*/*"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp"],["Connection","keep-alive"],["Cookie","hbcm_sd=1%7C1646673074314; visitor-id=2896746747280784000V10"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","same-origin"]] | 0 | 1 | null | image | 2022-03-07T19:11:14.410000 | 0 | null | media.net | null | null | null | null | 1 | 1 | 2147483652 | 2147483649 | [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] | 129 | https://contextual.media.net | https://contextual.media.net | https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp | null | null | null | https://contextual.media.net/cksync.php | 192876
47 | 0 | https://ups.analytics.yahoo.com/ups/58222/sync?_origin=1&uid=0000EEA&apid=UP9841187a-9e39-11ec-a345-061779e0c7c0 | https://www.msn.com/de-de/ | GET | https://contextual.media.net/ | [["Host","ups.analytics.yahoo.com"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,*/*"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/"],["Connection","keep-alive"],["Cookie","A3=d=AQABBLI8JmICEPL2EPXsDfBFliWLBa28-40FEgEBAQGOJ2IwYgAAAAAA_eMAAAcIsjwmYq28-40&S=AQAAAkXJG3i7bt2vymX74kfQ1VQ; B=8rutsllh2cf5i&b=3&s=rs; IDSYNC=18xa~23mh"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","cross-site"]] | 0 | 1 | null | image | 2022-03-07T19:11:14.939000 | 0 | null | yahoo.com | null | null | null | null | 1 | 1 | 2147483652 | 2147483649 | [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] | 129 | https://contextual.media.net | https://contextual.media.net | https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp | null | null | null | https://ups.analytics.yahoo.com/ups/58222/sync | 193101
47 | 0 | https://pixel.advertising.com/ups/58222/sync?_origin=1&uid=0000EEA | https://www.msn.com/de-de/ | GET | https://contextual.media.net/ | [["Host","pixel.advertising.com"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,*/*"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/"],["Connection","keep-alive"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","cross-site"]] | 0 | 1 | null | image | 2022-03-07T19:11:14.585000 | 0 | null | advertising.com | null | null | null | null | 1 | 1 | 2147483652 | 2147483649 | [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] | 129 | https://contextual.media.net | https://contextual.media.net | https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp | null | null | null | https://pixel.advertising.com/ups/58222/sync | 192941
47 | 0 | https://pixel.advertising.com/ups/58222/sync?_origin=1&uid=0000EEA&verify=true | https://www.msn.com/de-de/ | GET | https://contextual.media.net/ | [["Host","pixel.advertising.com"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,*/*"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/"],["Connection","keep-alive"],["Cookie","APID=UP9841187a-9e39-11ec-a345-061779e0c7c0"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","cross-site"]] | 0 | 1 | null | image | 2022-03-07T19:11:14.759000 | 0 | null | advertising.com | null | null | null | null | 1 | 1 | 2147483652 | 2147483649 | [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] | 129 | https://contextual.media.net | https://contextual.media.net | https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp | null | null | null | https://pixel.advertising.com/ups/58222/sync | 193016

PS: global_uniq_id is my internal row number.

vringar commented 2 years ago

Hey, this might be due to these requests being part of a redirect chain. Iirc the http channel gets reused during a redirect, so all of these requests might indeed be triggered by a single call. Try looking at the response_status in the http_responses table and see if that brings up anything.
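The check suggested above could be sketched roughly as follows. This is only an illustration: it assumes an `http_responses` table with `visit_id`, `request_id`, `url` and `response_status` columns, and it uses a throwaway in-memory database with made-up rows (the `*.example` URLs and status codes are hypothetical, not taken from the dataset in this issue).

```python
import sqlite3

# Hypothetical stand-in for a crawl database; only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE http_responses ("
    "visit_id INTEGER, request_id INTEGER, url TEXT, response_status INTEGER)"
)
conn.executemany(
    "INSERT INTO http_responses VALUES (?, ?, ?, ?)",
    [
        (1, 129, "https://a.example/start", 302),  # made-up rows
        (1, 129, "https://b.example/hop", 302),
        (1, 129, "https://c.example/final", 200),
        (1, 130, "https://d.example/", 200),
    ],
)

# For every (visit_id, request_id) that occurs more than once, count how
# many of its responses carry a 3XX status code.
QUERY = """
SELECT visit_id,
       request_id,
       COUNT(*) AS n_responses,
       SUM(response_status BETWEEN 300 AND 399) AS n_redirects
FROM http_responses
GROUP BY visit_id, request_id
HAVING COUNT(*) > 1
"""
shared = conn.execute(QUERY).fetchall()
print(shared)
```

If every response but the last one in such a group is a 3XX, the shared request_id is most likely explained by a redirect chain rather than by a real ID collision.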

vringar commented 2 years ago

The http_redirects table might be outdated/no longer needed.

nrllh commented 2 years ago

Hey, thanks! Yes, that's the case. All of them are redirects. However, I still wonder what this means for callstacks. Is the request_id in the callstacks table a reference to the last such request or to the first one, based on timestamp?

vringar commented 2 years ago

It's a reference to the entire request chain. The script creates the first request, which then returns with a redirect status code and kicks off the second request. So indirectly the script is responsible for both requests, even though it only directly started the first one. So based on timestamp it directly caused the first one but for analysis purposes it might be helpful to create a mapping from callstack to ordered list of redirects.

When we have done such analysis we called those request chains.
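A minimal sketch of such a mapping, assuming each request row is available as a dict with `visit_id`, `request_id`, `time_stamp` and `url` keys (the function name `build_request_chains` is hypothetical). The sample rows reuse the timestamps and URLs from the example earlier in this issue, with the first and second URL shortened to their url_scope; the value 47 comes from the site_id column there and stands in for a visit identifier.

```python
from collections import defaultdict


def build_request_chains(requests):
    """Map (visit_id, request_id) to the time-ordered list of URLs."""
    grouped = defaultdict(list)
    for row in requests:
        grouped[(row["visit_id"], row["request_id"])].append(row)
    # ISO 8601 timestamps in one fixed format sort chronologically as strings.
    return {
        key: [r["url"] for r in sorted(rows, key=lambda r: r["time_stamp"])]
        for key, rows in grouped.items()
    }


rows = [
    {"visit_id": 47, "request_id": 129,
     "time_stamp": "2022-03-07T19:11:14.410000",
     "url": "https://contextual.media.net/cksync.php"},
    {"visit_id": 47, "request_id": 129,
     "time_stamp": "2022-03-07T19:11:14.939000",
     "url": "https://ups.analytics.yahoo.com/ups/58222/sync"},
    {"visit_id": 47, "request_id": 129,
     "time_stamp": "2022-03-07T19:11:14.585000",
     "url": "https://pixel.advertising.com/ups/58222/sync?_origin=1&uid=0000EEA"},
    {"visit_id": 47, "request_id": 129,
     "time_stamp": "2022-03-07T19:11:14.759000",
     "url": "https://pixel.advertising.com/ups/58222/sync?_origin=1&uid=0000EEA&verify=true"},
]

chains = build_request_chains(rows)
for url in chains[(47, 129)]:
    print(url)
```

A callstacks row with the same visit and request_id can then be looked up in `chains` to get the whole chain, whose first element is the request the script started directly.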

nrllh commented 2 years ago

Thank you very much, it helped to solve my issue. So I'm closing the issue.

nrllh commented 2 years ago

@vringar sorry for the spam, but I didn't want to create a new issue for that since it's potentially related to this issue:

Problem 1: As far as I can see, it's not possible to directly correlate the requests in a call_stack row (in the callstacks table) with an ID. I guess the only option is to compare strings and hope to get the right request_id. If there are multiple records with the same request URL, it's very hard to find the right request_id for the requests that appear in call_stack.

Problem 2: Another problem I face is determining which request triggered the next one. As far as I could observe, the sequence of requests is either top-down or bottom-up. Here is an example:

 instrumentFunction/<@https://space.bilibili.com/7584632:362:25;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:30329;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23700;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23299;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:22815;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:22575;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:100310;null
o@https://s1.hdslb.com/bfs/static/jinkela/space/space.ff495225cc805974552c20fc851f8da0f2cd085a.js:1:51142;null
videoExposureReport@https://s1.hdslb.com/bfs/static/jinkela/space/11.space.ff495225cc805974552c20fc851f8da0f2cd085a.js:1:27800;null
770/mounted/</<@https://s1.hdslb.com/bfs/static/jinkela/space11.space.ff495225cc805974552c20fc851f8da0f2cd085a.js:1:27070;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23700;null
sentryWrapped@https://s1.hdslb.com/bfs/static/jinkela/long/js/sentry/sentry-5.2.1.min.js:2:37520;null

Problem 3: As you can see the URL https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js appears in different sequences. How should I interpret that?

Thank you very much in advance!

vringar commented 2 years ago

Hey,

  1. I'm sorry, I don't quite understand this problem. Which ID do you want to correlate it to? The request_id? You can use the request_id to correlate with a redirect chain, where the first element in the redirect chain is the URL originally called and the last one is the URL which returned with a status code other than 3XX. What other correlation do you want?
  2. I don't think you can determine which request triggered which other one. The callstack is bottom-to-top. So the first function called is sentryWrapped, which then calls value, which calls 770/mounted/</<, however that name came about.
  3. This is because the script is calling other functions in the same script. I'm assuming they are all called value because they are all function objects or whatever the minifier produced. And the call from other things might end up back in the script due to callbacks or something.

englehardt commented 2 years ago

1/ I think the level of tracing you want to do is just not possible with the instrumentation we have in place right now. The stacks we save come directly from the browser; we don't have a way to label which script URL listed in the stack corresponds to which webRequest ID. That would require a bunch of plumbing throughout the browser to trace properly. Note that if you link a call stack table row back to a web request, then you know which JS context that call is executing in. So this is only a problem when there are multiple copies of a script executing in the exact same context (which does happen).

2&3/ it sounds like you might be confusing call stack with HTTP redirects? Like Stefan mentions the call stack shows calling relationships between scripts which are executing in the same JS context, not a series of requests. So scripts can call into each other (or use methods defined in one another).
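To make the bottom-to-top reading concrete, here is a small sketch that splits stack lines of the shape seen in the example (`function@script_url:line:column;null`) and reverses them into call order. The helper names `parse_call_stack` and `call_order` are hypothetical, the stack is abridged to three lines from the example, and the field after `;` is kept as an opaque string since its meaning is not discussed here.

```python
def parse_call_stack(call_stack):
    """Split each stack line into function name, script URL, line and column."""
    frames = []
    for line in call_stack.splitlines():
        line = line.strip()
        if not line:
            continue
        frame, _, trailing = line.rpartition(";")
        func_name, _, location = frame.partition("@")
        # The URL itself contains ':', so split line/column off from the right.
        script_url, line_no, col_no = location.rsplit(":", 2)
        frames.append({
            "func_name": func_name,
            "script_url": script_url,
            "line_no": int(line_no),
            "col_no": int(col_no),
            "trailing": trailing,
        })
    return frames


def call_order(frames):
    """Stacks are stored innermost-first; reverse to get caller before callee."""
    return list(reversed(frames))


# Abridged to three lines from the example stack in this issue.
STACK = """
instrumentFunction/<@https://space.bilibili.com/7584632:362:25;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23700;null
sentryWrapped@https://s1.hdslb.com/bfs/static/jinkela/long/js/sentry/sentry-5.2.1.min.js:2:37520;null
"""

frames = parse_call_stack(STACK)
for frame in call_order(frames):
    print(frame["func_name"], frame["script_url"])
```

The output lists functions calling into each other within one JS context; it says nothing about which HTTP request led to which.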

nrllh commented 2 years ago

Thank you very much. I had some difficulty understanding the callstacks, but now it's clear.

Not sure if I should create an issue, but I can't see DNS responses for all HTTP redirects. It seems we only have the DNS response for the final request of a request chain. That means we are probably missing some data for redirect chains in the dns_responses table.

englehardt commented 1 year ago

I noticed that DNS issue myself and filed #1020 for it.

wesley-tan commented 3 months ago

Hi there! I am an undergraduate researching browser fingerprinting. So, ultimately,

  1. What is the difference between id and request_id?
  2. How are request_id grouped?

nrllh commented 3 months ago

  1. The id is the row number, which increases independently of request_id or visit_id. The request_id is the ID of HTTP requests, and it resets after each visit.
  2. The data is grouped by visit_id.