pes10k / pagegraph-query

extract information about iframes from pagegraph files

Issue on Line 63 of `graph/__init__.py` #2

Closed AlbertoFDR closed 6 months ago

AlbertoFDR commented 7 months ago

Sorry to bother you again; I don't know if you are still working on this. I have a similar project for parsing the graphs with Python (not as clean, obviously) :)

Line 63 of `pagegraph/graph/__init__.py` crashes with `TypeError: 'module' object is not iterable` at the line `for insert_edge in self.insert_edges():`. I suppose the correct call is `edges()`, not `insert_edges()`, and I don't know what functionality is missing there.
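For context, Python raises this exact message whenever a `for` loop is handed a module object instead of an iterable. The sketch below only reproduces the error class, using the standard-library `json` module as a stand-in; the helper name `iterate` is hypothetical, and this is not the code from `graph/__init__.py`.

```python
import json  # any module works here; json is only a stand-in


def iterate(obj):
    """Try to loop over obj; return the TypeError message, or None on success."""
    try:
        for _ in obj:
            pass
    except TypeError as err:
        return str(err)
    return None


# A module is not iterable, so the loop fails before its first step.
message = iterate(json)
print(message)
```

A name that resolves to a module rather than to a method or a collection is a common way to end up with this error.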

Thanks again!

pes10k commented 7 months ago

Yep, sorry, I shouldn't have pushed code so fast. I just saw that I had already completed what you were asking about in your last issue. I will try to get this sorted and pushed today.

pes10k commented 6 months ago

@AlbertoFDR this should be sorted now with 144b465904faff1e6de7c44b7417936abde47cf8, though if you are still seeing any issues, please reopen. Thanks!

pes10k commented 6 months ago

Also, @AlbertoFDR, I would be very interested to learn more about what kinds of queries you're doing against PageGraph data. I'm trying to stress-test and check the correctness of some of this, so it would be great to expand it to more use cases.

AlbertoFDR commented 6 months ago

I'm still finding some issues (I don't think it's because of my Brave version):

About my queries, or what I'm trying to do: as @L3thal14 suggested some months ago (https://github.com/brave/brave-browser/issues/35130), it would be nice to catch all (or almost all) of the Web API and builtin calls in PageGraph. Recording response headers (e.g., XFO, CSP, PP...) from the documents would also be really great. My final suggestion is to record the attributes of all HTML elements, not just the dynamically created ones, and to be able to recover changes to them. I know some of these suggestions could be hard or impossible to maintain, such as the experimental attributes.

More technical questions about PageGraph: what is the logic behind the frame owner node? I've seen (if I remember correctly) that when there is no src, or the src is about:blank or javascript:, there isn't a frame owner node. In the other case, for iframes that load a specific path on the same domain, the frame owner is created. My guess is that the first kind runs in the same context, while the second kind runs in a different context even though it has access. Also, I noticed that each frame owner has an empty first DOM root (about:blank). I guess this is for isolation, but for PageGraph's purposes we don't need it. Thanks again :)

AlbertoFDR commented 6 months ago

In addition to my previous comment: for the JS calls, it would also be nice to have some traceability of the calling code. I'm thinking of something similar to what DevTools does, where even if the script is minified you can see the script prettified and where the call comes from (the Initiator tab). I guess this could also be very challenging to implement.

pes10k commented 6 months ago

> I don't think it's because of my Brave version

Please make sure you are using the most recent nightly version, or are building your own version. There have been several significant changes in the last few weeks.

> Using a dummy local page with one iframe…

I believe this is fixed with the most recent pagegraph-query push

> it would be nice to catch all (or almost all) of the Web API and builtin calls in PageGraph

A large number are already caught. You can add as many more as you like by specifying the ones you want to catch in this file as described on the wiki

> when there is no src, or the src is about:blank or javascript:, there isn't a frame owner node

This is not correct. The frame owner is the iframe (or similar element), which can contain any number of different documents (each represented by a DOMRootNode instance) as the src of the iframe changes. But there will always be a FrameOwnerNode for every iframe (and similar).
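The relationship described above, one long-lived owner hosting a sequence of documents, can be pictured with a small sketch. Only the names `FrameOwnerNode` and `DOMRootNode` come from this thread; the fields, the `navigate` method, and the example URLs are hypothetical and are not PageGraph's or pagegraph-query's actual API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DOMRootNode:
    """One document loaded into a frame (the url field is hypothetical)."""
    url: str


@dataclass
class FrameOwnerNode:
    """An iframe (or similar) element; it persists across the documents it hosts."""
    tag: str
    documents: List[DOMRootNode] = field(default_factory=list)

    def navigate(self, url: str) -> DOMRootNode:
        # Changing src adds a new document; the owner itself stays the same.
        doc = DOMRootNode(url)
        self.documents.append(doc)
        return doc


iframe = FrameOwnerNode("iframe")
iframe.navigate("about:blank")
iframe.navigate("https://example.com/page.html")
urls = [doc.url for doc in iframe.documents]
print(urls)
```

Under this picture, an iframe with no src or with an about:blank src still has an owner; what differs between cases is only the list of documents attached to it.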

> nice to have some traceability of the calling code

I am not sure I understand here. PageGraph makes it clear which script is responsible for the JS call, and even provides the exact line and character offset in the JS file responsible. What additional traceability are you looking for?

> if the script is minified you can see the script prettified

I see how this could be handy, but it's pretty far outside what PageGraph is looking to do. Though, you shouldn't have any problem extracting the JS source from PageGraph and running it through any formatter / beautifier you like, and / or mapping the call site PageGraph identifies in the original source to the formatted source.
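One way to do the mapping step is sketched below. It assumes the call site is available as a 0-based character offset into the script text, and that the formatter changed whitespace only (real beautifiers may also rewrite quotes or semicolons, which would break this approach). The function names `map_offset` and `line_col` and the sample scripts are hypothetical.

```python
def map_offset(original: str, formatted: str, offset: int) -> int:
    """Map a 0-based offset in original to the matching offset in formatted.

    Assumes the two strings differ only in whitespace, so their
    non-whitespace characters appear in the same order.
    """
    if not 0 <= offset < len(original) or original[offset].isspace():
        raise ValueError("offset must point at a non-whitespace character")
    # Position of the target among the original's non-whitespace characters.
    rank = sum(1 for ch in original[:offset] if not ch.isspace())
    seen = 0
    for i, ch in enumerate(formatted):
        if ch.isspace():
            continue
        if seen == rank:
            return i
        seen += 1
    raise ValueError("formatted source has too few non-whitespace characters")


def line_col(text: str, offset: int) -> tuple:
    """Convert a 0-based offset into a 1-based (line, column) pair."""
    line = text.count("\n", 0, offset) + 1
    col = offset - (text.rfind("\n", 0, offset) + 1) + 1
    return line, col


minified = "function f(){fetch('/a');}"
pretty = "function f() {\n  fetch('/a');\n}\n"
call_site = minified.index("fetch")  # stand-in for an offset taken from the graph
new_offset = map_offset(minified, pretty, call_site)
print(line_col(pretty, new_offset))  # (2, 3)
```

Counting non-whitespace characters keeps the sketch independent of any particular formatter; a source map, where the formatter can emit one, would be the more robust choice.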

AlbertoFDR commented 6 months ago

Yes, you are right! My bad. Thanks!