gleanerio / gleaner

Gleaner: JSON-LD and structured data on the web harvesting
https://gleaner.io
Apache License 2.0

Multithread headless #239

Closed valentinedwv closed 3 months ago

valentinedwv commented 1 year ago

This is going to be work. Right now, when you call the endpoint and navigate, it's as if you typed a name into one tab of the browser. Each time you call the endpoint, it's the same tab of the browser. So when we multithread, we quickly send N requests and only get one result back... all N end up being the same.

Looking at using a session, in which new pages can be spawned, each with its own definition over the web socket.

There is also the timeout issue, which keeps being exposed. The timeout has to be implemented for each 'tab', not for one shared context.


Neotoma is 55k items, headless. 149 hours in and only at item 7900, so it is slow. Discovering that headless is running one at a time, though some multi-threading is done when acquire.go/ResRetrieve calls the headless page render.

For sitemaps, can we just use a single method, and determine whether a URL is headless from the 'source'? Basically, make the headless path part of acquire/ResRetrieve/getDomain?

Fetch the headless setting from the config: https://github.com/gleanerio/gleaner/blob/bfe4140c546565cb7973187a197fb3ea32d3336d/internal/summoner/acquire/acquire.go#L107

If headless, just do a pagerender call before this, and return: https://github.com/gleanerio/gleaner/blob/bfe4140c546565cb7973187a197fb3ea32d3336d/internal/summoner/acquire/acquire.go#L144-L151

And if, before processing, the response is 40X or 50X, return and do not run headless: https://github.com/gleanerio/gleaner/blob/bfe4140c546565cb7973187a197fb3ea32d3336d/internal/summoner/acquire/acquire.go#L151-L161

I think we might need to sort the URLs in a sitemap, and offer the ability to skip to/start at entry N. Sorting would provide a consistent order to the items.

Or we just need the ability to read a list of URLs and partition them ourselves.
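A small sketch of both ideas, sorting with a start offset and splitting a list into partitions. The function names `window` and `partition` are made up for illustration; nothing here is an existing Gleaner option.

```go
package main

import (
	"fmt"
	"sort"
)

// window returns a sorted copy of urls, skipping the first `start` entries
// and keeping at most `limit` entries (limit <= 0 means no limit).
// Sorting is lexical, so ".../10" sorts before ".../2".
func window(urls []string, start, limit int) []string {
	sorted := append([]string(nil), urls...)
	sort.Strings(sorted)
	if start < 0 {
		start = 0
	}
	if start > len(sorted) {
		start = len(sorted)
	}
	sorted = sorted[start:]
	if limit > 0 && limit < len(sorted) {
		sorted = sorted[:limit]
	}
	return sorted
}

// partition splits urls into n contiguous chunks of roughly equal size.
func partition(urls []string, n int) [][]string {
	if n < 1 {
		n = 1
	}
	parts := make([][]string, 0, n)
	for i := 0; i < n; i++ {
		lo := i * len(urls) / n
		hi := (i + 1) * len(urls) / n
		parts = append(parts, urls[lo:hi])
	}
	return parts
}

func main() {
	urls := []string{
		"https://data.neotomadb.org/3",
		"https://data.neotomadb.org/1",
		"https://data.neotomadb.org/2",
		"https://data.neotomadb.org/10",
	}
	fmt.Println(window(urls, 1, 2))
	for i, p := range partition(window(urls, 0, 0), 2) {
		fmt.Println("partition", i, p)
	}
}
```

Because the order is deterministic, a run that stopped at entry N could be resumed by passing N as `start`, and partitions could be handed to separate runs.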

valentinedwv commented 1 year ago

I think that the context is getting shared, and canceled. If I run identifier over the JSON-LD from a URL, I get two different file shas. For https://data.neotomadb.org/6 I get a Matched identifier, [https://api.neotomadb.org/v2.0/data/downloads/10]

Which means something is mucked up in the multi-threading of the headless.

Here is the log from Neotoma when running with multiple threads.

level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org/14"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org/7"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org/10"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org/6"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=31dfd3956c1790d09ce4f01c5d2871ad5389f577 url="https://data.neotomadb.org/4"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=31dfd3956c1790d09ce4f01c5d2871ad5389f577 url="https://data.neotomadb.org/5"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=31dfd3956c1790d09ce4f01c5d2871ad5389f577 url="https://data.neotomadb.org/3"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=31dfd3956c1790d09ce4f01c5d2871ad5389f577 url="https://data.neotomadb.org/1"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=31dfd3956c1790d09ce4f01c5d2871ad5389f577 url="https://data.neotomadb.org/11"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=cd73b309c9f847339ccdd2d788305910feb0ced5 url="https://data.neotomadb.org/2"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=cd73b309c9f847339ccdd2d788305910feb0ced5 url="https://data.neotomadb.org/12"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=cd73b309c9f847339ccdd2d788305910feb0ced5 url="https://data.neotomadb.org/8"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=cd73b309c9f847339ccdd2d788305910feb0ced5 url="https://data.neotomadb.org/110"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=cd73b309c9f847339ccdd2d788305910feb0ced5 url="https://data.neotomadb.org/13"
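
The symptom in the log is that different URLs report the same sha. A quick way to spot this in a longer log is to group URLs by sha, sketched below. The regular expression assumes the exact `sha=... url="..."` layout shown above, and `urlsBySHA` is a made-up helper name; the sample lines are copied from the log.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// lineRE pulls the sha and url fields out of one log line.
var lineRE = regexp.MustCompile(`sha=([0-9a-f]+) url="([^"]+)"`)

// urlsBySHA groups URLs by the sha reported on each log line.
// Lines that do not match the expected layout are ignored.
func urlsBySHA(log string) map[string][]string {
	groups := make(map[string][]string)
	for _, line := range strings.Split(log, "\n") {
		m := lineRE.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		groups[m[1]] = append(groups[m[1]], m[2])
	}
	return groups
}

func main() {
	log := `level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org/10"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=1e8ef4083db3ae6b86435707243cdbb38914cf34 url="https://data.neotomadb.org/6"
level=info issue="Uploaded JSONLD to object store" jsonld#=0 sha=31dfd3956c1790d09ce4f01c5d2871ad5389f577 url="https://data.neotomadb.org/4"`

	groups := urlsBySHA(log)
	shas := make([]string, 0, len(groups))
	for sha := range groups {
		shas = append(shas, sha)
	}
	sort.Strings(shas)
	for _, sha := range shas {
		// More than one URL per sha suggests pages are sharing rendered content.
		if len(groups[sha]) > 1 {
			fmt.Println("shared sha", sha, groups[sha])
		}
	}
}
```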

valentinedwv commented 1 year ago

I think we might need to use a session manager: https://pkg.go.dev/github.com/mafredri/cdp#example-package

valentinedwv commented 3 months ago

This works