Data Processing task queue

oyiptong commented 8 years ago

Implement a data processing task queue and worker infrastructure.

The workers need to have network access for obtaining for e.g.:

manifest data and images
readability extracts
fallback image generation

marcoscaceres commented 8 years ago

In theory, we should not need any workers here. The page scraping can generally be done in the content process (so long as we don't do anything silly that would block the main thread), and any necessary fetches are already done asynchronously/non-blocking, with a promise being returned.

For any fallback images, we should maybe just have a default image for now.

oyiptong commented 8 years ago

Does it use the hiddenBrowser?

Do we have concurrency control or is the promise the "queue"?

oyiptong commented 8 years ago

Also, I would prefer not doing it in the main process. the workload might be fairly heavy on I/O and maybe cpu for readability.

We are going to be launching multiple tasks for one URL, after identifying the ones we want to return

oyiptong commented 8 years ago

The workflow goes like this:

hit DB for links, whether history or bookmarks, selecting those we want to return to content
for each link, hit the cache for each of: favicon, manifest data, arc90 extracts
if there is a cache miss, hit the network (possibly cached) via the respective services
store the data on disk

We do the above with both a hot and cold cache, empty or full history. We also do updates, listening on new history entries, the query pattern looks like it includes most recent sites.

All of this is done in the main process. The SQL disk hits are done in a separate thread, so I/O won't block, hopefully.

Favicon hits will be hitting sqlite: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsFaviconService.cpp

First pass, we could try without task control, letting the promise machinery do the queuing for us. Should the performance be an issue, we can offload the task to a chromeworker or something similar.

oyiptong commented 8 years ago

Hmm. We could get rid of the separate favicon query by combining it with the places query.

marcoscaceres commented 8 years ago

Does it use the hiddenBrowser?

Ideally, it shouldn't.... Depends how we architect it. There might be a bunch of security issues related to fetching images based on the metadata of a page: They need to go through the CSP machinery or may need to do integrity checks. As such, fetching of images, etc. should ideally be done in the security context of the content process (i.e., using contentWindow.fetch(thing).then(showIt)).

Do we have concurrency control or is the promise the "queue"?

The promise is the queue. You can Promise.all(theThings) or you can to one by one:

var getManifest = ManifestObtainer.obtain(content);
var getOpenGraph = ThingThatGetOpenGraphData();
getManifest.then(showIt).catch(ohOh);
getOpenGraph.then(showIt).catch(ohOh);

Or variants of the above.

Also, I would prefer not doing it in the main process. the workload might be fairly heavy on I/O and maybe cpu for readability.

Fetches, at least, should be safe as requests are handled in a thread safe way (i.e., they are done in the network layer, away from the main thread). Writes to the DB presumably can be sent to the parent process (or an entirely different process). We can definitely spin up workers for processing data as needed, but it might be a premature to do so... or the IPC involved could produce it's own lag problems. Anyway, once we get the data from the pages we will should see problems fairly quickly.

marcoscaceres commented 8 years ago

Should the performance be an issue, we can offload the task to a chromeworker or something similar.

Agree.

oyiptong commented 8 years ago

As such, fetching of images, etc. should ideally be done in the security context of the content process

How do we create a hidden browser in the content process? is hiddenDOMWindow in the parent process?

Writes to the DB presumably can be sent to the parent process

Yes, we are doing the DB queries off-main thread. We'll need to communicate via messages to this hidden window in the content process.

marcoscaceres commented 8 years ago

How do we create a hidden browser in the content process? is hiddenDOMWindow in the parent process?

No, that should be in the content process. The question is more about the security properties of a hiddenDOMWindow with respect to what is going to be loaded in there.

oyiptong commented 8 years ago

No, that should be in the content process

So if we obtain Services.appShell.hiddenDOMWindow from the main process, you're saying it would be in the content process? I may be understanding this wrong.

The addon code runs in the main process, and the only thing we have for now that runs in the content process is the addon page, and it is unprivileged. We'll need to have access to the privileged APIs in the content process somehow. Know of a means to achieve this easily?

The question is more about the security properties of a hiddenDOMWindow with respect to what is going to be loaded in there

Assuming we have a hiddenDOMWindow in the content process, what about creating frames and loading the documents in those?

marcoscaceres commented 8 years ago

So if we obtain Services.appShell.hiddenDOMWindow from the main process, you're saying it would be in the content process? I may be understanding this wrong.

Yeah, sorry, we seem to be talking past each other. I would be surprised if you can obtain hiddenDOMWindow from the parent (main?) process, but maybe I'm wrong. It would be weird to access hiddenDOMWindow from the parent process because it would violate e10s expectations.

The context in which we are going to display the information is going to be a HTML document, so we can leverage that Document to load stuff for us as needed. Because of this, I'm hopeful we can avoid using hiddenDOMWindow altogether.

Assuming we have a hiddenDOMWindow in the content process, what about creating frames and loading the documents in those?

Creating frames might now work, again for security reasons, because a lot of sites prevent themselves from being loaded in an iframe (via X-FRAME-OPTIONS http header). However, that doesn't matter because so long as we have access to the child process, we have access to the contentWindow and hence we can create whatever HTML elements we want (or call any JS API we want, in a secure way).

Anyway, I think the best way to validate our assumptions is to get coding and try it out.

oyiptong commented 8 years ago

Ok. Let's find out about some things. I'd rather we think this through a little bit before going to code. I think this will ensure we do the least amount of work for what we need. We have 1 month for a shipped product.

Open questions in my mind are:

How do we open a hidden window in the content process?
How do we control access to this hidden window?

2 could be by using the promise task queue as a queue in the parent (main) process. We'll be communicating with this content process via messages. We'll need to build timeouts.

Once we have this IPC setup, it would be smooth sailing.

marcoscaceres commented 8 years ago

We might need to set up a time to chat a bit more about this - cause I'm still not sure I fully grasp what we are trying to achieve in this bug. In my mind, it would work like:

user navigates to a website (via URL bar, or via Google, or whatever).
After page load, we go into the page, and scrape whatever metadata is there:
- If there is a link@rel=manifest (found with ManifestFinder.jsm), we kick off ManifestObtainer.jsm to fetch us the manifest. Doing the check will take 1 tick from the content process - so it's super fast. Kicking off the fetching of the manifest is as fast as the network connection can be established, and the file downloaded ... processing is super fast, and done off main thread.
- Else, If there is no manifest, scrape the page for the application-name, icons, etc. etc., and shove that in the Places DB. Return;
Once we have the manifest, we fetch the appropriate icon (I have some code somewhere that does the intelligent selection already - it leverages <picture>).
Once we have everything we need, we shove all that into the DB.

The challenges we face are (but we might choose to ignore):

The user kills the tab before we have all the stuff - so we don't get to the point where we can write to the DB (no biggy). That makes sense from a privacy standpoint.
The content switches metadata on us on the fly... like, the list of icons or the manifest URL changes in the DOM. For meta tags, this would be kinda annoying as we need to listen for DOM changes.... there is some gecko code that does this already, but I don't know if it covers everything we need (e.g., the user changing application-name or dynamically adding opengraph data to the head of the document).

In my mind, the most important thing is that we never create new documents or anything like that. We only leverage the "contentDocument" of the tab which the user has opened. If that tab is killed, then the whole process of writing the metadata to the DB is killed with it. Updating a site's preview, etc. thus would only happen when the user returns to a particular site. That way, we never ping a server or access resources without the user first navigating to a given URL - as that protects user's privacy by pinging sites. Leveraging the "contentDocument" also affords us all the security properties of the page - so that we don't go off and load things we are not supposed to load unintentionally.

Is the above totally off base?

oyiptong commented 8 years ago

This is one access pattern out of a few: "most recent".

We need to cater for a few more:

most frecent
synced
bookmarked

All of those may start from a cold boot.

For most recent, it's easy enough, you are correct, modulo some edge cases: tab closes after history entry but before manifest/readability parsing.

marcoscaceres commented 8 years ago

@oyiptong, do you agree that we should never navigate/scrape a site without the user being on that site? Just want to be sure we are on the same page.

oyiptong commented 8 years ago

Ah, I think that's why we were talking through each other. We need a task queue if we scrape sites without the user being on that site.

Given that we need to scrape the site without them being on them, we need to manage the tasks.

There are a few situations where we need to do so:

new activity streams user with existing history
new history items coming in from sync
cache invalidation causing misses
user page load interruption (enough for history entry, not enough for scraping)

k88hudson commented 8 years ago

We've decided for this iteration due to time constraints we're going to use embedly to collect required data, so we won't need this. However, this would be good to reference in the future

oyiptong commented 8 years ago

worked on in another repo: https://github.com/mozilla/loci/issues/4, https://github.com/mozilla/loci/issues/5, https://github.com/mozilla/loci/issues/7

jaredlockhart commented 8 years ago

Backlogging this for now until the results of the first local metadata experiment are in.

mozilla / activity-stream

Data Processing task queue #25