rchipka / node-osmosis

Web scraper for NodeJS

Thoughts on collaborating? #2

matthewmueller opened this issue 9 years ago (status: Open)

matthewmueller commented 9 years ago

It seems like we're heading in the same direction. I've been working on the following library: https://github.com/lapwinglabs/x-ray.

I really like some of your design decisions here, specifically around offering a native parser and how you're handling an array of items.

I think having native bindings as the default makes sense, but having a fallback to a node-only solution (as ws and bson do) would be helpful.
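For what it's worth, the optional-binding pattern those modules use boils down to something like this (module names here are hypothetical):

var parser;
try {
  // prefer the compiled native addon when it built successfully
  parser = require('native-parser');
} catch (e) {
  // otherwise fall back to a pure-JavaScript implementation
  parser = require('pure-js-parser');
}
module.exports = parser;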

Some things that x-ray adds are a pluggable driver and more fine-grained control over how many requests you're making. So it makes sense to me to merge the projects or come up with some way of working together.

Let me know! :-D

rchipka commented 9 years ago

I'm interested in the idea of a collaboration. I noticed that your library was the closest thing I could find to what I'm trying to accomplish with Osmosis.

When you refer to websockets or bson, do you mean supporting those as input or as output?

In a way, Osmosis does support driver/middleware-like functionality through the then command, but it doesn't actually augment the parsing functionality the way a driver would.

One way to limit the number of requests made is to create custom middleware with then, although there should be a better solution. Another way is to limit the number of links followed. If you only want to follow the first 50 links on a page, you can add :limit(50) to the end of your CSS selector.
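A minimal sketch of the :limit approach, using Osmosis's chainable get/follow/set/data API (the URL and selectors are made up):

var osmosis = require('osmosis');

osmosis.get('http://example.com/results')
  .follow('a.result:limit(50)')   // follow at most the first 50 links
  .set({ title: 'h1' })
  .data(function (item) {
    console.log(item);
  });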

One decision I'm making with Osmosis is whether to keep it selector oriented or to include commands that do the same thing, such as a .limit(n) command like x-ray has.

I'm also going to add a command similar to x-ray's paginate to replace the current method, which is to pass a selector string or callback as the second argument to find or follow.
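So the current form looks roughly like this (a sketch of the second-argument style described above; the URL and selectors are made up):

var osmosis = require('osmosis');

osmosis.get('http://example.com/search?q=news')
  // the second argument to follow is the pagination selector:
  .follow('.result > a', '.pagination a.next')
  .set({ headline: 'h1' });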

matthewmueller commented 9 years ago

Awesome. I'm going to respond inline and then just sort of dump my thoughts on the topic, and maybe we can come up with a better solution or converge in our thinking.

> When you refer to websockets or bson, do you mean supporting those as input or as output?

They both provide optional native bindings, so if the native bindings fail there's a fallback. After looking into it a bit more (though I'm not sure how up-to-date or accurate these stats are), it looks like htmlparser2 may be faster than the native bindings anyway. See: https://github.com/fb55/htmlparser2#performance.

> In a way, Osmosis does support driver/middleware-like functionality through the then command, but it doesn't actually augment the parsing functionality the way a driver would.

Can you provide an example of this? Isn't the osmosis.get / osmosis.post logic internal? Or are you just saying not to use those methods, and instead make the request using phantom, proxies, etc., then pass it through the parser?


The way I'm thinking about the next iteration of x-ray is something along the lines of GraphQL: https://speakerdeck.com/relayjs/data-fetching-for-react-applications?slide=36.

Right now, I'm specifically thinking something like this:

var x = xray()
  .driver(fn)
  .concurrency(5)
  .delay(1000)

x('http://google.com/?q=puppy', {
  'title': '.title',
  'items[]': x('.link', {
    'name': 'a',
    'content': x('a@href', {
       'header': 'h1' 
    })
  }).paginate('.next').limit(100)
})

Where the first argument of x can be a URL, a selector for an attribute containing a URL, or an element to scope the inner selection to.

In terms of flow control, this would resolve inward in a breadth-first fashion.

The tricky part, in my opinion, is error handling. Implementing a streaming interface would also be nice, so you don't lose your progress if something breaks.

darsain commented 9 years ago

This issue is awesome! Just yesterday I was thinking about the most polite way to make one of these projects steal the good stuff from the other :D

What I miss from x-ray:

What I miss from osmosis:

What I miss from both:

If these projects meet in the middle, it will be amazing :)

rchipka commented 9 years ago

I like the idea of having an option to use a native or non-native parser. In addition to speed, memory usage is also worth considering when choosing a parser. I'm not sure how the other parsers perform when a large number of documents are being requested and processed, and therefore need to be held in RAM.

What I meant was that then could allow you to take a web page and parse it using something else like phantomjs if needed. It's not the most elegant solution though.

I do like that syntax; that's almost exactly how an Osmosis parser looked when I first made it. However, I found that the promise-based style was much better, especially for complex parsing and scraping. That way you build the parser down instead of out. I wanted Osmosis to read more like an application-specific scraping language than a bunch of nested JavaScript.

One difference between Osmosis and x-ray is that x-ray seems more equipped for building deeper JSON objects, like { title: 'title', posts: { total: 10, date: 'today', items: [] }, 'uploads[]': [] }. I built Osmosis to handle only a single level of data per instance, like just the items[] or just the uploads[]. That way, if you need to crawl thousands of listings or news items, the objects go in and out of RAM and into a database as fast as possible. So, the way it works right now, Osmosis is more for extracting the many specific objects you need out of many pages.
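In practice that pattern looks something like this (the URL and selectors are made up, and db.insert stands in for whatever store you use):

var osmosis = require('osmosis');
var db = { insert: console.log }; // placeholder store

osmosis.get('http://example.com/listings')
  .find('.listing')
  .set({ title: 'h2', price: '.price' })
  .data(function (listing) {
    // each flat object can leave RAM as soon as it's emitted
    db.insert(listing);
  });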

Osmosis does have built-in concurrency; however, there's currently no option for controlling request timing. Osmosis was initially built under the assumption that specific data needs to be requested and extracted in a specific order.

As for authentication: if you know the session ID cookie, you can set it universally with osmosis.config("cookies", []) or per request with the opts argument. If you don't know the session cookie, you can get it by using then to post the login form and retrieve it. It would be a good idea to have an option to remember cookies so they don't have to be dealt with manually. If Osmosis remembered cookies, you could just start it with osmosis.post('example.com/login.php', {user: 'username', pass: 'password'}) and it would all be taken care of from there.
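Roughly, for both cases (the cookie object format shown here is illustrative; the thread only establishes that config("cookies", ...) takes an array):

var osmosis = require('osmosis');

// Known session cookie, set universally:
osmosis.config('cookies', [{ name: 'PHPSESSID', value: 'abc123' }]);

// With remembered cookies, logging in would be all the setup needed:
osmosis.post('http://example.com/login.php', { user: 'username', pass: 'password' })
  .follow('a.members')
  .set({ account: '.account-name' });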

matthewmueller commented 9 years ago

> I do like that syntax; that's almost exactly how an Osmosis parser looked when I first made it. However, I found that the promise-based style was much better, especially for complex parsing and scraping. That way you build the parser down instead of out. I wanted Osmosis to read more like an application-specific scraping language than a bunch of nested JavaScript.

I see, so in the case of osmosis, how would you support branching? Would you branch in then(fn) statements? Or is branching not on the agenda?

> One difference between Osmosis and x-ray is that x-ray seems more equipped for building deeper JSON objects, like { title: 'title', posts: { total: 10, date: 'today', items: [] }, 'uploads[]': [] }. I built Osmosis to handle only a single level of data per instance, like just the items[] or just the uploads[]. That way, if you need to crawl thousands of listings or news items, the objects go in and out of RAM and into a database as fast as possible. So, the way it works right now, Osmosis is more for extracting the many specific objects you need out of many pages.

Yah, I can't imagine having too deeply nested objects, but that could definitely be a concern. I was also thinking about using SHAs to let you start where you left off if there's a failure, in the same way that Docker has incremental builds, or stack (https://github.com/tj/stack#how-it-works). I haven't thought this through enough yet, though.

> As for authentication: if you know the session ID cookie, you can set it universally with osmosis.config("cookies", []) or per request with the opts argument. If you don't know the session cookie, you can get it by using then to post the login form and retrieve it. It would be a good idea to have an option to remember cookies so they don't have to be dealt with manually. If Osmosis remembered cookies, you could just start it with osmosis.post('example.com/login.php', {user: 'username', pass: 'password'}) and it would all be taken care of from there.

Yah, the way I'm seeing authentication go down is that you kind of need a before script, and potentially an after script (analogous to setup and teardown in tests), to get you to the right spot with the right cookies to start scraping. Maybe this should be done outside of the scraper, though; I'm not sure.


Also, thanks @darsain for popping by with some feedback!

nicola commented 9 years ago

+1!

wle8300 commented 9 years ago

+1

danilopopeye commented 9 years ago

:clap:

jmartsch commented 9 years ago

This could be the swiss-army-knife of scraping tools. Hope to see something soon :) +1

MMRandy commented 9 years ago

How ironic. After spending the weekend researching the best potential scraping tools, I settled on two... and here you are already talking about collaboration :)

In any event, I'm very interested to see what happens here!

rchipka commented 9 years ago

I can't speak to x-ray's latest capabilities; however, here are some of the things Osmosis is doing.

Check out https://github.com/rc0x03/node-osmosis/wiki for more details.

matthewmueller commented 9 years ago

> Having the ability to "opt in" to a headless render prior to scraping a request is a fantastic idea; I'm just not a fan of the current headless rendering options (too heavy, too slow, not very scalable, etc.). I'd love to see advances in this area.

Agreed 100%. Phantom seems needlessly slow.

> Osmosis already has hidden support for creating a DOM and running JS

So I really like the idea of this, because it will be much faster than using a headless browser. But the implementation seems to stub out most of the functions, so I don't see how you'll be able to run scripts of any complexity and expect the page to render fully.

acao commented 9 years ago

I am using both side by side. Each has its merits!

Phantom would be nice as an option for when executing AJAX in the DOM is necessary; otherwise, the performance without it is much greater.

X-ray allows me to parse read streams.

Otherwise, osmosis has a lot more features that I like, TBH. The promise chain makes it very straightforward. X-ray does do something similar, but there are just more methods. I use .then() a lot with osmosis to filter out or skip documents, because I'm parsing a lot of squirrely content.

Also, osmosis lets me use XPath. I was able to build a 13F EDGAR parser (for the XML format) in 83 lines, in two hours, and store the results in Mongo. The Python developers on our project were pretty flustered by that.
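For example, something along these lines (the URL and selectors are made up; Osmosis's then callback receives (context, data, next), and only items passed to next() continue down the chain):

var osmosis = require('osmosis');

osmosis.get('http://example.com/13f-filing.xml')
  .find('//infoTable')                       // XPath works alongside CSS
  .set({ issuer: 'nameOfIssuer', value: 'value' })
  .then(function (context, data, next) {
    // skip squirrely records; only forward complete ones
    if (data.issuer) next(context, data);
  })
  .data(function (holding) {
    console.log(holding);
  });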

I can't figure out how to get the raw HTML with osmosis, though. I tried using some XPath methods. This was very handy with x-ray.

Maybe I could write an article that compares the two and proposes some options for efficient collaboration between them?

rchipka commented 9 years ago

> I can't figure out how to get the raw HTML with osmosis, though.

@acao the next version of Osmosis will have a keep_data option, which will cause Osmosis to store the raw HTML in context.response.data.
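In other words, something like this (a sketch against the unreleased option described above):

var osmosis = require('osmosis');

osmosis.get('http://example.com')
  .config('keep_data', true)               // upcoming option
  .then(function (context, data, next) {
    var html = context.response.data;      // raw HTML of the fetched page
    console.log(html.length);
    next(context, data);
  });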

The next release of Osmosis has been held up by libxmljs. Upcoming Osmosis features depend on the pull requests I merged into libxmljs being released to npm. Once the latest libxmljs is available on npm, I'll push the new Osmosis changes.

rchipka commented 9 years ago

> X-ray allows me to parse read streams.

@acao Do you mean parsing streaming XML data? Although libxmljs has a streaming parser, Osmosis will never support this functionality, because it is document-oriented by nature.

rchipka commented 9 years ago

> I don't see how you'll be able to run scripts of any complexity and expect the page to render fully.

@matthewmueller Osmosis will extend the Document prototype with DOM object properties. Some of the properties will have "getter" and "setter" functions. It will also create a Window object using the same technique. This will create a fully functional DOM interface that supports AJAX and external resource requests.
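A stripped-down illustration of the getter technique (not the actual libxmljs-dom source):

var libxmljs = require('libxmljs');

// Lazily expose a DOM-style property on libxmljs elements;
// nothing is computed until the property is actually read.
Object.defineProperty(libxmljs.Element.prototype, 'innerHTML', {
  get: function () {
    return this.childNodes().map(function (node) {
      return node.toString();
    }).join('');
  }
});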

albertpeiro commented 9 years ago

+1

Ivanca commented 8 years ago

Just a suggestion: so that this doesn't bloat the core functionality, it should be an opt-in module, the way expressjs does it. For example:

var DOM = require('DOM-simulation');
osmosis.use(DOM());

And maybe the same should be done for any new big functionality in the future.

acao commented 8 years ago

+1 to that. I was dreaming of just this the other day. Composing instances like that would be a very efficient way of bringing the best of both worlds together.

rchipka commented 8 years ago

@Ivanca this was my original plan when I began implementing a DOM wrapper for libxmljs. However, after testing the DOM wrapper's performance, I realized that there is very little overhead and an almost unnoticeable difference in memory footprint. This is because the DOM wrapper extends the libxmljs Document and Element prototypes. By using "getters" and "setters" for every DOM property possible, Osmosis only ever loads properties as they are needed. The same goes for Window: a window object is only created as needed.

The libxmljs-dom project is still at a very early stage, and there are many more performance improvements and added features in the upcoming version. One of the setbacks for libxmljs-dom is getting a few libxmljs bugs fixed. Luckily, development on libxmljs has been speeding up lately.

Osmosis will work without the DOM capabilities, but it's most likely headed in the "DOM always on" direction. I think most people would prefer to interface with libxml contexts using familiar DOM functions anyway. I assure you that Osmosis will always strive for the best possible performance, especially when it comes to memory usage. Within the next few releases of Osmosis, DOM support will be greatly improved and stabilized, with lots of cool new features.