tether / roach

A very adaptable web crawler framework. Impossible to kill.

Crawlers should support parsing PDF files #16

Open ekryski opened 10 years ago

ekryski commented 10 years ago

We could just use the Ruby parser that has already been written and integrate it into roach as a job.

ADunfield commented 10 years ago

When you get to this, you should probably take a deep look at the Ruby PDF parser we have. I feel like that guy has worked out some serious magic.

bredele commented 10 years ago

@ADunfield I will. @ekryski good point, we could have a Ruby version of a job so that we reuse what's already been done.

ekryski commented 10 years ago

We have 2 Ruby scripts for PDFs:

  1. One crawls through the IAR site and grabs the PDFs, which get sent to our FTP server
  2. One parses the PDFs and turns them into JSON

What I think we can do is have 2 jobs: