Crawlers should support parsing PDF files

ekryski commented 10 years ago

We could just use the Ruby parser that has already been written and try to integrate it into roach as a job.

ADunfield commented 10 years ago

When you get to this you should probably have a deep look at the Ruby PDF parser we have. I feel like that guy has worked out some serious magic.

bredele commented 10 years ago

@ADunfield I will. @ekryski good point, we could have a Ruby version of a job so that we reuse what's already been done.

ekryski commented 10 years ago

We have 2 Ruby scripts for PDFs:

1. One crawls through the IAR site and grabs the PDFs, which get sent to our FTP server
2. One parses the PDFs and turns them into JSON

What I think we can do is have 2 jobs:

1. One that triggers/schedules the 1st script to fetch the PDFs
2. Another one that watches the directory on the FTP server and runs the Ruby parsing script when the directory changes. It would then get the JSON output somehow and push that through the normal data processing pipeline that roach typically uses (i.e. crawler -> Redis -> RabbitMQ).
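
A rough sketch of what the second (watcher) job could look like, written in Python only to illustrate the flow. Everything named here is an assumption: `parse_pdf.rb` stands in for the existing Ruby parser, the script is assumed to print its JSON to stdout, and `publish` is a placeholder for whatever hands data to the roach pipeline. It polls a locally mounted directory rather than talking to FTP directly.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, List, Set

# Hypothetical command for the existing Ruby parser; the real script name
# and its output format are not given in this thread.
PARSER_CMD = ["ruby", "parse_pdf.rb"]


def parse_with_ruby(pdf_path: Path) -> dict:
    """Run the (assumed) Ruby parser on one PDF and decode its JSON stdout."""
    result = subprocess.run(
        PARSER_CMD + [str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the Ruby script exits with a non-zero status
    )
    return json.loads(result.stdout)


def find_new_pdfs(directory: Path, seen: Set[str]) -> List[Path]:
    """Return PDFs in `directory` whose file names are not in `seen`, sorted by path."""
    return sorted(p for p in directory.glob("*.pdf") if p.name not in seen)


def poll_once(
    directory: Path,
    seen: Set[str],
    parse: Callable[[Path], dict],
    publish: Callable[[dict], None],
) -> int:
    """Parse and publish every unseen PDF once; return how many were handled."""
    handled = 0
    for pdf in find_new_pdfs(directory, seen):
        publish(parse(pdf))
        # Mark as seen only after publishing, so a failed file is retried
        # on the next poll instead of being silently skipped.
        seen.add(pdf.name)
        handled += 1
    return handled
```

A scheduler would call `poll_once(Path("/mnt/ftp/pdfs"), seen, parse_with_ruby, publish)` on an interval (the path is made up). Passing `parse` and `publish` in as arguments keeps the directory-watching logic testable without Ruby or a message queue installed.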
