nasa-jpl-memex / memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data
BSD 2-Clause "Simplified" License

Nutch errors on VirtualBox shared folders #558

Closed: ahmadia closed this issue 9 years ago

ahmadia commented 9 years ago

By default, Vagrant maps the project ("source") directory on the host machine to /vagrant on the guest. This is handy, particularly when you want to make local source changes and see how they affect the deployed machine.

This can break when a program runs inside the shared directory, or when its operations are sensitive to the file system type (VirtualBox shared folders mount as vboxsf, which does not behave like a native Linux file system).
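
As a quick diagnostic, you can ask the JVM what file system a directory actually sits on from inside the guest. This is a minimal sketch; /vagrant is the Vagrant default mount point, and vboxsf is the type VirtualBox shared folders report on a Linux guest:

    import java.io.IOException;
    import java.nio.file.FileStore;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FsTypeCheck {
        public static void main(String[] args) throws IOException {
            // Default Vagrant synced-folder mount point on the guest.
            Path dir = Paths.get(args.length > 0 ? args[0] : "/vagrant");
            FileStore store = Files.getFileStore(dir);
            System.out.printf("%s is on a '%s' file system%n", dir, store.type());
            // VirtualBox shared folders report "vboxsf" on Linux guests.
            if ("vboxsf".equals(store.type())) {
                System.err.println("Warning: this is a shared folder; "
                        + "file-system-sensitive operations may misbehave here.");
            }
        }
    }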

This is the sort of error Brittain noticed last week. Sample output:

Link inversion
/home/vagrant/miniconda/envs/memex/lib/nutch/bin/nutch invertlinks /vagrant/source/resources/crawls/crawl-2/linkdb /vagrant/source/resources/crawls/crawl-2/segments/20150529184732
LinkDb: starting at 2015-05-29 18:53:01
LinkDb: linkdb: /vagrant/source/resources/crawls/crawl-2/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: /vagrant/source/resources/crawls/crawl-2/segments/20150529184732
LinkDb: merging with existing linkdb: /vagrant/source/resources/crawls/crawl-2/linkdb
LinkDb: finished at 2015-05-29 18:53:03, elapsed: 00:00:02
Dedup on crawldb
/home/vagrant/miniconda/envs/memex/lib/nutch/bin/nutch dedup /vagrant/source/resources/crawls/crawl-2/crawldb
Indexing 20150529184732 to index
/home/vagrant/miniconda/envs/memex/lib/nutch/bin/nutch index -Delastic.index=crawl-2 /vagrant/source/resources/crawls/crawl-2/crawldb -linkdb /vagrant/source/resources/crawls/crawl-2/linkdb /vagrant/source/resources/crawls/crawl-2/segments/20150529184732
Indexer: starting at 2015-05-29 18:53:07
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port
    elastic.index : elastic index command 
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

Indexer: java.io.FileNotFoundException: File file:/vagrant/source/resources/crawls/crawl-2/linkdb/current/part-00000/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:116)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:186)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:196)

For now, in Memex Explorer, we're working around the issue by running the crawls in /home/vagrant, which is not a mapped directory; see https://github.com/memex-explorer/memex-explorer/pull/557
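
The pattern behind that fix is simply to keep everything Nutch and Hadoop write off the synced folder. A minimal sketch of the idea (the directory names here are illustrative, not Memex Explorer's actual configuration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CrawlRoot {
        // Hypothetical helper: place crawl output under the unmapped home
        // directory instead of the /vagrant shared folder.
        public static Path unsharedCrawlRoot() throws IOException {
            Path root = Paths.get(System.getProperty("user.home"), "crawls");
            return Files.createDirectories(root);
        }
    }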

The issue is resolved for us, but it has exposed an underlying weakness in the way Nutch interacts with a "hostile" file system, and the Nutch developers might want to take a look at this to harden the crawler against similar failures in the future.
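
One possible hardening, sketched below purely as an illustration (this is not Nutch's actual code), would be a pre-flight check that walks the linkdb through Hadoop's FileSystem API and fails with a clear message before job submission, rather than deep inside split computation:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical pre-flight check: confirm the linkdb's "current" segment
    // contains readable MapFile data before submitting the indexing job.
    public class LinkDbPreflight {
        public static void verify(Configuration conf, Path linkDb) throws IOException {
            FileSystem fs = linkDb.getFileSystem(conf);
            Path current = new Path(linkDb, "current");
            if (!fs.exists(current)) {
                throw new IOException("LinkDb has no 'current' directory: " + current);
            }
            for (FileStatus part : fs.listStatus(current)) {
                Path data = new Path(part.getPath(), "data");
                if (!fs.exists(data)) {
                    throw new IOException("Missing MapFile data under " + part.getPath()
                            + "; is the crawl directory on a shared/remote file system?");
                }
            }
        }
    }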

cc @lewismc @brittainhard @chrismattmann

ahmadia commented 9 years ago

Closing this issue as it's now resolved.

lewismc commented 9 years ago

Aron and Brittain, thank you both for the detailed traces and responses. This is really interesting, and I hope there is something we can do down in Nutch to mitigate it. I'm going to reference this thread over on user@nutch so that people have visibility. Lewis
