voldemort / voldemort

An open source clone of Amazon's Dynamo.
http://project-voldemort.com
Apache License 2.0

Let avro.rec.schema be a file, not a string #284

Closed: elad closed this issue 9 years ago

elad commented 9 years ago

Hello,

Would you be receptive to a pull request that changes the semantics such that avro.rec.schema is a path to a schema file (e.g. avro.rec.schema=/path/to/schema.avsc) rather than a string? It makes more sense to me, especially when working with complex schemas.
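
For illustration, the change would boil down to something like this (a sketch only; Schema.Parser is the Avro 1.5+ API, while the older Avro bundled with Voldemort exposes the static Schema.parse(File) instead):

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;

public class AvroSchemaFromFile {
    // Proposed semantics: avro.rec.schema holds a path to a .avsc file,
    // not the schema text itself.
    public static Schema load(String avroRecSchemaValue) throws IOException {
        return new Schema.Parser().parse(new File(avroRecSchemaValue));
        // On Avro 1.4 this would be: Schema.parse(new File(avroRecSchemaValue));
    }
}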

Note that --is-avro doesn't seem to work (the option parsing is wrong, see #283), so I think there shouldn't be compatibility issues.

FelixGV commented 9 years ago

Hi,

Can you elaborate on what you're trying to accomplish?

Looking at #283, I imagine you're trying to build a Read-Only data set on Hadoop.

Unfortunately, that script, HadoopStoreJobRunner.java, has not been actively maintained for at least 4-5 years; aside from some automatic refactorings, it hasn't been touched. It probably "works", for some definition of working, but I bet it's missing a lot of the improvements that have gone into VoldemortBuildAndPushJob, which is the preferred way of pushing data from Hadoop to Voldemort RO.

Currently, VoldemortBuildAndPushJob is slightly coupled with Azkaban, but it shouldn't be a big deal to decouple it and make it run standalone, so that you can benefit from all of its latest improvements, free of Azkaban. If you need to use Oozie or any other non-Azkaban scheduler, that is likely a more viable strategy than dusting off those old scripts.

Regarding the Avro schema definition, the job should not need to have it defined explicitly, since it uses the one in the AvroSequenceFile's header. In VoldemortBuildAndPushJob, there are also some automated sanity checks that make sure the server's schema/store definition is the same as the one provided in the data (and it will create it properly on the server if it doesn't exist yet).

-F

elad commented 9 years ago

Hey @FelixGV,

Sure! You guessed right - I want to build a small (4-node, 10-15 TB of data) Voldemort cluster with read-only data that will be built on Hadoop and pushed once a month or so.

In my experience, even after applying the fixes (bypassing Kerberos, setting --is-avro, properly passing the schema, using a newer Avro), HadoopStoreJobRunner.java still doesn't work: the job hangs when starting. I was hoping to get some fixes into the official distribution and then discuss how to figure out what the problem is, but now it seems the way to go is (Azkaban's) VoldemortBuildAndPushJob.

Ideally I'd like to avoid setting up another job scheduler and its related dependencies, but that's probably what I'll do, just to verify the setup works before I start decoupling the code.

I agree regarding the schema (it's in the Avro file); it's just that the non-working code seems to require it. Since it's not maintained and doesn't work, I guess this issue can be closed. :)

FelixGV commented 9 years ago

Even though VoldemortBuildAndPushJob extends some Azkaban class, it shouldn't need anything from Azkaban itself in order to work.

It should be trivial to create a VoldemortBuildAndPushJobRunner which has a main() method that reads in params, constructs a Props instance, passes it to the VoldemortBuildAndPushJob's constructor, and then calls run() on it.
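Something along these lines, for instance (an untested sketch; the package names and the exact constructor signatures are assumptions based on the description above):

import java.io.FileInputStream;
import java.util.Properties;

// These package names are assumptions; adjust to wherever Props and
// VoldemortBuildAndPushJob live in your checkout.
import azkaban.common.utils.Props;
import voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob;

public class VoldemortBuildAndPushJobRunner {
    public static void main(String[] args) throws Exception {
        // Read the same key=value properties a scheduler would provide.
        Properties properties = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            properties.load(in);
        }
        // Assumed signatures: Props(Properties) and VoldemortBuildAndPushJob(String, Props).
        Props props = new Props(properties);
        VoldemortBuildAndPushJob job = new VoldemortBuildAndPushJob("build-and-push", props);
        job.run();
    }
}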

I would think that'd be easier than setting up Azkaban. (Setting up Azkaban is not that hard, but definitely not worth it if you don't plan to use it long term).

As for HadoopStoreJobRunner, I think we should just delete it in order to remove confusion... There's no point in trying to run just that independently of the VoldemortSwapJob. You always need both, and doing both is accomplished by the VBnPJob.

My 2 cents.

-F

elad commented 9 years ago

@FelixGV I agree. I started looking into Azkaban just to get a feel for it (not too fond of Oozie...) and set up a solo server. After a bit of fiddling I managed to run the wordcount example, and now I'm trying to run the actual job.

However, Voldemort's use of (very) old dependencies is causing trouble with my Hadoop cluster. I already mentioned above that I have to manually pull in avro-1.7.7, and I've seen HDFS incompatibilities too (IPC version 4 vs. 9). But now I see a more severe problem with protobuf: the version used by Voldemort is 2.3.0 while Hadoop uses 2.5.0. They're binary incompatible, and it looks like I'll have to resort to shading it for Voldemort.

What's the word on updating dependencies here? :) It would make using and contributing to these tools (outside LinkedIn...) much easier.

FelixGV commented 9 years ago

Yeah, the old dependencies are definitely annoying.

What we do internally at LinkedIn is that we shade protobuf to 2.3 and avro to 1.4, so that whatever environment we run on, things always work.

Unfortunately, this shading is currently done as part of an internal build process we have... It should really be put back into open-source though.

I'm sorry that this is kind of a mess right now ): ...

-F

elad commented 9 years ago

> What we do internally at LinkedIn is that we shade protobuf to 2.3 and avro to 1.4, so that whatever environment we run on, things always work.

It's a relief to hear that's what you guys do! Any chance you could elaborate a bit, maybe even throw in a few pieces from the (gradle?) build scripts to show how that gets done? Do you include both versions of avro and protobuf and shade/transform 1.4 and 2.3 (respectively) for internal uses? Or do you just create a "fat" jar with up-to-date versions of everything except for these two libraries?

> I'm sorry that this is kind of a mess right now ): ...

It's all good. :) Once I get a setup running, I don't mind doing a little writeup about it in the wiki or wherever. I've seen a few folks having similar issues, and I think it's a shame, because Voldemort has very interesting features.

FelixGV commented 9 years ago

Hi @elad,

I put together this small run script: https://github.com/FelixGV/voldemort/blob/bnp_runner_script/bin/run-bnp.sh

I haven't tested it completely. It might not work as-is, but it's perhaps a good starting point.

We probably still need to add the shading stuff to the build for it to run on Hadoop 2 clusters (Hadoop 1 should be fine). For that, I used com.github.johnrengelman.shadow; I can try to get to that later, if you haven't already. I include just Voldemort's versions of our dependencies in a fat jar, but with their package names changed so that they don't clash with what the Hadoop run-time already provides.

-F

elad commented 9 years ago

Hey @FelixGV,

Thanks for the script!

> We probably still need to add the shading stuff to the build for it to run on Hadoop 2 clusters (Hadoop 1 should be fine). For that, I used com.github.johnrengelman.shadow; I can try to get to that later, if you haven't already. I include just Voldemort's versions of our dependencies in a fat jar, but with their package names changed so that they don't clash with what the Hadoop run-time already provides.

It would be great if you could put that up as a gist or even just paste it here. I'd really like to have as few deviations from the official version as possible. From a quick glance at the mailing list and issues, I can tell I'm probably neither the first nor the last who's interested in Voldemort at least having a chance of working with Hadoop 2 out of the box. ;)

elad commented 9 years ago

@FelixGV also, is VoldemortBuildAndPushJobRunner missing or is it something I should write myself? I don't see it anywhere.

FelixGV commented 9 years ago

Well, it's not merged into master, since I haven't fully verified that it works all right. It's in the branch I linked to earlier.

elad commented 9 years ago

I got it to work using the following changes to build.gradle:

// From https://github.com/voldemort/voldemort/pull/274
buildscript {
    repositories { jcenter() }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.1'
    }
}

apply plugin: 'java' // or 'groovy'. Must be explicitly applied.
apply plugin: 'com.github.johnrengelman.shadow'

shadowJar {
    from sourceSets.main.output, sourceSets.contrib.output, sourceSets.test.output, sourceSets.main.resources

    // Required when working in a Hadoop 2.x environment: bundle only the
    // dependencies that clash with the Hadoop run-time, relocated below.
    dependencies {
        include(dependency('org.apache.avro:avro:1.4.0'))
        include(dependency('com.google.protobuf:protobuf-java:2.3.0'))
    }
    relocate 'com.google.protobuf', 'voldemort.com.google.protobuf'
    // relocate 'org.apache.avro', 'voldemort.org.apache.avro'
}

Getting it to work also required downloading the java/hadoop jobtypes for Azkaban and using a particular ordering of the classpath parameter in the job configuration. I will soon submit a pull request and write it all up in the wiki so it's easier for others who are having trouble as well.

However, now I'm seeing a much more worrying issue: using the example configuration files (with a minor port number change), the single-node cluster works fine, but when I load data into the two-node cluster I get data loss. That is, some of the keys return null despite being in the source data and showing up when grepping the actual data files.

Any thoughts on what may be causing this?

FelixGV commented 9 years ago

Cool, I'm glad you got the shadow build running!

Can you post your job logs, server/store configs and client-side error message in a gist? It shouldn't return null when you have more nodes, of course...

elad commented 9 years ago

Hey @FelixGV, sure!

First, I'd like to point out two things that really made a difference:

  1. The classpath in the azkaban job file should be in the following form:

    classpath=VOLDEMORT_HOME/build/libs/voldemort-repo-1.9.18-all.jar:LOCAL_HADOOP_CLASSPATH:VOLDEMORT_HOME/lib/*:HADOOP_CONF

    Where VOLDEMORT_HOME is where the release was extracted or the repository cloned; LOCAL_HADOOP_CLASSPATH is one or more local directories with the Hadoop jars needed to talk to the cluster; and HADOOP_CONF is usually /etc/hadoop/conf. The order matters: the shaded Voldemort jar must load first, then the Hadoop libraries (which take precedence over unshaded dependencies), then the rest of Voldemort's dependencies, and finally the Hadoop cluster configuration, which avoids "Wrong FS" ("got hdfs... but expected file...") exceptions.

  2. HDFS paths (for build.input.path and build.output.dir for example) had to be in the following form:

    build.input.path=hdfs://NAMENODE:NAMENODE_PORT/path/to/data
    build.output.dir=hdfs://NAMENODE:NAMENODE_PORT/tmp

    Where NAMENODE is the Hadoop NameNode and NAMENODE_PORT is usually 50070. I'm sure there's a way to allow simple paths, but I haven't figured it out yet, and as you'll see below I'm facing more severe issues. :) A rough job file combining these settings is sketched right after this list.
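
Putting the pieces together, the Azkaban job file ends up looking roughly like this (a sketch with placeholder values; type and job.class follow Azkaban's standard java jobtype, and the fully-qualified class name is my assumption):

type=java
job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
classpath=VOLDEMORT_HOME/build/libs/voldemort-repo-1.9.18-all.jar:LOCAL_HADOOP_CLASSPATH:VOLDEMORT_HOME/lib/*:HADOOP_CONF
build.input.path=hdfs://NAMENODE:NAMENODE_PORT/path/to/data
build.output.dir=hdfs://NAMENODE:NAMENODE_PORT/tmp
push.cluster=tcp://localhost:6666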

Now to the problem described above.

I'm using Voldemort 1.9.18 with the example configuration from two_node_cluster, with the following three lines added to each node's server.properties:

enable.readonly.engine=true
file.fetcher.class=voldemort.store.readonly.fetcher.HdfsFetcher
readonly.hadoop.config.path=/etc/hadoop/conf

Everything else is identical in both server.properties and cluster.xml.

I start an instance of the server in two terminals using bin/voldemort-server.sh two_node_cluster/node_0 and bin/voldemort-server.sh two_node_cluster/node_1. The servers come up fine. (Note that they're both on the same machine using the same IP but different ports; I assume that's fine because the unchanged cluster configuration puts both nodes on "localhost".)

I then submit a job to Azkaban with some sample data (530 KB in size). The job finishes successfully, but like I said, a sanity test that tries to "get" every key in the data shows that out of 100 keys, 42 are missing. Looking at the data size in each node's directory, I see 36 KB in node_0 and 3.8 MB in node_1. So for some reason, data doesn't get to node 0 (or, alternatively, to partitions 0 and 1). The client doesn't show any errors, neither during build-and-push nor when reading values with voldemort-shell.sh.
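
The sanity test itself is essentially the following loop (a rough sketch; string keys and values are assumed for illustration, and loadSourceKeys() is a hypothetical helper that yields the keys found in the source data):

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class SanityCheck {
    public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("foobar");
        int missing = 0;
        for (String key : loadSourceKeys()) {
            Versioned<String> value = client.get(key);
            if (value == null)
                missing++; // key present in the source data but absent from the store
        }
        System.out.println("missing keys: " + missing);
    }

    private static Iterable<String> loadSourceKeys() {
        // Hypothetical: iterate the keys found in the source data.
        return java.util.Collections.emptyList();
    }
}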

Here's the server log for node 0, the one that seems to be "empty":

[07:40:39,924 voldemort.store.metadata.MetadataStore] INFO metadata init(). [main]
[07:40:40,090 voldemort.cluster.Node] WARN admin-port not defined for node:0 using default value(socket_port + 1):6667 [main]
[07:40:40,090 voldemort.cluster.Node] WARN admin-port not defined for node:1 using default value(socket_port + 1):6671 [main]
[07:40:40,859 voldemort.store.metadata.MetadataStore] INFO Updating routing strategy for all stores [main]
[07:40:41,015 voldemort.server.VoldemortServer] INFO Using NIO Connector for Admin Service. [main]
[07:40:41,059 voldemort.server.VoldemortServer] INFO Using NIO Connector. [main]
[07:40:41,059 voldemort.common.service.VoldemortService] INFO Starting voldemort-server [main]
[07:40:41,109 voldemort.utils.JNAUtils] WARN Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out with higher Young gen stalls Increase RLIMIT_MEMLOCK or run Voldemort as root. [main]
[07:40:41,109 voldemort.server.VoldemortServer] INFO Starting 6 services. [main]
[07:40:41,109 voldemort.common.service.VoldemortService] INFO Starting storage-service [main]
[07:40:41,234 voldemort.server.storage.StorageService] INFO Initializing bdb storage engine. [main]
[07:40:41,239 voldemort.server.storage.StorageService] INFO Initializing read-only storage engine. [main]
[07:40:41,261 voldemort.server.storage.StorageService] INFO Opening system store 'voldsys$_client_registry' (memory). [main]
[07:40:41,301 voldemort.server.storage.StorageService] INFO Scheduling data retention cleanup job for store 'voldsys$_client_registry' at Tue Jul 28 00:00:00 CDT 2015 with retention scan throttle rate:2147483647 Entries/second. [main]
[07:40:41,307 voldemort.server.storage.StorageService] INFO Opening system store 'voldsys$_metadata_version_persistence' (file-backed-cache). [main]
[07:40:41,341 voldemort.server.storage.StorageService] INFO Opening system store 'voldsys$_store_quotas' (file-backed-cache). [main]
[07:40:41,373 voldemort.server.storage.StorageService] INFO Initializing the slop store using bdb [main]
[07:40:41,376 voldemort.store.bdb.BdbStorageConfiguration] INFO Creating BDB data directory '/home/elad/voldemort/two_node_cluster/node_0/data/bdb/slop. [main]
[07:40:41,907 voldemort.store.bdb.BdbStorageConfiguration] INFO Creating environment for slop:  [main]
[07:40:41,907 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB cache size = 1073741824 [main]
[07:40:41,907 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.cleaner.threads = 1 [main]
[07:40:41,908 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.cleaner.minUtilization = 50 [main]
[07:40:41,908 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.cleaner.minFileUtilization = 0 [main]
[07:40:41,908 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.log.fileMax = 62914560 [main]
[07:40:41,908 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB allowCreate=true,cacheSize=1073741824,txnNoSync=false,txnWriteNoSync=false,{je.tree.maxDelta=100, je.cleaner.bytesInterval=31457280, je.maxMemory=1073741824, je.cleaner.adjustUtilization=false, je.log.faultReadSize=2048, je.env.isTransactional=true, je.cleaner.maxBatchFiles=0, je.cleaner.minUtilization=50, je.env.recoveryForceCheckpoint=false, je.tree.binDelta=75, je.cleaner.fetchObsoleteSize=true, je.cleaner.lookAheadCacheSize=8192, je.lock.nLockTables=7, je.cleaner.threads=1, je.log.iteratorReadSize=8192, je.lock.timeout=500 MILLISECONDS, je.checkpointer.wakeupInterval=30000000, je.checkpointer.highPriority=false, je.checkpointer.bytesInterval=209715200, je.cleaner.lazyMigration=false, je.cleaner.minFileUtilization=0, je.txn.durability=NO_SYNC,NO_SYNC,SIMPLE_MAJORITY, je.sharedCache=true, je.env.fairLatches=false, je.log.fileMax=62914560}, [main]
[07:40:41,978 voldemort.server.storage.StorageService] INFO Initializing slop pusher job type streaming at Mon Jul 27 07:45:41 CDT 2015 [main]
[07:40:41,980 voldemort.server.storage.StorageService] INFO Initializing Slop Purge job [main]
[07:40:41,982 voldemort.server.storage.StorageService] INFO Initializing repair job. [main]
[07:40:41,984 voldemort.server.storage.StorageService] INFO Intializing prune job [main]
[07:40:41,988 voldemort.server.storage.StorageService] INFO Initializing stores: [main]
[07:40:41,988 voldemort.server.storage.StorageService] INFO Validating schemas: [main]
[07:40:42,097 voldemort.server.storage.StorageService] INFO All stores initialized. [main]
[07:40:42,097 voldemort.common.service.VoldemortService] INFO Starting scheduler-service [main]
[07:40:42,097 voldemort.common.service.VoldemortService] INFO Starting async-scheduler [main]
[07:40:42,097 voldemort.server.protocol.admin.AsyncOperationService] INFO Starting asyncOperationRunner [main]
[07:40:42,097 voldemort.common.service.VoldemortService] INFO Starting rebalance-service [main]
[07:40:42,097 voldemort.common.service.VoldemortService] INFO Starting socket-service [main]
[07:40:42,097 voldemort.server.niosocket.NioSocketService] INFO Starting Voldemort NIO socket server (admin-server) on port 6667 [main]
[07:40:42,113 voldemort.server.niosocket.NioSocketService] INFO Server now listening for connections on port 6667 [voldemort-admin-server.Acceptor]
[07:40:42,118 voldemort.common.service.VoldemortService] INFO Starting jmx-service [main]
[07:40:42,142 voldemort.server.jmx.JmxService] WARN Overwriting mbean voldemort.server.niosocket:type=NioSocketService [main]
[07:40:42,146 voldemort.common.service.VoldemortService] INFO Starting http-service [main]
[07:40:42,205 voldemort.server.http.gui.ReadOnlyStoreManagementServlet] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [main]
[07:40:42,208 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [main]
[07:40:42,235 voldemort.server.http.gui.ReadOnlyStoreManagementServlet] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [main]
[07:40:42,235 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [main]
[07:40:42,240 voldemort.server.http.HttpService] INFO HTTP service started on port 8081 [main]
[07:40:42,240 voldemort.common.service.VoldemortService] INFO Starting socket-service [main]
[07:40:42,240 voldemort.server.niosocket.NioSocketService] INFO Starting Voldemort NIO socket server (nio-socket-server) on port 6666 [main]
[07:40:42,252 voldemort.server.niosocket.NioSocketService] INFO Server now listening for connections on port 6666 [voldemort-nio-socket-server.Acceptor]
[07:40:42,255 voldemort.server.VoldemortServer] INFO Startup completed in 1146 ms. [main]
[07:45:37,688 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44442,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t1]
[07:45:37,791 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44443,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t2]
[07:45:38,247 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t1]
[07:45:38,247 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t1]
[07:45:38,247 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=51769,localport=6667]: admin-v1 [voldemort-admin-server-t1]
[07:45:38,559 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Received request for supported compression codecs [voldemort-admin-server-t1]
[07:45:41,978 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Started streaming slop pusher job at Mon Jul 27 07:45:41 CDT 2015 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,066 voldemort.client.AbstractStoreClientFactory] INFO Client zone-id [-1] Attempting to get raw store [voldsys$_metadata_version_persistence]  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,077 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44452,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t3]
[07:45:42,099 voldemort.client.AbstractStoreClientFactory] INFO Client zone-id [-1] Attempting to get raw store [voldsys$_store_quotas]  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,107 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Acquiring lock to perform streaming slop pusher job  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,107 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Acquired lock to perform streaming slop pusher job  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,119 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 0 - Succeeded - 0 - Attempted - 0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,119 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 1 - Succeeded - 0 - Attempted - 0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:42,119 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Completed streaming slop pusher job which started at Mon Jul 27 07:45:41 CDT 2015 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,743 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44453,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t4]
[07:46:42,668 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44460,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t5]
[07:46:42,756 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44461,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t6]
[07:46:43,198 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t2]
[07:46:43,198 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t2]
[07:46:43,198 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=51788,localport=6667]: admin-v1 [voldemort-admin-server-t2]
[07:46:43,202 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Received request for supported compression codecs [voldemort-admin-server-t2]
[07:46:44,682 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Adding new store 'foobar' [voldemort-admin-server-t2]
[07:46:44,682 voldemort.server.storage.StorageService] INFO Opening store 'foobar' (read-only). [voldemort-admin-server-t2]
[07:46:44,733 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Creating symbolic link for 'foobar' using directory /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-0 [voldemort-admin-server-t2]
[07:46:44,869 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 0 and replicating partition 0 [voldemort-admin-server-t2]
[07:46:44,944 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 1 and replicating partition 0 [voldemort-admin-server-t2]
[07:46:44,948 voldemort.store.readonly.StoreVersionManager] INFO Successfully synced internal state from file-system: StoreVersionManager { currentVersion: 0, versionToEnabledMap: {0: true}, rootDir: two_node_cluster/node_0/data/read-only/foobar } [voldemort-admin-server-t2]
[07:46:44,973 voldemort.utils.StoreDefinitionUtils] INFO Validating schema for store:  foobar [voldemort-admin-server-t2]
[07:46:45,267 voldemort.store.metadata.MetadataStore] INFO Updating routing strategy for all stores [voldemort-admin-server-t2]
[07:46:45,268 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Successfully added new store 'foobar' [voldemort-admin-server-t2]
[07:46:49,771 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Received GetHighAvailabilitySettingsRequest [voldemort-admin-server-t2]
[07:46:49,786 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t3]
[07:46:49,786 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t3]
[07:46:49,787 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=51822,localport=6667]: admin-v1 [voldemort-admin-server-t3]
[07:46:49,795 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44500,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t7]
[07:46:49,805 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Received GetHighAvailabilitySettingsRequest [voldemort-admin-server-t3]
[07:46:49,815 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t4]
[07:46:49,815 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t4]
[07:46:49,815 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=51824,localport=6667]: admin-v1 [voldemort-admin-server-t4]
[07:46:49,827 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=44503,localport=6666]: voldemort-native-v1 [voldemort-nio-socket-server-t8]
[07:46:49,834 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Started executing fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-0 for RO store 'foobar' version 1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,848 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t5]
[07:46:49,848 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t5]
[07:46:49,848 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=51827,localport=6667]: admin-v1 [voldemort-admin-server-t5]
[07:46:49,862 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Started executing fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-0 for RO store 'foobar' version 1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,902 voldemort.store.readonly.fetcher.HdfsFetcher] INFO URL : hftp://hadoop-namenode:50070/tmp/localhost/node-0 and hftp protocol enabled = true [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,902 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Hadoop path = /etc/hadoop/conf , keytab path = two_node_cluster/node_0/config/voldemrt.headless.keytab , kerberos principal = voldemrt [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,903 voldemort.store.readonly.fetcher.HdfsFetcher] INFO URL : hftp://hadoop-namenode:50070/tmp/localhost/node-0 and hftp protocol enabled = true [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,903 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Hadoop path = /etc/hadoop/conf , keytab path = two_node_cluster/node_0/config/voldemrt.headless.keytab , kerberos principal = voldemrt [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,484 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Starting fetch for : hftp://hadoop-namenode:50070/tmp/localhost/node-0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,489 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Starting fetch for : hftp://hadoop-namenode:50070/tmp/localhost/node-0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,502 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/.metadata to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,502 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/.metadata to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,540 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/.metadata to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,541 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/.metadata to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,599 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.data to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.data [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,599 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.data to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.data [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,322 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.data to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.data [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,436 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.index to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.index [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,437 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.data to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.data [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,449 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.index to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.index [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,463 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.index to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.index [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,482 voldemort.store.readonly.fetcher.HdfsDirectory] INFO Checksum generated from streaming - e07a41ffa2b0d693560844b530ffe758 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,482 voldemort.store.readonly.fetcher.HdfsDirectory] INFO Checksum on file - e07a41ffa2b0d693560844b530ffe758 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,482 voldemort.store.readonly.fetcher.HdfsDirectory] INFO Check-sum verification - true [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,482 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Completed fetch : hftp://hadoop-namenode:50070/tmp/localhost/node-0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,523 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Successfully executed fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-0 for RO store 'foobar' [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,523 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-0/1_0_0.index to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1/1_0_0.index [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,557 voldemort.store.readonly.fetcher.HdfsDirectory] INFO Checksum generated from streaming - e07a41ffa2b0d693560844b530ffe758 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,557 voldemort.store.readonly.fetcher.HdfsDirectory] INFO Checksum on file - e07a41ffa2b0d693560844b530ffe758 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,557 voldemort.store.readonly.fetcher.HdfsDirectory] INFO Check-sum verification - true [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,557 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Completed fetch : hftp://hadoop-namenode:50070/tmp/localhost/node-0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:22,572 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Successfully executed fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-0 for RO store 'foobar' [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:53,349 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1' [voldemort-admin-server-t4]
[07:47:53,349 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swapping files for store 'foobar' to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1 [voldemort-admin-server-t4]
[07:47:53,350 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Acquiring write lock on 'foobar': [voldemort-admin-server-t4]
[07:47:53,351 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Opening primary files for store 'foobar' at /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1 [voldemort-admin-server-t4]
[07:47:53,352 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Creating symbolic link for 'foobar' using directory /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1 [voldemort-admin-server-t4]
[07:47:53,372 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1' [voldemort-admin-server-t5]
[07:47:53,372 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swapping files for store 'foobar' to /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1 [voldemort-admin-server-t5]
[07:47:53,372 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Acquiring write lock on 'foobar': [voldemort-admin-server-t5]
[07:47:53,435 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 0 and replicating partition 0 [voldemort-admin-server-t4]
[07:47:53,442 voldemort.store.readonly.StoreVersionManager] INFO Successfully synced internal state from file-system: StoreVersionManager { currentVersion: 1, versionToEnabledMap: {0: true, 1: true}, rootDir: two_node_cluster/node_0/data/read-only/foobar } [voldemort-admin-server-t4]
[07:47:53,442 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swap operation completed successfully on store foobar, releasing lock. [voldemort-admin-server-t4]
[07:47:53,443 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping swapped RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1' [voldemort-admin-server-t4]
[07:47:53,444 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Opening primary files for store 'foobar' at /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1 [voldemort-admin-server-t5]
[07:47:53,444 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Creating symbolic link for 'foobar' using directory /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1 [voldemort-admin-server-t5]
[07:47:53,450 voldemort.store.readonly.StoreVersionManager] INFO Successfully synced internal state from file-system: StoreVersionManager { currentVersion: 1, versionToEnabledMap: {0: true, 1: true}, rootDir: two_node_cluster/node_0/data/read-only/foobar } [voldemort-admin-server-t5]
[07:47:53,450 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swap operation completed successfully on store foobar, releasing lock. [voldemort-admin-server-t5]
[07:47:53,451 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping swapped RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1' [voldemort-admin-server-t5]

Here's the server log for node 1, the one with the data:

[07:40:45,881 voldemort.store.metadata.MetadataStore] INFO metadata init(). [main]
[07:40:46,022 voldemort.cluster.Node] WARN admin-port not defined for node:0 using default value(socket_port + 1):6667 [main]
[07:40:46,023 voldemort.cluster.Node] WARN admin-port not defined for node:1 using default value(socket_port + 1):6671 [main]
[07:40:46,682 voldemort.store.metadata.MetadataStore] INFO Updating routing strategy for all stores [main]
[07:40:46,788 voldemort.server.VoldemortServer] INFO Using NIO Connector for Admin Service. [main]
[07:40:46,830 voldemort.server.VoldemortServer] INFO Using NIO Connector. [main]
[07:40:46,830 voldemort.common.service.VoldemortService] INFO Starting voldemort-server [main]
[07:40:46,871 voldemort.utils.JNAUtils] WARN Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out with higher Young gen stalls Increase RLIMIT_MEMLOCK or run Voldemort as root. [main]
[07:40:46,871 voldemort.server.VoldemortServer] INFO Starting 6 services. [main]
[07:40:46,871 voldemort.common.service.VoldemortService] INFO Starting storage-service [main]
[07:40:46,997 voldemort.server.storage.StorageService] INFO Initializing bdb storage engine. [main]
[07:40:47,002 voldemort.server.storage.StorageService] INFO Initializing read-only storage engine. [main]
[07:40:47,023 voldemort.server.storage.StorageService] INFO Opening system store 'voldsys$_client_registry' (memory). [main]
[07:40:47,064 voldemort.server.storage.StorageService] INFO Scheduling data retention cleanup job for store 'voldsys$_client_registry' at Tue Jul 28 00:00:00 CDT 2015 with retention scan throttle rate:2147483647 Entries/second. [main]
[07:40:47,069 voldemort.server.storage.StorageService] INFO Opening system store 'voldsys$_metadata_version_persistence' (file-backed-cache). [main]
[07:40:47,103 voldemort.server.storage.StorageService] INFO Opening system store 'voldsys$_store_quotas' (file-backed-cache). [main]
[07:40:47,136 voldemort.server.storage.StorageService] INFO Initializing the slop store using bdb [main]
[07:40:47,138 voldemort.store.bdb.BdbStorageConfiguration] INFO Creating BDB data directory '/home/elad/voldemort/two_node_cluster/node_1/data/bdb/slop. [main]
[07:40:47,591 voldemort.store.bdb.BdbStorageConfiguration] INFO Creating environment for slop:  [main]
[07:40:47,591 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB cache size = 1073741824 [main]
[07:40:47,591 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.cleaner.threads = 1 [main]
[07:40:47,591 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.cleaner.minUtilization = 50 [main]
[07:40:47,591 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.cleaner.minFileUtilization = 0 [main]
[07:40:47,591 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB je.log.fileMax = 62914560 [main]
[07:40:47,592 voldemort.store.bdb.BdbStorageConfiguration] INFO     BDB allowCreate=true,cacheSize=1073741824,txnNoSync=false,txnWriteNoSync=false,{je.tree.maxDelta=100, je.cleaner.bytesInterval=31457280, je.maxMemory=1073741824, je.cleaner.adjustUtilization=false, je.log.faultReadSize=2048, je.env.isTransactional=true, je.cleaner.maxBatchFiles=0, je.cleaner.minUtilization=50, je.env.recoveryForceCheckpoint=false, je.tree.binDelta=75, je.cleaner.fetchObsoleteSize=true, je.cleaner.lookAheadCacheSize=8192, je.lock.nLockTables=7, je.cleaner.threads=1, je.log.iteratorReadSize=8192, je.lock.timeout=500 MILLISECONDS, je.checkpointer.wakeupInterval=30000000, je.checkpointer.highPriority=false, je.checkpointer.bytesInterval=209715200, je.cleaner.lazyMigration=false, je.cleaner.minFileUtilization=0, je.txn.durability=NO_SYNC,NO_SYNC,SIMPLE_MAJORITY, je.sharedCache=true, je.env.fairLatches=false, je.log.fileMax=62914560}, [main]
[07:40:47,662 voldemort.server.storage.StorageService] INFO Initializing slop pusher job type streaming at Mon Jul 27 07:45:47 CDT 2015 [main]
[07:40:47,664 voldemort.server.storage.StorageService] INFO Initializing Slop Purge job [main]
[07:40:47,666 voldemort.server.storage.StorageService] INFO Initializing repair job. [main]
[07:40:47,668 voldemort.server.storage.StorageService] INFO Intializing prune job [main]
[07:40:47,671 voldemort.server.storage.StorageService] INFO Initializing stores: [main]
[07:40:47,672 voldemort.server.storage.StorageService] INFO Validating schemas: [main]
[07:40:47,853 voldemort.server.storage.StorageService] INFO All stores initialized. [main]
[07:40:47,853 voldemort.common.service.VoldemortService] INFO Starting scheduler-service [main]
[07:40:47,853 voldemort.common.service.VoldemortService] INFO Starting async-scheduler [main]
[07:40:47,853 voldemort.server.protocol.admin.AsyncOperationService] INFO Starting asyncOperationRunner [main]
[07:40:47,853 voldemort.common.service.VoldemortService] INFO Starting rebalance-service [main]
[07:40:47,854 voldemort.common.service.VoldemortService] INFO Starting socket-service [main]
[07:40:47,854 voldemort.server.niosocket.NioSocketService] INFO Starting Voldemort NIO socket server (admin-server) on port 6671 [main]
[07:40:47,870 voldemort.server.niosocket.NioSocketService] INFO Server now listening for connections on port 6671 [voldemort-admin-server.Acceptor]
[07:40:47,875 voldemort.common.service.VoldemortService] INFO Starting jmx-service [main]
[07:40:47,899 voldemort.server.jmx.JmxService] WARN Overwriting mbean voldemort.server.niosocket:type=NioSocketService [main]
[07:40:47,902 voldemort.common.service.VoldemortService] INFO Starting http-service [main]
[07:40:47,961 voldemort.server.http.gui.ReadOnlyStoreManagementServlet] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [main]
[07:40:47,964 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [main]
[07:40:47,991 voldemort.server.http.gui.ReadOnlyStoreManagementServlet] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [main]
[07:40:47,991 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [main]
[07:40:47,995 voldemort.server.http.HttpService] INFO HTTP service started on port 8085 [main]
[07:40:47,995 voldemort.common.service.VoldemortService] INFO Starting socket-service [main]
[07:40:47,995 voldemort.server.niosocket.NioSocketService] INFO Starting Voldemort NIO socket server (nio-socket-server) on port 6670 [main]
[07:40:48,007 voldemort.server.niosocket.NioSocketService] INFO Server now listening for connections on port 6670 [voldemort-nio-socket-server.Acceptor]
[07:40:48,010 voldemort.server.VoldemortServer] INFO Startup completed in 1139 ms. [main]
[07:45:37,888 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=37592,localport=6670]: voldemort-native-v1 [voldemort-nio-socket-server-t1]
[07:45:37,918 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=37593,localport=6670]: voldemort-native-v1 [voldemort-nio-socket-server-t2]
[07:45:47,663 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Started streaming slop pusher job at Mon Jul 27 07:45:47 CDT 2015 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,730 voldemort.client.AbstractStoreClientFactory] INFO Client zone-id [-1] Attempting to get raw store [voldsys$_metadata_version_persistence]  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,765 voldemort.client.AbstractStoreClientFactory] INFO Client zone-id [-1] Attempting to get raw store [voldsys$_store_quotas]  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,773 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Acquiring lock to perform streaming slop pusher job  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,773 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Acquired lock to perform streaming slop pusher job  [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,784 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 0 - Succeeded - 0 - Attempted - 0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,784 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Slops to node 1 - Succeeded - 0 - Attempted - 0 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:45:47,784 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Completed streaming slop pusher job which started at Mon Jul 27 07:45:47 CDT 2015 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:42,848 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=37610,localport=6670]: voldemort-native-v1 [voldemort-nio-socket-server-t3]
[07:46:42,875 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=37611,localport=6670]: voldemort-native-v1 [voldemort-nio-socket-server-t4]
[07:46:45,286 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t1]
[07:46:45,286 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t1]
[07:46:45,286 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=43855,localport=6671]: admin-v1 [voldemort-admin-server-t1]
[07:46:45,620 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Adding new store 'foobar' [voldemort-admin-server-t1]
[07:46:45,620 voldemort.server.storage.StorageService] INFO Opening store 'foobar' (read-only). [voldemort-admin-server-t1]
[07:46:45,855 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Creating symbolic link for 'foobar' using directory /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-0 [voldemort-admin-server-t1]
[07:46:46,121 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 2 and replicating partition 0 [voldemort-admin-server-t1]
[07:46:46,197 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 3 and replicating partition 0 [voldemort-admin-server-t1]
[07:46:46,201 voldemort.store.readonly.StoreVersionManager] INFO Successfully synced internal state from file-system: StoreVersionManager { currentVersion: 0, versionToEnabledMap: {0: true}, rootDir: two_node_cluster/node_1/data/read-only/foobar } [voldemort-admin-server-t1]
[07:46:46,226 voldemort.utils.StoreDefinitionUtils] INFO Validating schema for store:  foobar [voldemort-admin-server-t1]
[07:46:46,520 voldemort.store.metadata.MetadataStore] INFO Updating routing strategy for all stores [voldemort-admin-server-t1]
[07:46:46,521 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Successfully added new store 'foobar' [voldemort-admin-server-t1]
[07:46:49,826 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t2]
[07:46:49,826 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t2]
[07:46:49,826 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=43885,localport=6671]: admin-v1 [voldemort-admin-server-t2]
[07:46:49,834 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Started executing fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-1 for RO store 'foobar' version 1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,858 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Loading fetcher voldemort.store.readonly.fetcher.HdfsFetcher [voldemort-admin-server-t3]
[07:46:49,858 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Created HdfsFetcher: no throttler, buffer size = 65536 bytes, reporting interval = 26214400 bytes, fetcher socket timeout = 1800000 ms. [voldemort-admin-server-t3]
[07:46:49,858 voldemort.server.niosocket.AsyncRequestHandler] INFO Protocol negotiated for Socket[addr=/127.0.0.1,port=43888,localport=6671]: admin-v1 [voldemort-admin-server-t3]
[07:46:49,862 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Started executing fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-1 for RO store 'foobar' version 1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,902 voldemort.store.readonly.fetcher.HdfsFetcher] INFO URL : hftp://hadoop-namenode:50070/tmp/localhost/node-1 and hftp protocol enabled = true [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,902 voldemort.store.readonly.fetcher.HdfsFetcher] INFO URL : hftp://hadoop-namenode:50070/tmp/localhost/node-1 and hftp protocol enabled = true [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,902 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Hadoop path = /etc/hadoop/conf , keytab path = two_node_cluster/node_1/config/voldemrt.headless.keytab , kerberos principal = voldemrt [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:49,902 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Hadoop path = /etc/hadoop/conf , keytab path = two_node_cluster/node_1/config/voldemrt.headless.keytab , kerberos principal = voldemrt [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,418 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Starting fetch for : hftp://hadoop-namenode:50070/tmp/localhost/node-1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,418 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Starting fetch for : hftp://hadoop-namenode:50070/tmp/localhost/node-1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,433 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-1/.metadata to /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,433 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Attempt 0 at copy of hftp://hadoop-namenode:50070/tmp/localhost/node-1/.metadata to /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,482 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-1/.metadata to /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,540 voldemort.store.readonly.fetcher.BasicFetchStrategy] INFO Completed copy of hftp://hadoop-namenode:50070/tmp/localhost/node-1/.metadata to /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1/.metadata [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,561 voldemort.store.readonly.fetcher.HdfsDirectory] INFO No check-sum verification required [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,561 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Completed fetch : hftp://hadoop-namenode:50070/tmp/localhost/node-1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,561 voldemort.store.readonly.fetcher.HdfsDirectory] INFO No check-sum verification required [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,561 voldemort.store.readonly.fetcher.HdfsFetcher] INFO Completed fetch : hftp://hadoop-namenode:50070/tmp/localhost/node-1 [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,607 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Successfully executed fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-1 for RO store 'foobar' [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:46:50,638 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Successfully executed fetch of hftp://hadoop-namenode:50070/tmp/localhost/node-1 for RO store 'foobar' [java.util.concurrent.ThreadPoolExecutor$Worker]
[07:47:53,446 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1' [voldemort-admin-server-t2]
[07:47:53,446 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swapping files for store 'foobar' to /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1 [voldemort-admin-server-t2]
[07:47:53,447 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Acquiring write lock on 'foobar': [voldemort-admin-server-t2]
[07:47:53,449 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Opening primary files for store 'foobar' at /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1 [voldemort-admin-server-t2]
[07:47:53,449 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Creating symbolic link for 'foobar' using directory /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1 [voldemort-admin-server-t2]
[07:47:53,452 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1' [voldemort-admin-server-t3]
[07:47:53,452 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swapping files for store 'foobar' to /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1 [voldemort-admin-server-t3]
[07:47:53,452 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Acquiring write lock on 'foobar': [voldemort-admin-server-t3]
[07:47:53,602 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 2 and replicating partition 0 [voldemort-admin-server-t2]
[07:47:53,670 voldemort.store.readonly.chunk.ChunkedFileSet] INFO No index or data files found, creating empty files for partition 3 and replicating partition 0 [voldemort-admin-server-t2]
[07:47:53,674 voldemort.store.readonly.StoreVersionManager] INFO Successfully synced internal state from file-system: StoreVersionManager { currentVersion: 1, versionToEnabledMap: {0: true, 1: true}, rootDir: two_node_cluster/node_1/data/read-only/foobar } [voldemort-admin-server-t2]
[07:47:53,675 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swap operation completed successfully on store foobar, releasing lock. [voldemort-admin-server-t2]
[07:47:53,675 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping swapped RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1' [voldemort-admin-server-t2]
[07:47:53,676 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Opening primary files for store 'foobar' at /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1 [voldemort-admin-server-t3]
[07:47:53,676 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Creating symbolic link for 'foobar' using directory /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1 [voldemort-admin-server-t3]
[07:47:53,681 voldemort.store.readonly.StoreVersionManager] INFO Successfully synced internal state from file-system: StoreVersionManager { currentVersion: 1, versionToEnabledMap: {0: true, 1: true}, rootDir: two_node_cluster/node_1/data/read-only/foobar } [voldemort-admin-server-t3]
[07:47:53,681 voldemort.store.readonly.ReadOnlyStorageEngine] INFO Swap operation completed successfully on store foobar, releasing lock. [voldemort-admin-server-t3]
[07:47:53,682 voldemort.server.protocol.admin.AdminServiceRequestHandler] INFO Swapping swapped RO store 'foobar' to version directory '/home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1' [voldemort-admin-server-t3]

If you see anything wrong with the setup or would like more information, please let me know!

FelixGV commented 9 years ago

Hmm, weird,

It seems like you have two pushes happening in parallel on each of your two nodes. Did you put both nodes in the push.cluster config property of your BnP job? (you should put only one URI per cluster, which can be a single node or a VIP, as you wish).
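
For illustration, the relevant line of the BnP job config should look something like this (hostname and port are placeholders):

push.cluster=tcp://voldemort-host:6666

i.e. a single bootstrap URI, even for a multi-node cluster.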

-F

elad commented 9 years ago

Nope, push.cluster is set to only one node (the "bootstrap node"?).

Does it work for you when both nodes are on the same machine?

FelixGV commented 9 years ago

Also, you should enable just BDB or just ReadOnly in your storage engines. In theory they should be able to coexist, but in practice they don't play well together, and BDB will print a bunch of useless server logs (well, useless if you don't use it anyway).

FelixGV commented 9 years ago

I haven't tried both nodes on the same machine, but it should work...

Where is your HDFS cluster? Is it also on the same machine?

Can you print your BnP logs?

-F

FelixGV commented 9 years ago

The other weird thing is that you get "No index or data files found, creating empty files" which shouldn't be the case if BnP executed properly...

-F

elad commented 9 years ago

Also, you should enable just BDB or just ReadOnly in your storage engines. In theory they should be able to coexist, but in practice they don't play well together, and BDB will print a bunch of useless server logs (well, useless if you don't use it anyway).

If I don't enable it, I get the following exception:

[10:28:48,135 voldemort.server.storage.StorageService] INFO Initializing the slop store using bdb [main]
Exception in thread "main" voldemort.utils.ConfigurationException: Attempt to open store slop but bdb storage engine has not been enabled.
    at voldemort.server.storage.StorageService.startInner(StorageService.java:290)
    at voldemort.common.service.AbstractService.start(AbstractService.java:62)
    at voldemort.server.VoldemortServer.startInner(VoldemortServer.java:370)
    at voldemort.common.service.AbstractService.start(AbstractService.java:62)
    at voldemort.server.VoldemortServer.main(VoldemortServer.java:433)

Where is your HDFS cluster? Is it also on the same machine?

The HDFS cluster is made up of a few machines; one of them runs Azkaban and the two Voldemort instances. They're all on the same network, in case it matters...

Can you print your BnP logs?

I don't have the logs from the run above, but I do have logs from other runs with similar results. Here's an interesting excerpt:

27-07-2015 06:53:12 PDT foobar INFO - INFO Down to the last merge-pass, with 1 segments left of total size: 3938021 bytes
27-07-2015 06:53:12 PDT foobar INFO - INFO 
27-07-2015 06:53:12 PDT foobar INFO - INFO  map 100% reduce 0%
27-07-2015 06:53:15 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158/foobar.attempt_local414568512_0001_r_000000_0.index
27-07-2015 06:53:16 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158/foobar.attempt_local414568512_0001_r_000000_0.data
27-07-2015 06:53:16 PDT foobar INFO - INFO Opening hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158/foobar.attempt_local414568512_0001_r_000000_0.index and hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158/foobar.attempt_local414568512_0001_r_000000_0.data for writing.
27-07-2015 06:53:24 PDT foobar INFO - INFO reduce > reduce
27-07-2015 06:53:24 PDT foobar INFO - INFO  map 100% reduce 100%
27-07-2015 06:53:37 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-1
27-07-2015 06:53:44 PDT foobar INFO - INFO Moving hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158/foobar.attempt_local414568512_0001_r_000000_0.index to hdfs://hadoop-cluster:8020/tmp/localhost/node-1/3_0_0.index
27-07-2015 06:53:45 PDT foobar INFO - INFO Moving hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158/foobar.attempt_local414568512_0001_r_000000_0.data to hdfs://hadoop-cluster:8020/tmp/localhost/node-1/3_0_0.data
27-07-2015 06:53:46 PDT foobar INFO - INFO Task:attempt_local414568512_0001_r_000000_0 is done. And is in the process of commiting
27-07-2015 06:53:46 PDT foobar INFO - INFO reduce > reduce
27-07-2015 06:53:46 PDT foobar INFO - INFO Task attempt_local414568512_0001_r_000000_0 is allowed to commit now
27-07-2015 06:53:47 PDT foobar INFO - INFO Saved output of task 'attempt_local414568512_0001_r_000000_0' to hdfs://hadoop-cluster:8020/tmp/vold-build-and-push--4216749106822415158
27-07-2015 06:53:47 PDT foobar INFO - INFO reduce > reduce
27-07-2015 06:53:47 PDT foobar INFO - INFO Task 'attempt_local414568512_0001_r_000000_0' done.
27-07-2015 06:53:51 PDT foobar INFO - INFO Job complete: job_local414568512_0001
27-07-2015 06:53:54 PDT foobar INFO - INFO Counters: 26
27-07-2015 06:53:54 PDT foobar INFO - INFO   File System Counters
27-07-2015 06:53:54 PDT foobar INFO - INFO     FILE: Number of bytes read=19400411
27-07-2015 06:53:54 PDT foobar INFO - INFO     FILE: Number of bytes written=23723072
27-07-2015 06:53:54 PDT foobar INFO - INFO     FILE: Number of read operations=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     FILE: Number of large read operations=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     FILE: Number of write operations=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     HDFS: Number of bytes read=1211906
27-07-2015 06:53:54 PDT foobar INFO - INFO     HDFS: Number of bytes written=3937154
27-07-2015 06:53:54 PDT foobar INFO - INFO     HDFS: Number of read operations=34
27-07-2015 06:53:54 PDT foobar INFO - INFO     HDFS: Number of large read operations=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     HDFS: Number of write operations=18
27-07-2015 06:53:54 PDT foobar INFO - INFO   Map-Reduce Framework
27-07-2015 06:53:54 PDT foobar INFO - INFO     Map input records=100
27-07-2015 06:53:54 PDT foobar INFO - INFO     Map output records=100
27-07-2015 06:53:54 PDT foobar INFO - INFO     Map output bytes=3937610
27-07-2015 06:53:54 PDT foobar INFO - INFO     Input split bytes=143
27-07-2015 06:53:54 PDT foobar INFO - INFO     Combine input records=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Combine output records=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Reduce input groups=100
27-07-2015 06:53:54 PDT foobar INFO - INFO     Reduce shuffle bytes=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Reduce input records=100
27-07-2015 06:53:54 PDT foobar INFO - INFO     Reduce output records=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Spilled Records=200
27-07-2015 06:53:54 PDT foobar INFO - INFO     CPU time spent (ms)=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Physical memory (bytes) snapshot=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Virtual memory (bytes) snapshot=0
27-07-2015 06:53:54 PDT foobar INFO - INFO     Total committed heap usage (bytes)=338821120
27-07-2015 06:53:54 PDT foobar INFO - INFO   File Input Format Counters 
27-07-2015 06:53:54 PDT foobar INFO - INFO     Bytes Read=541643
27-07-2015 06:53:54 PDT foobar INFO - INFO Number of collisions in the job - 0
27-07-2015 06:53:54 PDT foobar INFO - INFO Maximum number of collisions for one entry - 0
27-07-2015 06:53:54 PDT foobar INFO - INFO No data generated for node 0. Generating empty folder
27-07-2015 06:53:56 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-0
27-07-2015 06:53:56 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-0/.metadata
27-07-2015 06:54:00 PDT foobar INFO - INFO Checksum for node 1 - a6e56d4b5fe7b55e12d108f3f3c4f469
27-07-2015 06:54:02 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-1/.metadata
27-07-2015 06:54:02 PDT foobar INFO - INFO Pushing to cluster URL: tcp://localhost:6666

The other weird thing is that you get "No index or data files found, creating empty files" which shouldn't be the case if BnP executed properly...

Right. Also note (in the server logs) the "No check-sum verification required" message for node 0. I suspect that's because the Hadoop job isn't creating any data for it, or at least isn't balancing the data properly (see the excerpt above, especially the first twenty or so lines).

I was thinking that maybe the partitions weren't set up right. I had node 0 with partitions 0, 1 and node 1 with partitions 2, 3. I tried changing it so that node 0 had partitions 0, 2 and node 1 had partitions 1, 3, and then 52 keys out of 100 went missing. :/

Any ideas what might be causing it and where to look?

FelixGV commented 9 years ago

Hmm... Would you mind pasting the whole BnP log? (Doesn't matter if it's not the same run as the one corresponding to the above server logs... as long as the issue still happens)

As for BDB, everything should be fine if you do bdb.enable=false and slop.enable=false.
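
In other words, a minimal sketch of the relevant server.properties lines (just the two keys named above; everything else in the config stays as-is):

bdb.enable=false
slop.enable=false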

elad commented 9 years ago

Sure, here it is:

27-07-2015 08:56:22 PDT foobar INFO - Starting job foobar at 1438012582984
27-07-2015 08:56:22 PDT foobar INFO - Building java job executor. 
27-07-2015 08:56:22 PDT foobar INFO - HADOOP_HOME not set, using default hadoop config.
27-07-2015 08:56:22 PDT foobar INFO - 1 commands to execute.
27-07-2015 08:56:23 PDT foobar INFO - Command: java  -Xms64M -Xmx256M -cp /home/elad/voldemort/voldemort-repo/build/libs/voldemort-repo-1.9.18-all.jar:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop-0.20-mapreduce/*:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop/client/*:/home/elad/voldemort/voldemort-repo/lib/*:/etc/hadoop/conf:/home/elad/azkaban-solo-2.5.0/plugins/jobtypes/java/azkaban-jobtype-2.5.0.jar:/home/elad/azkaban-solo-2.5.0/lib/azkaban-2.5.0.jar:/home/elad/azkaban-solo-2.5.0/plugins/jobtypes/java/azkaban-hadoopsecuritymanager-2.5.0.jar:/home/elad/azkaban-solo-2.5.0/lib/log4j-1.2.16.jar azkaban.jobtype.JavaJobRunnerMain 
27-07-2015 08:56:23 PDT foobar INFO - Environment variables: {JOB_NAME=foobar, JOB_PROP_FILE=/home/elad/azkaban-solo-2.5.0/executions/95/foobar/foobar_props_6486788353689170462_tmp, JOB_OUTPUT_PROP_FILE=/home/elad/azkaban-solo-2.5.0/executions/95/foobar/foobar_output_1301078859256460580_tmp}
27-07-2015 08:56:23 PDT foobar INFO - Working directory: /home/elad/azkaban-solo-2.5.0/executions/95/foobar
27-07-2015 08:56:33 PDT foobar INFO - INFO Running job foobar
27-07-2015 08:56:33 PDT foobar INFO - INFO Class name voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
27-07-2015 08:56:33 PDT foobar INFO - INFO Constructor found public voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob(java.lang.String,azkaban.utils.Props)
27-07-2015 08:56:33 PDT foobar INFO - INFO Job props.toString(): {azkaban.flow.execid: 95, hadoop.home: , push.store.owners: foobar@example.com, azkaban.flow.start.timezone: America/Los_Angeles, azkaban.flow.start.hour: 08, azkaban.job.attachment.file: /home/elad/azkaban-solo-2.5.0/executions/95/foobar/_job.95.foobar.attach, avro.value.field: html, azkaban.flow.start.second: 22, hadoop.security.manager.class: azkaban.security.HadoopSecurityManager_H_1_0, build.input.path: hdfs://hadoop-cluster:8020/user/elad/sample_data/, build.required.reads: 1, type: java, azkaban.flow.start.year: 2015, push.cluster: tcp://localhost:6666, azkaban.flow.uuid: 027a6616-6ca5-4d3d-8118-c4c4b366c8f4, jobtype.class: azkaban.jobtype.JavaJob, plugin.dir: /home/elad/azkaban-solo-2.5.0/plugins/jobtypes/java, azkaban.flow.flowid: foobar, push.store.name: foobar, azkaban.should.proxy: false, azkaban.job.metadata.file: _job.95.foobar.meta, azkaban.flow.start.minute: 56, push.store.description: foobar, classpath: /home/elad/voldemort/voldemort-repo/build/libs/voldemort-repo-1.9.18-all.jar:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop-0.20-mapreduce/*:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/hadoop/client/*:/home/elad/voldemort/voldemort-repo/lib/*:/etc/hadoop/conf, azkaban.flow.start.milliseconds: 974, build.type.avro: true, build.replication.factor: 1, build.output.dir: hdfs://hadoop-cluster:8020/tmp, azkaban.flow.start.timestamp: 2015-07-27T08:56:22.974-07:00, azkaban.flow.start.month: 07, job.class: voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob, working.dir: /home/elad/azkaban-solo-2.5.0/executions/95/foobar, azkaban.flow.projectversion: 1, azkaban.job.attempt: 0, azkaban.flow.projectid: 113, azkaban.flow.start.day: 27, avro.key.field: name, user.to.proxy: elad, }
27-07-2015 08:56:34 PDT foobar INFO - INFO Client zone-id [-1] Attempting to get raw store [voldsys$_metadata_version_persistence] 
27-07-2015 08:56:34 PDT foobar INFO - INFO Client zone-id [-1] Attempting to get raw store [voldsys$_store_quotas] 
27-07-2015 08:56:34 PDT foobar INFO - INFO voldemort.fetcher.protocol is set to : hftp
27-07-2015 08:56:34 PDT foobar INFO - INFO voldemort.fetcher.port is set to : 50070
27-07-2015 08:56:34 PDT foobar INFO - INFO Build and Push Job constructed for 1 cluster(s).
27-07-2015 08:56:34 PDT foobar INFO - INFO Invoking method run
27-07-2015 08:56:34 PDT foobar INFO - INFO Proxy check failed, not proxying run.
27-07-2015 08:56:34 PDT foobar INFO - INFO Requesting Compression Codec expected by Server
27-07-2015 08:56:34 PDT foobar INFO - INFO Server responded with compression codecs: [ NO_CODEC ]
27-07-2015 08:56:34 PDT foobar INFO - INFO Using no compression
27-07-2015 08:56:35 PDT foobar ERROR - SLF4J: Class path contains multiple SLF4J bindings.
27-07-2015 08:56:35 PDT foobar ERROR - SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
27-07-2015 08:56:35 PDT foobar ERROR - SLF4J: Found binding in [jar:file:/home/elad/voldemort/voldemort-repo/lib/slf4j-log4j12-1.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
27-07-2015 08:56:35 PDT foobar ERROR - SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
27-07-2015 08:56:35 PDT foobar ERROR - SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
27-07-2015 08:56:35 PDT foobar INFO - WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
27-07-2015 08:56:37 PDT foobar INFO - INFO Verifying store: 
27-07-2015 08:56:37 PDT foobar INFO - <store>
27-07-2015 08:56:37 PDT foobar INFO -   <name>foobar</name>
27-07-2015 08:56:37 PDT foobar INFO -   <persistence>read-only</persistence>
27-07-2015 08:56:37 PDT foobar INFO -   <description>foobar</description>
27-07-2015 08:56:37 PDT foobar INFO -   <owners>foobar@example.com</owners>
27-07-2015 08:56:37 PDT foobar INFO -   <routing>client</routing>
27-07-2015 08:56:37 PDT foobar INFO -   <replication-factor>1</replication-factor>
27-07-2015 08:56:37 PDT foobar INFO -   <required-reads>1</required-reads>
27-07-2015 08:56:37 PDT foobar INFO -   <required-writes>1</required-writes>
27-07-2015 08:56:37 PDT foobar INFO -   <key-serializer>
27-07-2015 08:56:37 PDT foobar INFO -       <type>avro-generic</type>
27-07-2015 08:56:37 PDT foobar INFO -       <schema-info>{"type":"string","avro.java.string":"String"}</schema-info>
27-07-2015 08:56:37 PDT foobar INFO -   </key-serializer>
27-07-2015 08:56:37 PDT foobar INFO -   <value-serializer>
27-07-2015 08:56:37 PDT foobar INFO -       <type>avro-generic</type>
27-07-2015 08:56:37 PDT foobar INFO -       <schema-info>{"type":"string","avro.java.string":"String"}</schema-info>
27-07-2015 08:56:37 PDT foobar INFO -   </value-serializer>
27-07-2015 08:56:37 PDT foobar INFO - </store>
27-07-2015 08:56:37 PDT foobar INFO - INFO Getting store definition from: tcp://localhost:6666 (node id 0)
27-07-2015 08:56:37 PDT foobar INFO - INFO Could not find store foobar on Voldemort. Adding it to all nodes 
27-07-2015 08:56:37 PDT foobar INFO - INFO Validating schema for store:  foobar
27-07-2015 08:56:37 PDT foobar INFO - INFO Adding store foobar on node localhost:0
27-07-2015 08:56:38 PDT foobar INFO - INFO Succesfully added foobar on node localhost:0
27-07-2015 08:56:38 PDT foobar INFO - INFO Validating schema for store:  foobar
27-07-2015 08:56:38 PDT foobar INFO - INFO Adding store foobar on node localhost:1
27-07-2015 08:56:39 PDT foobar INFO - INFO Succesfully added foobar on node localhost:1
27-07-2015 08:56:39 PDT foobar INFO - INFO findContainingJar finds url:jar:file:/home/elad/voldemort/voldemort-repo/build/libs/voldemort-repo-1.9.18-all.jar!/voldemort/store/readonly/mr/azkaban/VoldemortStoreBuilderJob.class
27-07-2015 08:56:39 PDT foobar INFO - INFO Setting hadoop jar file for class:class voldemort.store.readonly.mr.azkaban.VoldemortStoreBuilderJob  to /home/elad/voldemort/voldemort-repo/build/libs/voldemort-repo-1.9.18-all.jar
27-07-2015 08:56:39 PDT foobar INFO - INFO *************************************************************************
27-07-2015 08:56:39 PDT foobar INFO - INFO           Running on Real Hadoop Cluster(local)           
27-07-2015 08:56:39 PDT foobar INFO - INFO *************************************************************************
27-07-2015 08:56:39 PDT foobar INFO - INFO HadoopStoreBuilder constructed with numChunks <= 0, thus relying chunk size.
27-07-2015 08:56:39 PDT foobar INFO - INFO dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
27-07-2015 08:56:39 PDT foobar INFO - INFO Data size = 541945, replication factor = 1, numNodes = 2, chunk size = 1073741824
27-07-2015 08:56:39 PDT foobar INFO - INFO Number of chunks: 1, number of reducers: 4, save keys: true, reducerPerBucket: false
27-07-2015 08:56:39 PDT foobar INFO - INFO Building store...
27-07-2015 08:56:39 PDT foobar INFO - INFO session.id is deprecated. Instead, use dfs.metrics.session-id
27-07-2015 08:56:39 PDT foobar INFO - INFO Initializing JVM Metrics with processName=JobTracker, sessionId=
27-07-2015 08:56:39 PDT foobar INFO - WARN Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
27-07-2015 08:56:39 PDT foobar INFO - INFO Total input paths to process : 1
27-07-2015 08:56:40 PDT foobar INFO - INFO OutputCommitter set in config null
27-07-2015 08:56:40 PDT foobar INFO - INFO Running job: job_local321655457_0001
27-07-2015 08:56:40 PDT foobar INFO - INFO OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
27-07-2015 08:56:40 PDT foobar INFO - INFO Waiting for map tasks
27-07-2015 08:56:40 PDT foobar INFO - INFO Starting task: attempt_local321655457_0001_m_000000_0
27-07-2015 08:56:40 PDT foobar INFO - WARN Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
27-07-2015 08:56:40 PDT foobar INFO - INFO setsid exited with exit code 0
27-07-2015 08:56:40 PDT foobar INFO - INFO  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@57e09dbc
27-07-2015 08:56:40 PDT foobar INFO - INFO Processing split: hdfs://hadoop-cluster:8020/user/elad/sample_data/foobar.avro:0+541945
27-07-2015 08:56:40 PDT foobar INFO - WARN Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and  BYTES_READ as counter name instead
27-07-2015 08:56:40 PDT foobar INFO - INFO numReduceTasks: 1
27-07-2015 08:56:40 PDT foobar INFO - INFO Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
27-07-2015 08:56:40 PDT foobar INFO - INFO io.sort.mb = 100
27-07-2015 08:56:40 PDT foobar INFO - INFO data buffer = 79691776/99614720
27-07-2015 08:56:40 PDT foobar INFO - INFO record buffer = 262144/327680
27-07-2015 08:56:41 PDT foobar INFO - INFO Starting flush of map output
27-07-2015 08:56:41 PDT foobar INFO - INFO Finished spill 0
27-07-2015 08:56:41 PDT foobar INFO - INFO Task:attempt_local321655457_0001_m_000000_0 is done. And is in the process of commiting
27-07-2015 08:56:41 PDT foobar INFO - INFO hdfs://hadoop-cluster:8020/user/elad/sample_data/foobar.avro:0+541945
27-07-2015 08:56:41 PDT foobar INFO - INFO Task 'attempt_local321655457_0001_m_000000_0' done.
27-07-2015 08:56:41 PDT foobar INFO - INFO Finishing task: attempt_local321655457_0001_m_000000_0
27-07-2015 08:56:41 PDT foobar INFO - INFO Map task executor complete.
27-07-2015 08:56:41 PDT foobar INFO - WARN Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
27-07-2015 08:56:41 PDT foobar INFO - INFO  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@390c7705
27-07-2015 08:56:41 PDT foobar INFO - INFO 
27-07-2015 08:56:41 PDT foobar INFO - INFO Merging 1 sorted segments
27-07-2015 08:56:41 PDT foobar INFO - INFO Down to the last merge-pass, with 1 segments left of total size: 3938021 bytes
27-07-2015 08:56:41 PDT foobar INFO - INFO 
27-07-2015 08:56:41 PDT foobar INFO - INFO  map 100% reduce 0%
27-07-2015 08:56:42 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819/foobar.attempt_local321655457_0001_r_000000_0.index
27-07-2015 08:56:42 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819/foobar.attempt_local321655457_0001_r_000000_0.data
27-07-2015 08:56:42 PDT foobar INFO - INFO Opening hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819/foobar.attempt_local321655457_0001_r_000000_0.index and hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819/foobar.attempt_local321655457_0001_r_000000_0.data for writing.
27-07-2015 08:56:46 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-1
27-07-2015 08:56:47 PDT foobar INFO - INFO reduce > reduce
27-07-2015 08:56:47 PDT foobar INFO - INFO  map 100% reduce 100%
27-07-2015 08:56:49 PDT foobar INFO - INFO Moving hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819/foobar.attempt_local321655457_0001_r_000000_0.index to hdfs://hadoop-cluster:8020/tmp/localhost/node-1/3_0_0.index
27-07-2015 08:56:50 PDT foobar INFO - INFO Moving hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819/foobar.attempt_local321655457_0001_r_000000_0.data to hdfs://hadoop-cluster:8020/tmp/localhost/node-1/3_0_0.data
27-07-2015 08:56:50 PDT foobar INFO - INFO Task:attempt_local321655457_0001_r_000000_0 is done. And is in the process of commiting
27-07-2015 08:56:50 PDT foobar INFO - INFO reduce > reduce
27-07-2015 08:56:50 PDT foobar INFO - INFO Task attempt_local321655457_0001_r_000000_0 is allowed to commit now
27-07-2015 08:56:50 PDT foobar INFO - INFO Saved output of task 'attempt_local321655457_0001_r_000000_0' to hdfs://hadoop-cluster:8020/tmp/vold-build-and-push-447760949625925819
27-07-2015 08:56:50 PDT foobar INFO - INFO reduce > reduce
27-07-2015 08:56:50 PDT foobar INFO - INFO Task 'attempt_local321655457_0001_r_000000_0' done.
27-07-2015 08:56:51 PDT foobar INFO - INFO Job complete: job_local321655457_0001
27-07-2015 08:56:51 PDT foobar INFO - INFO Counters: 26
27-07-2015 08:56:51 PDT foobar INFO - INFO   File System Counters
27-07-2015 08:56:51 PDT foobar INFO - INFO     FILE: Number of bytes read=19400411
27-07-2015 08:56:51 PDT foobar INFO - INFO     FILE: Number of bytes written=23723064
27-07-2015 08:56:51 PDT foobar INFO - INFO     FILE: Number of read operations=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     FILE: Number of large read operations=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     FILE: Number of write operations=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     HDFS: Number of bytes read=1211906
27-07-2015 08:56:51 PDT foobar INFO - INFO     HDFS: Number of bytes written=3937154
27-07-2015 08:56:51 PDT foobar INFO - INFO     HDFS: Number of read operations=34
27-07-2015 08:56:51 PDT foobar INFO - INFO     HDFS: Number of large read operations=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     HDFS: Number of write operations=18
27-07-2015 08:56:51 PDT foobar INFO - INFO   Map-Reduce Framework
27-07-2015 08:56:51 PDT foobar INFO - INFO     Map input records=100
27-07-2015 08:56:51 PDT foobar INFO - INFO     Map output records=100
27-07-2015 08:56:51 PDT foobar INFO - INFO     Map output bytes=3937610
27-07-2015 08:56:51 PDT foobar INFO - INFO     Input split bytes=143
27-07-2015 08:56:51 PDT foobar INFO - INFO     Combine input records=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Combine output records=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Reduce input groups=100
27-07-2015 08:56:51 PDT foobar INFO - INFO     Reduce shuffle bytes=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Reduce input records=100
27-07-2015 08:56:51 PDT foobar INFO - INFO     Reduce output records=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Spilled Records=200
27-07-2015 08:56:51 PDT foobar INFO - INFO     CPU time spent (ms)=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Physical memory (bytes) snapshot=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Virtual memory (bytes) snapshot=0
27-07-2015 08:56:51 PDT foobar INFO - INFO     Total committed heap usage (bytes)=361103360
27-07-2015 08:56:51 PDT foobar INFO - INFO   File Input Format Counters 
27-07-2015 08:56:51 PDT foobar INFO - INFO     Bytes Read=541643
27-07-2015 08:56:51 PDT foobar INFO - INFO Number of collisions in the job - 0
27-07-2015 08:56:51 PDT foobar INFO - INFO Maximum number of collisions for one entry - 0
27-07-2015 08:56:51 PDT foobar INFO - INFO No data generated for node 0. Generating empty folder
27-07-2015 08:56:51 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-0
27-07-2015 08:56:51 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-0/.metadata
27-07-2015 08:56:54 PDT foobar INFO - INFO Checksum for node 1 - a6e56d4b5fe7b55e12d108f3f3c4f469
27-07-2015 08:56:54 PDT foobar INFO - INFO Setting permission to 755 for hdfs://hadoop-cluster:8020/tmp/localhost/node-1/.metadata
27-07-2015 08:56:54 PDT foobar INFO - INFO Pushing to cluster URL: tcp://localhost:6666
27-07-2015 08:56:54 PDT foobar INFO - INFO StorePushTask.call() invoked for URL: tcp://localhost:6666
27-07-2015 08:56:54 PDT foobar INFO - WARN The server requested pushHighAvailability to be DISABLED on cluster: tcp://localhost:6666
27-07-2015 08:56:54 PDT foobar INFO - INFO Push starting for cluster: tcp://localhost:6666
27-07-2015 08:56:54 PDT foobar INFO - INFO tcp://localhost:6666 : Existing protocol = hdfs and port = 8020
27-07-2015 08:56:54 PDT foobar INFO - INFO tcp://localhost:6666 : New protocol = hftp and port = 50070
27-07-2015 08:56:54 PDT foobar INFO - INFO Client zone-id [-1] Attempting to get raw store [voldsys$_metadata_version_persistence] 
27-07-2015 08:56:54 PDT foobar INFO - INFO Client zone-id [-1] Attempting to get raw store [voldsys$_store_quotas] 
27-07-2015 08:56:54 PDT foobar INFO - INFO tcp://localhost:6666 : Initiating swap of foobar with dataDir: hdfs://hadoop-cluster:8020/tmp/localhost
27-07-2015 08:56:54 PDT foobar INFO - INFO tcp://localhost:6666 : Invoking fetch for node localhost [id 0] for hftp://hadoop-cluster:50070/tmp/localhost/node-0
27-07-2015 08:56:54 PDT foobar INFO - INFO tcp://localhost:6666 : Invoking fetch for node localhost [id 1] for hftp://hadoop-cluster:50070/tmp/localhost/node-1
27-07-2015 08:56:54 PDT foobar INFO - INFO node localhost [id 1] : AsyncOperationStatus(task id = 0, description = Fetch store, complete = false, status = 0 MB copied at 0 MB/sec - 0 % complete)
27-07-2015 08:56:54 PDT foobar INFO - INFO node localhost [id 0] : AsyncOperationStatus(task id = 0, description = Fetch store, complete = false, status = 0 MB copied at 0 MB/sec - 0 % complete)
27-07-2015 08:56:56 PDT foobar INFO - INFO node localhost [id 1] : AsyncOperationStatus(task id = 0, description = Fetch store, complete = true, status = /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1)
27-07-2015 08:56:56 PDT foobar INFO - INFO node localhost [id 1] : AsyncOperationStatus(task id = 0, description = Fetch store, complete = true, status = /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1)
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Fetch succeeded on node localhost [id 1]
27-07-2015 08:56:56 PDT foobar INFO - INFO node localhost [id 0] : AsyncOperationStatus(task id = 0, description = Fetch store, complete = true, status = /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1)
27-07-2015 08:56:56 PDT foobar INFO - INFO node localhost [id 0] : AsyncOperationStatus(task id = 0, description = Fetch store, complete = true, status = /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1)
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Fetch succeeded on node localhost [id 0]
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Attempting swap for node localhost [id 0], dir = /home/elad/voldemort/two_node_cluster/node_0/data/read-only/foobar/version-1
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Swap succeeded for node localhost [id 0]
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Attempting swap for node localhost [id 1], dir = /home/elad/voldemort/two_node_cluster/node_1/data/read-only/foobar/version-1
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Swap succeeded for node localhost [id 1]
27-07-2015 08:56:56 PDT foobar INFO - INFO tcp://localhost:6666 : Swap complete.
27-07-2015 08:56:56 PDT foobar INFO - INFO StorePushTask.call() finished for URL: tcp://localhost:6666
27-07-2015 08:56:56 PDT foobar INFO - INFO Successfully pushed to URL: tcp://localhost:6666
27-07-2015 08:56:56 PDT foobar INFO - INFO Informing about delete start ...hdfs://hadoop-cluster:8020/tmp/localhost
27-07-2015 08:56:57 PDT foobar INFO - INFO Deleted hdfs://hadoop-cluster:8020/tmp/localhost
27-07-2015 08:56:57 PDT foobar INFO - INFO Closing AdminClient with mainBootstrapUrl: tcp://localhost:6666
27-07-2015 08:56:57 PDT foobar INFO - INFO Outputting generated properties to /home/elad/azkaban-solo-2.5.0/executions/95/foobar/foobar_output_1301078859256460580_tmp
27-07-2015 08:56:57 PDT foobar INFO - Process completed successfully in 34 seconds.
27-07-2015 08:56:57 PDT foobar INFO - Finishing job foobar at 1438012617823 with status SUCCEEDED
elad commented 9 years ago

By the way, I just tried setting up three-node and four-node Voldemort clusters and got the same results: only one node (node 1 in the three-node cluster, node 0 in the four-node cluster) gets data; no data is generated for the other nodes. Even if I bump the replication factor to equal the number of nodes, no data is generated for them.

One thing that may be related is this log line:

Setting permission to 755 for hdfs://hadoop-cluster:8020/user/elad/tmp/localhost/node-0

which I assume sets up permissions for the per-node data directory. It is printed before the summary for the node that receives the data, but after the summaries for all the other nodes. So maybe the Hadoop job can't even write the data for the other nodes and simply doesn't complain...?

Edit: Another data point: I was wrong in my earlier assessment that a single-node cluster worked. A single-node cluster with one or two partitions works; however, with three or more partitions I start losing massive amounts of data. So unless I'm doing something very wrong, there might be a problem with hashing keys and dividing them into partitions.

FelixGV commented 9 years ago

You might want to set build.output.keep=true if you want to see what's going on in the build phase of the job. This way it won't delete the output directory at the end.

elad commented 9 years ago

I think I got it. Look at this:

27-07-2015 08:56:39 PDT foobar INFO - INFO *************************************************************************
27-07-2015 08:56:39 PDT foobar INFO - INFO           Running on Real Hadoop Cluster(local)           
27-07-2015 08:56:39 PDT foobar INFO - INFO *************************************************************************
[...]
27-07-2015 08:56:39 PDT foobar INFO - INFO Number of chunks: 1, number of reducers: 4, save keys: true, reducerPerBucket: false
[...]
27-07-2015 08:56:40 PDT foobar INFO - INFO numReduceTasks: 1

It looks like Hadoop is running in local mode for some reason, which, among other things, forces the number of reducers to be 1. I'm pretty sure that if I can fix that, everything will work.

How do I do that? I set hadoop.home to /etc/hadoop both in the job file and in Azkaban's properties, but it still thinks it's running in local mode. The jobs don't show up in the job tracker or history either. What should I do to get Azkaban to talk to my Hadoop cluster? :)
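
(An assumption on my part, not something from the logs: in Hadoop 2.x the client decides between local and cluster execution based on mapreduce.framework.name in whichever mapred-site.xml the submitting JVM actually loads, e.g.

mapreduce.framework.name=yarn

with the default being local - so presumably the cluster-side client config isn't being picked up here.)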

elad commented 9 years ago

Following up--

I'm now trying to run build-and-push without Azkaban.

First, I had to modify run-bnp.sh to explicitly call hadoop ... to get it to run on the cluster. Without that change, it insisted on running locally despite being given /etc/hadoop as HADOOP_HOME. Then I changed CLASSPATH to HADOOP_CLASSPATH.
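
Roughly, the change amounts to submitting the job through the hadoop wrapper script, something like the following (the jar path, the single properties-file argument, and the assumption that the runner lives in the same package as VoldemortBuildAndPushJob are all mine):

hadoop jar build/libs/voldemort-repo-1.9.18-all.jar \
    voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJobRunner \
    foobar-bnp.properties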

Now I'm running into issues that apparently stem from some jars not being found on the remote machines. I couldn't figure out how to pass -libjars (to populate the distributed cache) because VoldemortBuildAndPushJobRunner.java doesn't use ToolRunner (it's implemented as an Azkaban job...), so whenever I try to pass -libjars it handles the option incorrectly. :/

So even though the jars are all accessible from all of the machines (they're NFS-mounted), I get errors like this one:

Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 17 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at voldemort.org.apache.avro.mapred.HadoopMapper.configure(HadoopMapper.java:43)
    ... 22 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 25 more
Caused by: java.lang.NoClassDefFoundError: org/jdom/Content
    at voldemort.store.readonly.mr.AvroStoreBuilderMapper.configure(AvroStoreBuilderMapper.java:208)
    ... 30 more
Caused by: java.lang.ClassNotFoundException: org.jdom.Content
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 31 more

Any clues on what I might be doing wrong?

FelixGV commented 9 years ago

Hmm...

Thanks for following up on the run-bnp.sh script. I wouldn't have suspected that Azkaban did anything special to make the job run properly on the cluster (rather than in local mode...).

Ideally, I think we should strive to get to a point where we can build a fat jar that has all the required dependencies, protobuf and avro shaded to Voldemort's versions, and doesn't require any special setup on the Hadoop cluster.

Can you inspect if the jar you're generating has jdom in it?

-F

FelixGV commented 9 years ago

The BnP job does have some code to add stuff to the distributed cache, BTW. It can be controlled by setting the hadoop.external.jarFiles config to a comma-separated list of HDFS paths.
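
For example (the HDFS paths are placeholders):

hadoop.external.jarFiles=hdfs:///voldemort/jars/jdom-1.1.jar,hdfs:///voldemort/jars/other-dep.jar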

Using the distributed cache may be more optimal in the long run, since there are fewer jars to spool around on each subsequent run, but it goes against the goal of running out of the box with as little Hadoop-side setup as possible. Nonetheless, if it can unblock you for now, that's cool.

-F

elad commented 9 years ago

Thanks for following up on the run-bnp.sh script. I wouldn't have suspected that Azkaban did anything special to make the job run properly on the cluster (rather than in local mode...).

What I meant is that many Hadoop jobs are implemented using ToolRunner. That lets you pass flags both to the program itself (like a configuration file) and to Hadoop (like -D flags). Is that possible with BnP? Every combination I tried insisted on parsing everything as arguments to the program itself.
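
To illustrate what I mean, here's a rough, untested sketch of a Tool-style wrapper (the class itself, the job name string, and constructing Props from a plain Properties file are all my assumptions; the (String, Props) constructor is the one shown in the logs above):

import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import azkaban.utils.Props;
import voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob;

public class VoldemortBuildAndPushTool extends Configured implements Tool {

    // ToolRunner/GenericOptionsParser consumes generic options (-D, -libjars, ...)
    // before run() is invoked, so args[0] is just the BnP job properties file.
    @Override
    public int run(String[] args) throws Exception {
        Properties javaProps = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            javaProps.load(in);
        }
        Props props = new Props(null, javaProps);
        new VoldemortBuildAndPushJob("build-and-push", props).run();
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new VoldemortBuildAndPushTool(), args));
    }
}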

Ideally, I think we should strive to get to a point where we can build a fat jar that has all the required dependencies, protobuf and avro shaded to Voldemort's versions, and doesn't require any special setup on the Hadoop cluster.

Can you inspect if the jar you're generating has jdom in it?

It doesn't. I only shade protobuf (to 2.3.0) and avro (to 1.4.0). Everything else should be provided by the cluster. I think that makes sense because that's how other Hadoop jobs run - some are in fat jars, others use libraries provided by the cluster, and some shade only a few jars.

The BnP job does have some code to add stuff to the distributed cache, BTW. It can be controlled by setting the hadoop.external.jarFiles config to a comma-separated list of HDFS paths.

That's good to know; I missed that option. I'll give it a try!

Using the distributed cache may be more optimal in the long run, since there are fewer jars to spool around on each subsequent run, but it goes against the goal of running out of the box with as little Hadoop-side setup as possible. Nonetheless, if it can unblock you for now, that's cool.

So far it's anything but running out of the box ;) but I hope that once I get things working (in a non-LinkedIn environment) we can figure out the minimum necessary steps and document them for others.

(For reference, I've had a chance to work with another tool (hdfs2cass) that loaded data into Cassandra. Other than the outdated dependencies it was relatively easy to get going.)

Anyway - I'll try using hadoop.external.jarFiles and report back.

FelixGV commented 9 years ago

Yeah, I know Voldemort RO is not straightforward to get up and running... it seems like for a while a lot of stuff went into it that didn't get much testing outside of LinkedIn, which is unfortunate. It'd be nice to whip it back into shape so it's easy to get up and running with minimal setup (no Azkaban, etc.).

If you think it'd help to change my naive VoldemortBuildAndPushJobRunner implementation so that it uses ToolRunner instead, by all means, go for it (:

-F

elad commented 9 years ago

Yeah, I know Voldemort RO is not straightforward to get up and running... it seems like for a while a lot of stuff went into it that didn't get much testing outside of LinkedIn, which is unfortunate. It'd be nice to whip it back into shape so it's easy to get up and running with minimal setup (no Azkaban, etc.).

I'm more than happy to help with that, especially if LinkedIn is interested in having others use it as well. I hope it is - after all, it was open sourced. :)

If you think it'd help to change my naive VoldemortBuildAndPushJobRunner implementation so that it uses ToolRunner instead, by all means, go for it (:

I'm already planning on it.

Thanks for being so helpful and responsive; I definitely feel things are moving in the right direction!

elad commented 9 years ago

Hey @FelixGV,

I got it to work without hadoop.external.jarFiles or implementing BnP as a Tool. What I did instead was add jdom to the shaded jar:

shadowJar {
    from sourceSets.main.output, sourceSets.contrib.output, sourceSets.test.output, sourceSets.main.resources

    // Required when working in a Hadoop 2.x environment
    dependencies {
        include(dependency('org.apache.avro:avro:1.4.0'))
        include(dependency('com.google.protobuf:protobuf-java:2.3.0'))
        include(dependency('org.jdom:jdom:1.1'))
    }
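    // Relocate the shaded packages so they can't clash with the versions
    // already provided by the Hadoop cluster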
    relocate 'com.google.protobuf', 'voldemort.com.google.protobuf'
    relocate 'org.apache.avro', 'voldemort.org.apache.avro'
}
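
(With the Gradle Shadow plugin applied as usual, running ./gradlew shadowJar then produces the single -all.jar referenced in the classpaths earlier in this thread; that's standard Shadow plugin behavior rather than anything Voldemort-specific.)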

(I figured that adding another moving part by requiring some files to be on HDFS outside VOLDEMORT_HOME wasn't a good idea, and reimplementing BnP as a Tool, though important IMHO, is beyond the scope of what I'm interested in doing at the moment - evaluating Voldemort.)

Now BnP works without Azkaban. :) Do you want to add the necessary bits to make it work? There are changes to the build script, your class, and the run-bnp.sh script, plus maybe an example config file and instructions. I'll be more than happy to submit a pull request and perhaps write some up-to-date text on how to get it up and running.

arunthirupathi commented 9 years ago

Elad, we would love to get a pull request with the changes documented. It would help future open source users understand the possible issues. When time permits, we would also like to review the changes you made and explore whether they can be committed without breaking backward compatibility.

Though we can't promise to merge it, it would definitely be very valuable help for future users :)

Please let us know if you have different thoughts or opinions; we are open to suggestions.

Thanks, Arun.

FelixGV commented 9 years ago

Yeah, please send the PR in @elad, I'll review it (:

Awesome work!

-F

elad commented 9 years ago

Hey @arunthirupathi & @FelixGV, I just submitted a pull request (#292) with the changes required to make it work. Please take a look; the BnP runner scripts are new, so they shouldn't break anything, but the changes to build.gradle could use a review just to be on the safe side.

Next up is a short writeup with up-to-date examples on how to use it, and when I figure out how to rebalance without losing data I'll update those examples too.

FelixGV commented 9 years ago

That's awesome! I'll check it out. Hopefully today, but definitely some time this week.

FYI, I've been working on PR #289, which is unrelated to your stuff but which I'll probably merge soon if I don't see any issues in my test environment over the next couple of days. It upgrades the Hadoop lib to a modern version that contains many Kerberos-related fixes.

FelixGV commented 9 years ago

All right, I think this issue can be closed now... The shading build changes and BnP run script are merged into master and released as part of version 1.9.19.

Feel free to open new issues if you encounter any problem (:

-F

elad commented 9 years ago

Hey @FelixGV, sorry for the long silence - I got flooded with other, unrelated work just when I thought I'd have a chance to test things out with RF>1 and your patches. But I haven't forgotten; I'll definitely follow up where appropriate, and you can count on issues (and pull requests) coming your way if I find any problems! :) Thanks!

FelixGV commented 9 years ago

All right great (:

No worries about the silence. Happens even to the best of us ;)

Looking forward to hearing about your experimentations!

-F