Closed: joshj1806 closed this issue 6 years ago
Hi, sorry about my late reply!
You can of course use another fencing method and set up SSH private keys. However, the default is the "dummy" /bin/true fencing method, which should work fine and rarely gets changed.
I have tested the example-hdp-ha-3-masters-with-storm-kafka
blueprint before and it worked fine with the default fencing method, so I find it strange that you got this error; it should not happen.
Can you give me more information about your configuration (essentially all the changes you made in the all
file)? There could be a strange combination of settings that triggers this issue.
Many thanks!
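For reference, switching away from the default /bin/true shell fencer would mean setting the hdfs-site fencing properties in the blueprint template to something like this (the private key path is only an illustration):

```json
"dfs.ha.fencing.methods" : "sshfence",
"dfs.ha.fencing.ssh.private-key-files" : "/root/.ssh/id_rsa"
```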
---
###########################
## cluster configuration ##
###########################
cluster_name: 'TEST'
ambari_version: '2.6.2.2' # must be the 4-part full version number
hdp_version: '2.6.5.0' # must be the 4-part full version number
hdp_build_number: 'auto' # the HDP build number from docs.hortonworks.com (if set to 'auto', Ansible will try to get it from the repository)
hdf_version: '3.1.2.0' # must be the 4-part full version number
hdf_build_number: 'auto' # the HDF build number from docs.hortonworks.com (if set to 'auto', Ansible will try to get it from the repository)
hdpsearch_version: '3.0.0' # must be the full version number
hdpsearch_build_number: '100' # the HDP Search build number from docs.hortonworks.com (hardcoded to 100 for the moment)
repo_base_url: 'http://public-repo-1.hortonworks.com' # change this if using a Local Repository
###########################
## general configuration ##
###########################
external_dns: no # set to yes to use the existing DNS (when no, it will update the /etc/hosts file - must be set to 'no' when using Azure)
disable_firewall: yes # set to yes to disable the existing local firewall service (iptables, firewalld, ufw)
########################
## java configuration ##
########################
java: 'embedded' # can be set to 'embedded', 'openjdk' or 'oraclejdk'
oraclejdk_options: # only used when java is set to 'oraclejdk'
  base_folder: '/usr/java' # the folder where the Java package should be unpacked to
  tarball_location: '/tmp/jdk-8u171-linux-x64.tar.gz' # the location of the tarball on the remote system or on the Ansible controller
  jce_location: '/tmp/jce_policy-8.zip' # the location of the JCE package on the remote system or on the Ansible controller
  remote_files: no # set to yes to indicate the files are already on the remote systems, otherwise they will be copied by Ansible from the Ansible controller
############################
## database configuration ##
############################
database: 'embedded' # can be set to 'embedded', 'postgres', 'mysql' or 'mariadb'
database_options:
  external_hostname: '' # if this is empty, Ansible will install and prepare the databases on the ambari-server node
  ambari_db_name: 'ambari'
  ambari_db_username: 'ambari'
  ambari_db_password: 'bigdata'
  hive_db_name: 'hive'
  hive_db_username: 'hive'
  hive_db_password: 'hive'
  oozie_db_name: 'oozie'
  oozie_db_username: 'oozie'
  oozie_db_password: 'oozie'
  druid_db_name: 'druid'
  druid_db_username: 'druid'
  druid_db_password: 'druid'
  superset_db_name: 'superset'
  superset_db_username: 'superset'
  superset_db_password: 'superset'
  rangeradmin_db_name: 'ranger'
  rangeradmin_db_username: 'ranger'
  rangeradmin_db_password: 'ranger'
  rangerkms_db_name: 'rangerkms'
  rangerkms_db_username: 'rangerkms'
  rangerkms_db_password: 'rangerkms'
  registry_db_name: 'registry'
  registry_db_username: 'registry'
  registry_db_password: 'registry'
  streamline_db_name: 'streamline'
  streamline_db_username: 'streamline'
  streamline_db_password: 'streamline'
#####################################
## kerberos security configuration ## # useful if blueprint is dynamic, but can also be used to deploy the MIT KDC
#####################################
security: 'none' # can be set to 'none', 'mit-kdc' or 'active-directory'
security_options:
  external_hostname: '' # if this is empty, Ansible will install and prepare the MIT KDC on the Ambari node
  realm: 'EXAMPLE.COM'
  admin_principal: 'admin' # the Kerberos principal that has the permissions to create new users (don't append the realm)
  admin_password: "{{ default_password }}"
  kdc_master_key: "{{ default_password }}" # only used when security is set to 'mit-kdc'
  ldap_url: 'ldaps://ad.example.com:636' # only used when security is set to 'active-directory'
  container_dn: 'OU=hadoop,DC=example,DC=com' # only used when security is set to 'active-directory'
  http_authentication: yes # set to yes to enable HTTP authentication (SPNEGO)
##########################
## ranger configuration ## # only useful if blueprint is dynamic
##########################
ranger_options: # only used if RANGER_ADMIN is part of the blueprint stack
  enable_plugins: yes # set to 'yes' if the plugins should be enabled for all of the installed services
ranger_security_options: # only used if RANGER_ADMIN is part of the blueprint stack
  ranger_admin_password: "{{ default_password }}" # the password for the Ranger admin users (both admin and amb_ranger_admin)
  ranger_keyadmin_password: "{{ default_password }}" # the password for the Ranger keyadmin user (will only be set in HDP3, in HDP2 it will remain the default keyadmin)
  kms_master_key_password: "{{ default_password }}" # password used for encrypting the Master Key
##################################
## other security configuration ## # only useful if blueprint is dynamic
##################################
ambari_admin_password: 'admin' # the password for the Ambari admin user
default_password: 'AsdQwe123456' # a default password for all required passwords which are not specified in the blueprint
atlas_security_options:
  admin_password: "{{ default_password }}" # the password for the Atlas admin user
knox_security_options:
  master_secret: "{{ default_password }}" # Knox Master Secret
nifi_security_options:
  encrypt_password: "{{ default_password }}" # the password used to encrypt raw configuration values
  sensitive_props_key: "{{ default_password }}" # the password used to encrypt any sensitive property values that are configured in processors
superset_security_options:
  secret_key: "{{ default_password }}"
  admin_password: "{{ default_password }}" # the password for the Superset admin user
smartsense_security_options:
  admin_password: "{{ default_password }}" # password for the Activity Explorer's Zeppelin admin user
logsearch_security_options:
  admin_password: "{{ default_password }}" # the password for the Logsearch admin user
##########################
## ambari configuration ##
##########################
ambari_admin_user: 'admin'
ambari_admin_default_password: 'admin' # no need to change this (unless the Ambari default changes)
config_recommendation_strategy: 'NEVER_APPLY' # choose between 'NEVER_APPLY', 'ONLY_STACK_DEFAULTS_APPLY', 'ALWAYS_APPLY', 'ALWAYS_APPLY_DONT_OVERRIDE_CUSTOM_VALUES'
smartsense: # Hortonworks subscription details (can be left empty if there is no subscription)
  id: ''
  account_name: ''
  customer_email: ''
wait: true # wait for the cluster to finish installing
wait_timeout: 3600 # 60 minutes
accept_gpl: yes # set to yes to allow Ambari to install GPL licensed libraries
cluster_template_file: 'cluster_template.j2' # the cluster creation template file
#############################
## blueprint configuration ##
#############################
blueprint_name: '{{ cluster_name }}_blueprint' # the name of the blueprint as it will be stored in Ambari
blueprint_file: 'blueprint_dynamic.j2' # the blueprint JSON file - 'blueprint_dynamic.j2' is a Jinja2 template that generates the required JSON
blueprint_dynamic: # properties for the dynamic blueprint - these are only used by the 'blueprint_dynamic.j2' template to generate the JSON
  - host_group: "management"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'SLIDER', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'HCAT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - ZOOKEEPER_SERVER
      - JOURNALNODE
      - AMBARI_SERVER
      - INFRA_SOLR
      - ZEPPELIN_MASTER
      - APP_TIMELINE_SERVER
      - SPARK2_JOBHISTORYSERVER
      - HISTORYSERVER
      - HST_SERVER
      - HST_AGENT
      - METRICS_COLLECTOR
      - METRICS_GRAFANA
      - METRICS_MONITOR
  - host_group: "namenode01"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'SLIDER', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'HCAT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - ZOOKEEPER_SERVER
      - NAMENODE
      - ZKFC
      - JOURNALNODE
      - RESOURCEMANAGER
      - HIVE_SERVER
      - HIVE_METASTORE
      - NIMBUS
      - DRPC_SERVER
      - STORM_UI_SERVER
      - HST_AGENT
      - METRICS_MONITOR
  - host_group: "namenode02"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'SLIDER', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'HCAT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - ZOOKEEPER_SERVER
      - NAMENODE
      - ZKFC
      - JOURNALNODE
      - RESOURCEMANAGER
      - HIVE_SERVER
      - HIVE_METASTORE
      - WEBHCAT_SERVER
      - HST_AGENT
      - METRICS_MONITOR
  - host_group: "datanode01"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'SLIDER', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'HCAT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - DATANODE
      - NODEMANAGER
      - HST_AGENT
      - METRICS_MONITOR
  - host_group: "datanode02"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'SLIDER', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'HCAT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - KAFKA_BROKER
      - DATANODE
      - NODEMANAGER
      - SUPERVISOR
      - HST_AGENT
      - METRICS_MONITOR
############################
## helper variables ## # don't change these unless there is a good reason
############################
hdp_minor_version: "{{ hdp_version | regex_replace('.[0-9]+.[0-9]+[0-9_-]*$','') }}"
hdp_major_version: "{{ hdp_minor_version.split('.').0 }}"
hdf_minor_version: "{{ hdf_version | regex_replace('.[0-9]+.[0-9]+[0-9_-]*$','') }}"
hdf_major_version: "{{ hdf_minor_version.split('.').0 }}"
utils_version: "{{ '1.1.0.20' if hdp_minor_version is version_compare('2.5', '<') else ('1.1.0.21' if hdp_version is version_compare('2.6.4', '<') else '1.1.0.22' ) }}"
hdfs_ha_name: "{{ cluster_name | regex_replace('_','-') }}"
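For reference, with the versions set at the top of this file, those helper expressions should evaluate roughly as follows:

```yaml
# hdp_version '2.6.5.0' -> hdp_minor_version: '2.6', hdp_major_version: '2'
# hdf_version '3.1.2.0' -> hdf_minor_version: '3.1', hdf_major_version: '3'
# utils_version: '1.1.0.22'   (2.6 is not < 2.5 and 2.6.5.0 is not < 2.6.4)
# hdfs_ha_name: 'TEST'        (cluster_name contains no underscores to replace)
```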
I installed a 5-node cluster.
Thanks,
Hi, I've just tested with your all
file and it works fine (default CentOS 7 AMI on AWS).
So my guess is that it's an environmental issue (probably /bin/true
doesn't exist or it's not accessible to the hdfs user, which would be very strange).
Are you using any standard AMIs from the clouds, or is this a deployment on a custom OS image? Can you check if su - hdfs -c /bin/true; echo $?
works and shows 0?
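A quick way to run that check on each NameNode host might look like this (just a sketch, run as root):

```sh
# confirm the binary exists and that the hdfs user can execute it
ls -l /bin/true
su - hdfs -c /bin/true; echo $?   # should print 0
```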
Thanks for testing!
I tested on my own cluster with CentOS 7 images.
Yes, su - hdfs -c /bin/true; echo $?
prints 0.
I checked this on both of my NameNodes as well as on the host that runs the Ansible script.
I suspect that somehow your NameNodes cannot communicate with each other to retrieve the namespace.
I tried to use this example file to deploy a high-availability HDFS cluster: https://github.com/hortonworks/ansible-hortonworks/blob/master/playbooks/group_vars/example-hdp-ha-3-masters-with-storm-kafka , but it didn't work.
I got the error message:
Unable to fetch namespace information from active NN
After I changed https://github.com/hortonworks/ansible-hortonworks/blob/35785c7e88d069a920e581c90192da181e1496fe/playbooks/roles/ambari-blueprint/templates/blueprint_dynamic.j2#L443 to
"dfs.ha.fencing.methods" : "sshfence",
added "dfs.ha.fencing.ssh.private-key-files" : "/root/.ssh/id_rsa" on the line below,
and set up SSH key access between the two NameNodes (Active and Standby), it started working.
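For anyone hitting the same error, the key exchange between the two NameNodes could be done roughly like this (the hostname is a placeholder; the key path follows the example above):

```sh
# on the first NameNode, as root (the key is read from /root/.ssh/id_rsa)
ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa   # skip if the key already exists
ssh-copy-id root@namenode02                    # authorize the key on the other NameNode
# then repeat from the second NameNode towards the first one
```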