iterative / dvc-hdfs

HDFS/WebHDFS plugin for dvc
https://dvc.org/doc/user-guide/data-management/remote-storage/hdfs
Apache License 2.0

DVC tool to push data to HDFS #18

Open subodhdere opened 1 year ago

subodhdere commented 1 year ago

Bug Report

Issue name

DVC tool to push data to HDFS

dvc push -r myremote -v

Description

Getting the below error while pushing data to HDFS.

singhab@jupyter-singhab-jupyter:~/dvc-example$ dvc push -r myremote -v
2023-06-20 07:12:42,730 DEBUG: v2.58.2 (conda), CPython 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
2023-06-20 07:12:42,730 DEBUG: command: /opt/conda/bin/dvc push -r myremote -v
2023-06-20 07:12:43,153 DEBUG: Preparing to transfer data from '/home/singhab/dvc-example/.dvc/cache' to '/user/halo/dvc-usecase'
2023-06-20 07:12:43,153 DEBUG: Preparing to collect status from '/user/halo/dvc-usecase'
2023-06-20 07:12:43,154 DEBUG: Collecting status from '/user/halo/dvc-usecase'
2023-06-20 07:12:43,155 DEBUG: Querying 1 oids via object_exists                                                                                    
  0% Checking cache in '/user/halo/dvc-usecase'|                                                                         |0/? [00:00<?,    ?files/s]loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://<name-node>, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_1000, userName=singhab@HALO-TELEKOM.COM) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
/arrow/cpp/src/arrow/status.cc:137: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed. Detail: [errno 9] Bad file descriptor
2023-06-20 07:12:43,435 ERROR: unexpected error - HDFS connection failed                                                                            
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()

Reproduce

dvc push -r myremote -v

Expected

Data should be pushed to HDFS.

Output of dvc doctor:

$ dvc doctor

Additional Information (if any):

efiop commented 1 year ago

Hi @subodhdere, please post the output of dvc doctor.

Also, the log you've posted seems to be partial. Please post the full log.

subodhdere commented 1 year ago

Hello, please see below the output of dvc doctor.

singhab@jupyter-singhab-jupyter:~$ dvc doctor
DVC version: 2.58.2 (conda)
---------------------------
Platform: Python 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.22.0
        dvc_render = 0.5.3
        dvc_task = 0.2.1
        scmrepo = 1.0.3
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Config:
        Global: /home/singhab/.config/dvc
        System: /etc/xdg/dvc
subodhdere commented 1 year ago

Adding the full logs.

singhab@jupyter-singhab-jupyter:~/dvc-example$ dvc push -r myremote -v
2023-06-29 10:42:46,833 DEBUG: v2.58.2 (conda), CPython 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
2023-06-29 10:42:46,833 DEBUG: command: /opt/conda/bin/dvc push -r myremote -v
2023-06-29 10:42:47,284 DEBUG: Preparing to transfer data from '/home/singhab/dvc-example/.dvc/cache' to '/user/halo/dvc-usecase'
2023-06-29 10:42:47,284 DEBUG: Preparing to collect status from '/user/halo/dvc-usecase'
2023-06-29 10:42:47,284 DEBUG: Collecting status from '/user/halo/dvc-usecase'
2023-06-29 10:42:47,285 DEBUG: Querying 1 oids via object_exists                                                                                    
  0% Checking cache in '/user/halo/dvc-usecase'|                                                                         |0/? [00:00<?,    ?files/s]loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://am01.halo-telekom.com, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_1000, userName=singhab@HALO-TELEKOM.COM) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
/arrow/cpp/src/arrow/status.cc:137: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed. Detail: [errno 9] Bad file descriptor
2023-06-29 10:42:47,541 ERROR: unexpected error - HDFS connection failed                                                                            
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
  File "/opt/conda/lib/python3.8/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/opt/conda/lib/python3.8/site-packages/dvc/commands/data_sync.py", line 60, in run
    processed_files_count = self.repo.push(
  File "/opt/conda/lib/python3.8/site-packages/dvc/repo/__init__.py", line 65, in wrapper
    return f(repo, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dvc/repo/push.py", line 92, in push
    result = self.cloud.push(
  File "/opt/conda/lib/python3.8/site-packages/dvc/data_cloud.py", line 154, in push
    return self.transfer(
  File "/opt/conda/lib/python3.8/site-packages/dvc/data_cloud.py", line 135, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/opt/conda/lib/python3.8/site-packages/dvc_data/hashfile/status.py", line 178, in compare_status
    dest_exists, dest_missing = status(
  File "/opt/conda/lib/python3.8/site-packages/dvc_data/hashfile/status.py", line 149, in status
    odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback)
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/db.py", line 406, in oids_exist
    return list(wrap_iter(remote_oids, callback))
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/db.py", line 36, in wrap_iter
    for index, item in enumerate(iterable, start=1):
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/db.py", line 354, in list_oids_exists
    in_remote = self.fs.exists(paths, batch_size=jobs)
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 352, in exists
    if self.fs.async_impl:
  File "/opt/conda/lib/python3.8/site-packages/funcy/objects.py", line 47, in __get__
    return prop.__get__(instance, type)
  File "/opt/conda/lib/python3.8/site-packages/funcy/objects.py", line 25, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/opt/conda/lib/python3.8/site-packages/dvc_hdfs/__init__.py", line 58, in fs
    return HadoopFileSystem(**self.fs_args)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 79, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/arrow.py", line 278, in __init__
    fs = HadoopFileSystem(
  File "pyarrow/_hdfs.pyx", line 95, in pyarrow._hdfs.HadoopFileSystem.__init__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: HDFS connection failed

2023-06-29 10:42:47,630 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-06-29 10:42:47,630 DEBUG: Removing '/home/singhab/.FxZzopJoqjfYPoAjwtVGz6.tmp'
2023-06-29 10:42:47,636 DEBUG: Removing '/home/singhab/.FxZzopJoqjfYPoAjwtVGz6.tmp'
2023-06-29 10:42:47,645 DEBUG: Removing '/home/singhab/.FxZzopJoqjfYPoAjwtVGz6.tmp'
2023-06-29 10:42:47,650 DEBUG: Removing '/home/singhab/dvc-example/.dvc/cache/.hEYTKG7bHugmicwMbkzNsk.tmp'
2023-06-29 10:42:47,665 DEBUG: Version info for developers:
DVC version: 2.58.2 (conda)
---------------------------
Platform: Python 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.22.0
        dvc_render = 0.5.3
        dvc_task = 0.2.1
        scmrepo = 1.0.3
Supports:
        hdfs (fsspec = 2023.6.0, pyarrow = 12.0.0),
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.161)
Config:
        Global: /home/singhab/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs4 on fs-a461edff.efs.eu-central-1.amazonaws.com:/halo-claim-singhab-jupyter-pvc-2486a8de-d78a-4515-8ea0-cf2fe89befe2
Caches: local
Remotes: hdfs, s3
Workspace directory: nfs4 on fs-a461edff.efs.eu-central-1.amazonaws.com:/halo-claim-singhab-jupyter-pvc-2486a8de-d78a-4515-8ea0-cf2fe89befe2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/e621ece895c6241383df59f56935951d

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-06-29 10:42:47,669 DEBUG: Analytics is enabled.
2023-06-29 10:42:47,708 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpdlov741k']'
2023-06-29 10:42:47,711 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpdlov741k']'
efiop commented 1 year ago

Looks like something is wrong with your credentials/config. Does the hdfs CLI work?
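
For example, something along these lines (a rough sketch; the namenode and path are taken from the log above, adjust as needed):

# quick connectivity check against the same namenode and directory the remote points at
hdfs dfs -ls hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase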

subodhdere commented 1 year ago

Hello, the HDFS CLI is working. I've also shared the .dvc/config file for more info.

singhab@jupyter-singhab-jupyter:~/dvc-example$ hdfs dfs -ls hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase
Found 1 items
-rw-rw-r--+  3 singhab hadoop          0 2023-06-15 09:45 hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase/test.txt

===================================================

singhab@jupyter-singhab-jupyter:~/dvc-example$ cat .dvc/config
[core]
    remote = myremotes3
['remote "myremote"']
    url = hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase
    user = singhab
efiop commented 1 year ago

@subodhdere So what credentials are you using, and how? Kerberos, maybe?

Overall seems like a configuration issue.

subodhdere commented 1 year ago

Hello Team, we are using Kerberos for authentication.

efiop commented 1 year ago

@subodhdere You need to specify the kerb_ticket option, see https://dvc.org/doc/user-guide/data-management/remote-storage/hdfs#hdfs-configuration-parameters
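
For example, a minimal sketch using the values already in this thread (the remote name, ticket cache path, and principal may differ in your setup):

# verify the ticket cache exists and has not expired (path from the error message)
klist -c FILE:/tmp/krb5cc_1000
# attach that cache to the existing remote; --local keeps it out of the committed config
dvc remote modify --local myremote kerb_ticket FILE:/tmp/krb5cc_1000

With --local, the value is written to .dvc/config.local rather than .dvc/config, so the ticket path never gets committed to git.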

subodhdere commented 1 year ago

Hello Team, I have executed the provided commands related to Kerberos authentication.

dvc remote modify --local myremote kerb_ticket FILE:/tmp/krb5cc_1000
dvc remote add -d myremote hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase
dvc remote modify --local myremote user "singhab@HALO-TELEKOM.COM"

You can refer to the below message in the error output.

hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_1000, userName=singhab@HALO-TELEKOM.COM) error:

efiop commented 1 year ago

@subodhdere Seems like the error is cut off.

subodhdere commented 1 year ago

Hello Team, the complete error is the same as the full log I posted above; please refer to that.

loveis98 commented 11 months ago

@subodhdere @efiop Hi! I have the same error. Any updates? @subodhdere How did you deal with the error?

My traceback:

loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://cdp, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_298426831_298426831, userName=user05) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
/arrow/cpp/src/arrow/status.cc:137: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed. Detail: [errno 9] Bad file descriptor
ERROR: unexpected error - HDFS connection failed
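
For anyone else hitting this: the loadFileSystems / java.lang.NoClassDefFoundError pair in these logs is what pyarrow's libhdfs layer prints when the embedded JVM cannot find Hadoop's Java classes, which usually means CLASSPATH is not set. A minimal environment sketch, assuming a local Hadoop installation (every path below is a placeholder for your cluster):

# JAVA_HOME and HADOOP_HOME are placeholders; point them at your JVM and Hadoop install
export JAVA_HOME=/usr/lib/jvm/java
export HADOOP_HOME=/opt/hadoop
# directory containing libhdfs.so; only needed if pyarrow cannot locate it on its own
export ARROW_LIBHDFS_DIR="$HADOOP_HOME/lib/native"
# libhdfs loads Hadoop's Java classes via CLASSPATH; without it,
# loadFileSystems fails with java.lang.NoClassDefFoundError
export CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob)"

After exporting these in the same shell, re-run dvc push -r myremote -v.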