rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Pyarrow.lib.ArrowInvalid #258

Closed WailordHe closed 1 year ago

WailordHe commented 1 year ago

Hi, I tried to deploy the clip back end by following the instructions in clip-retrieval/docs/laion5B_back.md and encountered the following error. It seems to be related to pyarrow. Is there any way to bypass this error? Thanks!

The error:

IO_FLAG_ONDISK_SAME_DIR: updating ondisk filename from /media/nvme/prepared_index/merged_index.ivfdata to laion5B-index/image.index/merged_index.ivfdata

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/clip/bin/clip-retrieval", line 8, in <module>
    sys.exit(main())
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/cli.py", line 16, in main
    fire.Fire(
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 968, in clip_back
    clip_resources = load_clip_indices(
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 941, in load_clip_indices
    clip_resources[name] = load_clip_index(clip_options)
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 895, in load_clip_index
    metadata_provider, ivf_old_to_new_mapping = load_metadata_provider(
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 622, in load_metadata_provider
    metadata_provider = ArrowMetadataProvider(mmap_folder)
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 601, in __init__
    [pa.ipc.RecordBatchFileReader(pa.memory_map(arrow_file, "r")).read_all() for arrow_file in arrow_files]
  File "/home/ma-user/anaconda3/envs/clip/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 601, in <listcomp>
    [pa.ipc.RecordBatchFileReader(pa.memory_map(arrow_file, "r")).read_all() for arrow_file in arrow_files]
  File "pyarrow/ipc.pxi", line 805, in pyarrow.lib._RecordBatchFileReader.read_all
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: flatbuffer size 1920099464 invalid. File offset: 150790983960, metadata length: 224

rom1504 commented 1 year ago

Can you check that you have the full file?

WailordHe commented 1 year ago

Thank you, I redownloaded one of the index files and it works fine now!