visual-layer / fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

How do I specify an 'endpoint' in the code? #132

Closed xings19 closed 1 year ago

xings19 commented 1 year ago

My data is on a private AWS, and I need to specify an 'endpoint' for commands such as 'aws s3 ls' to run correctly. How do I specify an 'endpoint' in the code?

dbickson commented 1 year ago

Hi @xings19, this is not supported yet. Give us one day and we will release a new version where you can define an endpoint. Are you working on Ubuntu 20?

xings19 commented 1 year ago

Thank you very much for your reply. I am using CentOS:

CentOS Linux release 7.6.1810 (Core)

I see that the API in the Colab demo file is very different from the API version available on CentOS. Can you provide a corresponding update? If not, I would just ask you to support specifying an 'endpoint'. Thank you.

dbickson commented 1 year ago

Thanks for clarifying. Yes, the CentOS build has not been updated for a while; it is a legacy version without many users. I will also update the CentOS build to the latest version for you.

xings19 commented 1 year ago

Here is more information. When I execute a piece of code like this:

import fastdup
fastdup.run(input_dir="s3://mybucket/",work_dir="/my/work/dir/")

It outputs:

Going to loop over dir s3://mybucket/

This is equivalent to traversing my input path, and the traversal is achieved by the following command:

aws s3 ls --recursive s3://mybucket/ | awk '{print $4}' | egrep -i '\.bmp$|\.jpg$|\.tiff$|\.giff$|\.jpeg$|\.png$|\.tif$|\.tar$|\.tar.gz$|\.zip$|\.tgz$|\.mp4$|\.avi$'  > /my/work/dir/tmp/files.txt

In fact, the official aws command line supports passing "--endpoint=http://x.x.x.x:x". With that option added, the above command becomes the following and can be executed on my platform:

aws --endpoint=http://x.x.x.x:x s3 ls --recursive s3://mybucket/ | awk '{print $4}' | egrep -i '\.bmp$|\.jpg$|\.tiff$|\.giff$|\.jpeg$|\.png$|\.tif$|\.tar$|\.tar.gz$|\.zip$|\.tgz$|\.mp4$|\.avi$'  > /my/work/dir/tmp/files.txt
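For readers who want to reproduce the listing step outside fastdup, here is a minimal Python sketch of the same idea. It is not fastdup's implementation: the function names `build_list_command` and `filter_media_keys` are hypothetical, the extension list is copied from the egrep pattern quoted above, and the sketch uses the AWS CLI's documented global option `--endpoint-url` rather than the abbreviated `--endpoint`.

```python
import shlex

# Extensions copied as-is from the egrep pattern quoted in this thread.
MEDIA_SUFFIXES = (
    ".bmp", ".jpg", ".tiff", ".giff", ".jpeg", ".png", ".tif",
    ".tar", ".tar.gz", ".zip", ".tgz", ".mp4", ".avi",
)


def build_list_command(s3_uri, endpoint_url=None):
    """Build the argument list for a recursive `aws s3 ls`.

    The endpoint option is added only when a URL is given.
    """
    cmd = ["aws"]
    if endpoint_url:
        cmd.append(f"--endpoint-url={endpoint_url}")
    cmd += ["s3", "ls", "--recursive", s3_uri]
    return cmd


def filter_media_keys(listing):
    """Keep the object keys that end in a known media or archive suffix.

    Each listing line is expected to look like `date time size key`.
    Unlike `awk '{print $4}'`, keys containing spaces are kept whole.
    """
    keys = []
    for line in listing.splitlines():
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[3].lower().endswith(MEDIA_SUFFIXES):
            keys.append(parts[3])
    return keys


if __name__ == "__main__":
    # Print the command instead of running it, so no AWS access is needed.
    print(shlex.join(build_list_command("s3://mybucket/", "http://x.x.x.x:x")))
```

To actually run the listing, the argument list could be passed to `subprocess.run(..., capture_output=True, text=True)` and its standard output handed to `filter_media_keys`.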

I think I have described my problem clearly, thanks a lot for your replies, it will speed up my research a lot.

dbickson commented 1 year ago

Hi @xings19, we have released version 0.907 for CentOS here: https://github.com/visual-layer/fastdup/releases/tag/v0.908 It adds a new, optional environment variable called FASTDUP_S3_ENDPOINT_URL. If it is set, fastdup adds --endpoint-url=[value given in the env variable] to the aws s3 command. Please try it out and let us know if this works for you.

To set the environment variable, run this before starting Python:

export FASTDUP_S3_ENDPOINT_URL=https://path.to.your.endpoint

Or run python with

FASTDUP_S3_ENDPOINT_URL=https://path.to.your.endpoint python3.8 ....
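The same thing can be done from a Python launcher script, which may be handy in job schedulers where editing the shell environment is awkward. The sketch below is only an illustration: the helper name `run_with_endpoint` is hypothetical, and the child process here just echoes the variable back instead of running fastdup.

```python
import os
import subprocess
import sys


def run_with_endpoint(argv, endpoint_url):
    """Run a command with FASTDUP_S3_ENDPOINT_URL set only for that child."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["FASTDUP_S3_ENDPOINT_URL"] = endpoint_url
    return subprocess.run(
        argv, env=env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in child process: prints the variable it received.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['FASTDUP_S3_ENDPOINT_URL'])",
    ]
    result = run_with_endpoint(child, "https://path.to.your.endpoint")
    print(result.stdout.strip())
```

In real use, `argv` would be the command that starts the script calling fastdup, for example `[sys.executable, "dupimg.py"]`.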

xings19 commented 1 year ago

This package seems to have some bugs. When I download it and run it, I get the following error:

Traceback (most recent call last):
  File "/mnt/lustre/username/.conda/envs/MAE/lib/python3.7/site-packages/fastdup/__init__.py", line 92, in <module>
    dll = CDLL(so_file)
  File "/mnt/lustre/username/.conda/envs/MAE/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnnf.so: cannot open shared object file: No such file or directory
Please reach out to fastdup support, it seems installation is missing critical files to start fastdup.
We would love to understand what has gone wrong.
You can open an issue here: https://github.com/visual-layer/fastdup/issues or email us at info@databasevisual.com
Share out output of the command "find /mnt/lustre/username/.conda/envs/MAE/lib/python3.7/site-packages/fastdup "
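The error message asks for the output of a `find` over the installed package directory. A rough Python equivalent that only reports shared libraries is sketched below. The helper names are hypothetical, and the only library assumed to be required is `libnnf.so`, because that is the one named in the traceback.

```python
import importlib.util
from pathlib import Path


def list_shared_objects(package_dir):
    """Return the sorted file names of all shared objects under a directory."""
    return sorted(
        p.name for p in Path(package_dir).rglob("*.so*") if p.is_file()
    )


def missing_libraries(package_dir, required=("libnnf.so",)):
    """Return the required library names that are not present."""
    present = set(list_shared_objects(package_dir))
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # find_spec locates the package without importing it, so this works
    # even when `import fastdup` itself fails on a missing library.
    spec = importlib.util.find_spec("fastdup")
    if spec is None or not spec.submodule_search_locations:
        print("fastdup does not appear to be installed in this environment")
    else:
        package_dir = list(spec.submodule_search_locations)[0]
        print("shared objects:", list_shared_objects(package_dir))
        print("missing:", missing_libraries(package_dir))
```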

I think this is because some dependent files are missing, so I found libnnf.so from the old version of the package I used initially, put it in the corresponding location, and executed the following command:

FASTDUP_S3_ENDPOINT_URL=http://x.x.x.x:x python dupimg.py

A new error has occurred:

terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct null not valid
Aborted

I'm not sure why this happens. Maybe you can fix it?

dbickson commented 1 year ago

Hi @xings19, thanks for your detailed feedback; it helped us pinpoint the problem quickly. Please use this release: https://github.com/visual-layer/fastdup/releases/tag/0.908c and let us know if it works.

dbickson commented 1 year ago

On our side it seems to work now:

[danny_bickson@fastdup-centos7-build cxx]$ export FASTDUP_S3_ENDPOINT_URL=https://s3.amazonaws.com
[danny_bickson@fastdup-centos7-build cxx]$ ./build/Release/src/fastdup s3://visualdb/sku110k --num_images=10 --work_dir=t1 
sh: dpkg: command not found
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
This software is free for non-commercial and academic usage under the Creative Common Attribution-NonCommercial-NoDerivatives 4.0 International license. Please reach out to info@databasevisual.com for licensing options.
Model path is ./UndisclosedFastdupModel.ort
2023-03-22 07:18:44 [INFO] Going to loop over dir s3://visualdb/sku110k

2023-03-22 07:18:45 [INFO] Found total 10 images to run on
[■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] 100% Estimated: 0 M
2023-03-22 07:18:51 [INFO] Found total 10 images to run on
2023-03-22 07:18:51 [INFO] 16) Finished write_index() NN model
2023-03-22 07:18:51 [INFO] Stored nn model index file t1/nnf.index
2023-03-22 07:18:51 [INFO] Total time took 6132 ms
2023-03-22 07:18:51 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %
2023-03-22 07:18:51 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %
2023-03-22 07:18:51 [INFO] Found a total of 14 above threshold images (d>0.850), which are 46.67 %
2023-03-22 07:18:51 [INFO] Found a total of 1 outlier images         (d<0.050), which are 3.33 %
2023-03-22 07:18:51 [INFO] Min distance found 0.796 max distance 0.934
2023-03-22 07:18:51 [INFO] Running connected components for ccthreshold 0.960000

xings19 commented 1 year ago

I had made a small mistake earlier; it works fine now. Thanks for your help!

dbickson commented 1 year ago

Hi @xings19, great to know this is working for you. Feel free to reach out with any additional issues; your feedback helps improve fastdup!

amirmk89 commented 1 year ago

Added to docs under Using AWS endpoints