rtp-aws / devpost_aws_disaster_response

rtp-aws.org submission for devpost.com AWS Disaster Response Hackathon
MIT License

clean data task - use md5sum to toss any camera images which are identical #3

Closed netskink closed 2 years ago

netskink commented 2 years ago

perhaps not needed but to be pedantic it could be done.

dvntaudio commented 2 years ago

Ok this sounds like more utility

dvntaudio commented 2 years ago

Create a table or array of the image files so they can be compared by MD5 hash.

dvntaudio commented 2 years ago

Run md5sum against the camera uploads folder and print the hashes to stdout, then reject any file whose hash has already been seen.
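A minimal sketch of that idea in Python, assuming the images sit in one flat folder. The function names (`md5_of_file`, `group_by_hash`, `find_duplicates`) are hypothetical and are not taken from the repo. It builds the hash table suggested above and flags every file after the first one that shares a digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(folder):
    """Map each MD5 digest to the name-sorted list of files in `folder` with it."""
    table = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            table[md5_of_file(path)].append(path)
    return dict(table)


def find_duplicates(folder):
    """Return every file whose content matches an earlier file (by sorted name)."""
    duplicates = []
    for paths in group_by_hash(folder).values():
        duplicates.extend(paths[1:])  # keep the first file, flag the rest
    return duplicates
```

This only catches byte-identical files, which is what md5sum would catch too. Images that look the same but were re-encoded would hash differently and would not be flagged.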

netskink commented 2 years ago

Yes, I know how to do it. Add the code to do it, so we can run it periodically with cron.

netskink commented 2 years ago

@dvntaudio look at the crontab and script files. I want to run the scan and then run that tool a minute or so later.

netskink commented 2 years ago

Talked with @ArjunPanwar2005 about this.

netskink commented 2 years ago

Install Cygwin if you are on Windows. It will give you bash.

adrianxdev commented 2 years ago

Check out this PR: https://github.com/rtp-aws/devpost_aws_disaster_response/pull/11. There are 1,400 duplicates. The notebook will delete them, or I can run it on my end and open a PR that deletes the duplicates.

netskink commented 2 years ago

@adrianxdev

I approved the pull request. I was looking to see if it was something pulled with an existing license. It looks like it's your original code. /rockandroll

fwiw, here is another

https://towardsdatascience.com/removing-duplicate-or-similar-images-in-python-93d447c1c3eb

I want to have it as a bash script or straight .py file so I can add it to the cron file. I suppose I can run a notebook, but can you make a .py, please, before we close the issue?
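For reference, a standalone .py along those lines might look roughly like the sketch below. This is not the notebook code from the PR. The `remove_duplicates` function and the `--delete` flag are made up for illustration, and the script assumes a single flat folder of images. It defaults to a dry run so nothing is deleted unless `--delete` is passed.

```python
import argparse
import hashlib
from pathlib import Path


def remove_duplicates(folder, delete=False):
    """Return files whose bytes match an earlier file; unlink them if delete=True."""
    seen = {}      # MD5 hex digest -> first path seen with that content
    removed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            removed.append(path)
            if delete:
                path.unlink()
        else:
            seen[digest] = path
    return removed


def main(argv=None):
    parser = argparse.ArgumentParser(description="Drop byte-identical files.")
    parser.add_argument("folder", nargs="?", help="folder of camera images")
    parser.add_argument("--delete", action="store_true",
                        help="actually delete; the default is a dry run")
    args = parser.parse_args(argv)
    if args.folder is None:
        parser.print_usage()
        return 1
    for path in remove_duplicates(args.folder, delete=args.delete):
        print(("deleted " if args.delete else "duplicate ") + str(path))
    return 0


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as `dedupe_images.py`, it could go in the cron file a minute after the scan entry, for example `python3 dedupe_images.py /path/to/uploads --delete`, with the path replaced by the real uploads folder.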

adrianxdev commented 2 years ago

I got it from here: https://medium.com/@urvisoni/removing-duplicate-images-through-python-23c5fdc7479e with minor tweaks to work in this env.

He has the py here: https://github.com/UrviSoni/remove_duplicate_image/blob/master/duplicate_image_remove.py