uvagfx / hipi

HIPI: Hadoop Image Processing Interface
http://hipi.cs.virginia.edu
BSD 3-Clause "New" or "Revised" License

Hipi on Spark #31

Open sdikby opened 7 years ago

sdikby commented 7 years ago

Dear HIPI developers,

Do you plan on integrating Apache Spark instead of the old MapReduce? If so, when? Otherwise, could you give me some hints on how to do it? My use case is that I need to classify millions of images, and with MapReduce it will not be as efficient as I need it to be. @sweeneychris @liuliu @voigtlandier @zverham @hafnium
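
For reference, a classification job like that could be sketched on Spark without HIPI at all. This is only a minimal sketch under assumptions: the HDFS paths are made up and classifyImage() is a hypothetical placeholder, not a real model.

```java
// Minimal sketch (not HIPI code): load a directory of images into a Spark RDD
// and apply a placeholder classifier. Paths and classifyImage() are hypothetical.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkImageClassify {

    // Hypothetical classifier stub; swap in a real model (OpenCV, DL4J, ...).
    static String classifyImage(byte[] imageBytes) {
        return imageBytes.length % 2 == 0 ? "even" : "odd";
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("image-classify");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // binaryFiles() yields (path, stream) pairs; each image is one record.
            sc.binaryFiles("hdfs:///data/images/*")
              .mapToPair(pair -> new Tuple2<>(pair._1(), classifyImage(pair._2().toArray())))
              .saveAsTextFile("hdfs:///data/labels");
        }
    }
}
```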

yangboz commented 7 years ago

@sdikby Have you successfully tried HIPI's hibImport.sh with millions of images?

sdikby commented 7 years ago

@yangboz Sorry for the delay. No, I haven't even started using HIPI. My use case is to process millions of images in Hadoop, but I don't think that will be performant enough with MapReduce, or even possible with HIPI, since it hasn't been maintained for about a year now (the last commit was on 12 April).

yangboz commented 7 years ago

@sdikby Thanks for your reply. I totally agree with your comment about the lack of updates to the HIPI source code; I also found code issue #30 with no response. By the way, apart from HIPI, are there any other Hadoop SequenceFile solutions for millions of image files?
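
One framework-free option is to pack the images into a SequenceFile yourself (filename as key, raw bytes as value), which avoids the HDFS small-files problem. A minimal sketch, assuming Hadoop 2.x client libraries on the classpath; the local and HDFS paths are made up:

```java
// Minimal sketch: pack many small image files into one Hadoop SequenceFile
// (filename -> raw bytes). The input/output paths are hypothetical.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("hdfs:///data/images.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // One record per image: key = filename, value = raw JPEG/PNG bytes.
            // Assumes the local directory exists and contains only image files.
            for (File img : new File("/local/images").listFiles()) {
                byte[] bytes = Files.readAllBytes(img.toPath());
                writer.append(new Text(img.getName()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```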

sdikby commented 7 years ago

@yangboz I know of two other tools for image processing, but I haven't tried them yet (I just began my master's thesis :)). There is Mipr: https://github.com/sozykin/mipr and this one: https://github.com/okstate-robotics/hipl. Both are also based on MapReduce; beyond that I don't know the differences between them. Feel free to test them, and I would be happy to get feedback from you.

yangboz commented 7 years ago

@sdikby Thanks for the suggestions, I will try them. My approach comes from http://dinesh-malav.blogspot.com/2015/05/image-processing-using-opencv-on-hadoop.html. It is a great tutorial on CDH (MR1) + HIPI v1 + Ant, but nowadays HIPI is at v2+ and uses gradlew, which is why I am struggling with the code base modifications.

sdikby commented 7 years ago

@yangboz It would also be great to know how the 3 tools/frameworks store images on HDFS (to deal with the block size problem, for example) and the big differences between them (read/write performance from/into HDFS).

yangboz commented 7 years ago

@sdikby Before looking at those 3 tools/frameworks, I also studied existing solutions for image blob storage on Ceph and even Cassandra. Conclusions will be coming soon.

yangboz commented 7 years ago

@sdikby Comparing Mipr: https://github.com/sozykin/mipr (full documentation and the code examples passed) with this one: https://github.com/okstate-robotics/hipl (documentation is missing!).

sdikby commented 7 years ago

@yangboz Good job! What about performance? Did you compare the two in terms of image writes/reads per second? And how do they both store images on HDFS, especially how do they deal with the block size problem?
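
A rough way to measure read throughput (images per second) against a SequenceFile of images could look like the sketch below; the images.seq path is the same hypothetical one as in the earlier sketch, and this only counts records rather than decoding them.

```java
// Rough read-throughput sketch (images/second) for a SequenceFile of images.
// Assumes the (filename -> bytes) layout from the packing sketch above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReadBench {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long count = 0;
        long start = System.nanoTime();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(new Path("hdfs:///data/images.seq")))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // next() advances through the (filename, bytes) records until EOF.
            while (reader.next(key, value)) {
                count++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d images in %.1f s (%.0f images/s)%n",
                count, seconds, count / seconds);
    }
}
```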

yangboz commented 7 years ago

@sdikby There is a paper (please drop me a line if you need it) comparing Hadoop and Spark performance, including indexing and retrieval. According to its results, integrating Hadoop and Spark to process 160k pictures on a 30-node cluster improves efficiency.

sdikby commented 7 years ago

@yangboz Could you please send me this paper? I plan to do a performance test of the 3 cited frameworks in the coming months.