keerthys / krumping


Investigate various options for storing the media files on server. #6

Closed keerthys closed 10 years ago

vikymg commented 10 years ago

The first points to settle are:

  1. Whether a DB is actually required, or whether we can store the files directly in the file system.
  2. If a DB is required, which type fits media files best: SQL or NoSQL. I will update this with the pros and cons of each.
vikymg commented 10 years ago

Commonly suggested approaches.

  1. Store the media files in the file system, plus a database that holds the metadata of each video (URL and other video details). Fetching from the file system is quicker, but it slows down once the number of files grows into the tens of millions. SAN boxes (costing lakhs to crores) will be needed at large scale.
  2. Store each file as a blob in the DB. This gives better file handling, such as easy replication and easy control, but it is comparatively slow because the binary data has to be converted into a usable form every time a media file is fetched. It needs a highly efficient database design to avoid slow fetches, and it requires partitioning as the table grows. If we go with a DB, column-oriented NoSQL databases are a better fit than relational ones. Add your views on this.
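
To make the first approach concrete, here is a minimal sketch of "file on disk, metadata in a DB". It is only an illustration: `sqlite3` stands in for whatever database we end up choosing, and the table name `media`, the helper names, and the hash-based directory layout are all assumptions, not anything decided in this thread. Spreading files across subdirectories is one common way to avoid having millions of files in a single directory.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_metadata_db(db_path=":memory:"):
    """Open the metadata database (sqlite3 here is a stand-in, an assumption)."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS media ("
        "id INTEGER PRIMARY KEY, title TEXT, path TEXT, size_bytes INTEGER)"
    )
    return db


def store_media(db, media_root, title, data):
    """Write the bytes to disk and record their location in the metadata table."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical layout: shard by hash prefix so no directory grows too large.
    target = Path(media_root) / digest[:2] / digest[2:4] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    cur = db.execute(
        "INSERT INTO media (title, path, size_bytes) VALUES (?, ?, ?)",
        (title, str(target), len(data)),
    )
    db.commit()
    return cur.lastrowid


def load_media(db, media_id):
    """Look up the path in the database, then read the file from disk."""
    row = db.execute("SELECT path FROM media WHERE id = ?", (media_id,)).fetchone()
    if row is None:
        raise KeyError(media_id)
    return Path(row[0]).read_bytes()
```

The database only ever holds small rows here, so its size stays manageable even when the media volume is large; the trade-off is that the files and the metadata have to be backed up and kept consistent separately.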
vikymg commented 10 years ago

As an initial step, I am thinking of going with the first approach. Once the prototype is done, we have to compare its performance against RDBMS and NoSQL setups as well before settling on an approach. Hadoop for video streaming also needs to be validated before we come to a conclusion.

What do you think? (Refer to the last comment for further updates on the final conclusion.)

vikymg commented 10 years ago

Supporting links for the above points (for future reference when we revisit the post):

  1. http://www.viiratech.com/tutorials/good-programming-practice/storing-media-files-database-file-system.html
  2. http://stackoverflow.com/questions/154707/what-is-the-best-way-to-store-media-files-on-a-database
  3. http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb

vikymg commented 10 years ago

Superb link. It handles all the questions. Final conclusion: MySQL it is. http://akashkava.com/blog/127/huge-file-storage-in-database-instead-of-file-system/

Add your views, if any, keerthy.

keerthys commented 10 years ago

Nice investigation, and a great start on the various options we have. But IMO we still cannot settle on MySQL.

When we do the math as follows:

Total size = total number of producers * average number of videos uploaded * size of each video

1 lakh producers: 100000 * 10 * 250 MB ~ 230 TB

10K producers: ~ 23 TB
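
The estimate above can be rechecked with a few lines of Python. This is just a sketch of the same multiplication; the only assumption is the unit convention, which is why both are printed: decimal terabytes (10^12 bytes) and binary tebibytes (2^40 bytes).

```python
MB = 10**6    # decimal megabyte, in bytes
TB = 10**12   # decimal terabyte, in bytes
TIB = 2**40   # binary tebibyte, in bytes


def total_bytes(producers, videos_per_producer=10, mb_per_video=250):
    """Producers * average videos per producer * size of each video."""
    return producers * videos_per_producer * mb_per_video * MB


for producers in (100_000, 10_000):
    size = total_bytes(producers)
    print(f"{producers:>7} producers: {size / TB:.1f} TB ({size / TIB:.1f} TiB)")
```

For 1 lakh producers this gives 250.0 TB, or about 227.4 TiB, which matches the rough ~230 figure; for 10K producers it gives 25.0 TB, or about 22.7 TiB.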

Also, if we replicate the DB content, the storage requirement doubles, which increases the size further.

I have assumed 1 lakh users in the above calculation. Of course, it will take a lot of time to scale to that level. But these decisions are hard to change at a later point, so we should do a careful analysis backed by data points and rough estimates of the data requirement.

Since we are allowing video content, the size can grow tremendously, so we should do a better estimate on this front. We should validate our data requirements and check whether each approach can cater to them.

keerthys commented 10 years ago

We can have another task that evaluates the various options against rough estimates of the numbers.

vikymg commented 10 years ago

There are only 2 variable factors here: 1. individual file size, and 2. total number of files. Individual file size is not a problem, as the blob needs to be split into fixed smaller chunks when it is stored in the DB. As for the total number of files: the more files, the bigger the DB. One way to deal with this is to partition the data based on the trend or upload date. And to deal with the scalability issues of huge data growth, in a later phase we can move to MySQL cloud, which supports multiple redundant replications across multiple locations and should cater to our needs. Feel free to add your views.
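
The chunking idea could look roughly like the sketch below. Again, `sqlite3` is only a stand-in so the example runs without a server, and the table name `media_chunk`, the 1 MiB chunk size, and the helper names are assumptions for illustration. The `upload_date` column is there because it is the value a date-based partitioning scheme would key on.

```python
import sqlite3

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk; an arbitrary illustrative value


def open_chunk_db(path=":memory:"):
    """Create the chunk table (sqlite3 is a stand-in for the real database)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS media_chunk ("
        "media_id INTEGER, seq INTEGER, upload_date TEXT, data BLOB, "
        "PRIMARY KEY (media_id, seq))"
    )
    return db


def store_chunks(db, media_id, upload_date, data, chunk_size=CHUNK_SIZE):
    """Split the blob into fixed-size chunks and insert one row per chunk."""
    rows = [
        (media_id, seq, upload_date, data[offset:offset + chunk_size])
        for seq, offset in enumerate(range(0, len(data), chunk_size))
    ]
    db.executemany("INSERT INTO media_chunk VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


def iter_chunks(db, media_id):
    """Yield the chunks in order, so a caller can stream them one at a time."""
    cursor = db.execute(
        "SELECT data FROM media_chunk WHERE media_id = ? ORDER BY seq",
        (media_id,),
    )
    for (chunk,) in cursor:
        yield chunk
```

Reading chunk by chunk means the server never has to hold a whole video in memory, which is the main reason the individual file size stops being a concern.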

keerthys commented 10 years ago

I don't have prior hands-on experience with DBs, so I don't have any specific suggestion at the moment. Based on your investigation, MySQL seems the way to go. We can start prototyping with MySQL and measure performance (for example, with many parallel connections fetching a video).
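
A possible shape for that measurement, as a sketch only: the harness below times many concurrent fetches of the same video. The `fetch` argument is a hypothetical callable that we would write for the prototype (it should open its own DB connection and return the video bytes); nothing here is specific to MySQL, and the function names are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_fetch(fetch, media_id):
    """Run one fetch and return (elapsed seconds, number of bytes returned)."""
    start = time.perf_counter()
    payload = fetch(media_id)
    return time.perf_counter() - start, len(payload)


def run_parallel_fetch(fetch, media_id, connections=20):
    """Fetch the same media item from `connections` threads at once."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        results = list(
            pool.map(lambda _: timed_fetch(fetch, media_id), range(connections))
        )
    latencies = sorted(seconds for seconds, _ in results)
    return {
        "connections": connections,
        "min_s": latencies[0],
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
        "bytes_per_fetch": results[0][1],
    }
```

Running it with increasing values of `connections` against both the file-system prototype and the blob prototype would give comparable numbers for the two approaches.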