
Configuration for high volume of data #90

Closed · masuij closed this issue 8 years ago

masuij commented 8 years ago

Greetings,

We have some outstanding concerns/questions about production configuration for a large data set, e.g. on the order of 40 billion entries (each representing a small set of vertices and edges), with potentially 25 million entry writes per day. These nodes would form what we expect to be many small trees. We would like recommendations on:

  1. hardware configurations for individual servers
  2. we would ideally prefer not to manage sharding/clustering ourselves, but would like recommendations on how best to manage it, either from the application or at the database level itself
  3. the ideal configuration for partitioning servers, e.g. should some be designated for writes and others for reads, and how many of each?
  4. data migration strategies, e.g. if we wanted to export and then import into GraphX
  5. performance considerations upon scaling up

We've read through the documentation and would appreciate having these points expanded on.

Thanks, J.Masui

SDIPro commented 8 years ago

Hi Justin,

  1. I would lean towards having more RAM and disk space rather than super-fast CPU cores; I suggest 16GB of RAM for starters. It depends greatly on the balance between your writes and your queries: if you're write-heavy, the query cache may not be that beneficial. I'd also recommend SSD disks.
  2. Auto-sharding is a coming feature. So, you may not have to do much on this front in the near future.
  3. Do you have a feel yet on the balance between the two as far as quantity and size?
  4. OrientDB supports exporting data in many formats; the most commonly used is JSON (see the console sketch after this list).
  5. The main issue when scaling up is where the data is located (across how many machines) and how quickly you can retrieve all the connections. The data model design plays a big part in this. OrientDB is good at horizontal scaling for many operations, but it depends on your quorum and synchronous-replication requirements.
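
For point 4, here's a minimal sketch of that export/import workflow from the OrientDB console; the database URLs, credentials, and file paths below are placeholders for your own:

```
orientdb> CONNECT plocal:/data/databases/mydb admin admin

orientdb> EXPORT DATABASE /tmp/mydb.export

orientdb> CREATE DATABASE plocal:/data/databases/mydb_copy admin admin
orientdb> IMPORT DATABASE /tmp/mydb.export.gz
```

The export is written as a gzipped JSON file, so feeding another system such as GraphX would mean a separate transformation job that converts that JSON into whatever input format the target expects.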

My best advice is to create a prototype of the intended workload and data size, then test, test, test, and get help from us with optimizations.
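
As a starting point for such a prototype, here's a minimal OrientDB SQL sketch that seeds a tree-shaped test graph; the Entry/Link class names and the uid property are made up for illustration:

```
CREATE CLASS Entry EXTENDS V
CREATE CLASS Link EXTENDS E
CREATE PROPERTY Entry.uid STRING
CREATE INDEX Entry.uid UNIQUE_HASH_INDEX

CREATE VERTEX Entry SET uid = 'root-1'
CREATE VERTEX Entry SET uid = 'leaf-1'
CREATE EDGE Link FROM (SELECT FROM Entry WHERE uid = 'root-1') TO (SELECT FROM Entry WHERE uid = 'leaf-1')
```

Scripting the vertex/edge creation from your application driver and scaling it up to a realistic daily write volume will tell you far more about throughput under your quorum settings than any general sizing guide.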

-Colin

SDIPro commented 8 years ago

(Answered follow-up via email. Closing...)