nlesc-sigs / data-sig

Linked data, data & modeling SIG

Deep rank data infrastructure #10

Closed: c-martinez closed this issue 6 years ago

c-martinez commented 6 years ago

@ridderl can you answer these questions?

ridderl commented 6 years ago

sample.zip

What is your final goal?

Apply deep learning to score protein-protein docking conformations.

What is the challenge?

We have lots of complexes, each with many conformations, for which we need to compute multiple feature datasets and map them onto 3D grids. The data can and should be compressed well. We want to be flexible in terms of adding new complexes, features, maps, etc., and we want to be able to generate and query the data in a parallel/distributed way. The data includes multiple types (ASCII data and NumPy arrays). We have already moved from a hierarchical folder structure to HDF5.
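The layout described above can be sketched with h5py as follows. This is a minimal illustration, not the project's actual writer: the group and feature names ("1AK4_100w", "AtomicDensities_diff") follow the sample file, while the chunking and gzip settings are assumptions chosen to show how per-grid compression could be configured.

```python
import numpy as np
import h5py

# Sketch: one group per conformation, with each mapped feature stored as a
# gzip-compressed 30x30x30 grid. Compression/chunking choices are assumptions.
with h5py.File("sample_demo.h5", "w") as f:
    grid = np.random.rand(30, 30, 30).astype("float32")
    f.create_dataset(
        "1AK4_100w/mapped_features/AtomicDensities_diff/C",
        data=grid,
        compression="gzip",      # lossless, works well on smooth grid data
        compression_opts=4,
        chunks=(30, 30, 30),     # whole-grid chunks: each feature is read in one go
    )

with h5py.File("sample_demo.h5", "r") as f:
    dset = f["1AK4_100w/mapped_features/AtomicDensities_diff/C"]
    print(dset.shape, dset.compression)  # (30, 30, 30) gzip
```

Because HDF5 compresses per chunk, sizing chunks to match the access pattern (here, one whole feature grid) keeps reads cheap while still compressing well.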

A sample of your data (if possible).

Unzip the attached zip file.

Reading the data


import h5py

# open the file
f = h5py.File(filename, 'r')

# all the conformations stored in the file
f.keys()

# the data of one molecule
list(f[<mol_name>].keys())
# for each molecule group we have:
# 'complex'         : the atomic positions
# 'native'          : atomic positions of a reference molecule
# 'features'        : some properties of the molecule
# 'grid_points'     : the coordinates of the grid
# 'mapped_features' : the molecular properties mapped on the grid
# 'targets'         : some targets for deep learning

# different molecular features mapped on the grid
list(f[<mol_name>/mapped_features/])

# actual data: a 30x30x30 grid containing the data
f[<mol_name>/mapped_features/<feature_name>/<feat>]

# e.g. (in h5py >= 3.0 use [()] instead of the removed .value attribute)
f['1AK4_100w/mapped_features/AtomicDensities_diff/C'][()]

We want to access a subset of a large dataset; for example, load only specific mapped features for some molecules. So far this is done by walking through the h5py file and selecting the mapped features specified by the user, but there might be a better solution.
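The walk-and-select approach described above might look like the sketch below. The helper name `load_features` and its return structure are hypothetical; the group layout (`<mol>/mapped_features/<feature>/<dataset>`) follows the sample file.

```python
import h5py

def load_features(path, molecules, features):
    """Load only the requested mapped features for the requested molecules.

    `molecules` and `features` are user-supplied name lists; entries missing
    from the file are silently skipped. Returns a dict keyed by
    (molecule, feature, dataset_name). This helper is a sketch, not the
    project's actual API.
    """
    out = {}
    with h5py.File(path, "r") as f:
        for mol in molecules:
            if mol not in f:
                continue
            grp = f[mol].get("mapped_features")
            if grp is None:
                continue
            for feat in features:
                if feat in grp:
                    # each feature group holds one dataset per atom type (e.g. 'C')
                    for name, dset in grp[feat].items():
                        out[(mol, feat, name)] = dset[()]
    return out
```

Reading only the requested datasets this way avoids pulling whole molecule groups into memory, though h5py still performs one I/O round trip per dataset.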

Which technologies are you using to store and access the data?