nlesc-sigs / data-sig

Linked data, data & modeling SIG

Deep rank data infrastructure #10

Closed: c-martinez closed this issue 6 years ago

c-martinez commented 6 years ago

@ridderl can you answer these questions?

ridderl commented 6 years ago

sample.zip

What is your final goal?

Apply deep learning to score protein-protein docking conformations.

What is the challenge?

We have lots of complexes, each with many conformations, for which we need to compute multiple feature datasets and map them onto 3D grids. The data can and should be compressed well. We want to be flexible in terms of adding new complexes, features, maps, etc., and we want to be able to generate and query the data in a parallel/distributed way. The data includes multiple types (ASCII data and NumPy arrays). We have already moved from a hierarchical folder structure to HDF5.
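The layout described above can be sketched with h5py as follows. This is a minimal illustration, not the project's actual writer: the group and feature names ("1AK4_100w", "AtomicDensities_diff") follow the sample file, while the chunking and gzip settings are assumptions chosen to show how per-grid compression could be configured.

```python
import numpy as np
import h5py

# Sketch: one group per conformation, with each mapped feature stored as a
# gzip-compressed 30x30x30 grid. Compression/chunking choices are assumptions.
with h5py.File("sample_demo.h5", "w") as f:
    grid = np.random.rand(30, 30, 30).astype("float32")
    f.create_dataset(
        "1AK4_100w/mapped_features/AtomicDensities_diff/C",
        data=grid,
        compression="gzip",      # lossless, works well on smooth grid data
        compression_opts=4,
        chunks=(30, 30, 30),     # whole-grid chunks: each feature is read in one go
    )

with h5py.File("sample_demo.h5", "r") as f:
    dset = f["1AK4_100w/mapped_features/AtomicDensities_diff/C"]
    print(dset.shape, dset.compression)  # (30, 30, 30) gzip
```

Because HDF5 compresses per chunk, sizing chunks to match the access pattern (here, one whole feature grid) keeps reads cheap while still compressing well.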

A sample of your data (if possible).

Unzip the attached zip file.

Reading the data


import h5py

# open the file
f = h5py.File(filename, 'r')

# all the conformations stored in the file
f.keys()

# the data of one molecule
list(f[<mol_name>].keys())
# for each molecule group we have:
# 'complex'         : the atomic positions
# 'native'          : atomic positions of a reference molecule
# 'features'        : some properties of the molecule
# 'grid_points'     : the coordinates of the grid
# 'mapped_features' : the molecular properties mapped on the grid
# 'targets'         : some targets for deep learning

# different molecular features mapped on the grid
list(f[<mol_name>/mapped_features/])

# actual data: a 30x30x30 grid containing the data
f[<mol_name>/mapped_features/<feature_name>/<feat>]

# e.g. (in h5py >= 3.0 use [()] instead of the removed .value attribute)
f['1AK4_100w/mapped_features/AtomicDensities_diff/C'][()]

We want to access a subset of a large dataset; for example, load only specific mapped features for some molecules. So far this is done by walking through the h5py file and selecting the mapped features specified by the user, but there might be a better solution.
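The walk-and-select approach described above might look like the sketch below. The helper name `load_features` and its return structure are hypothetical; the group layout (`<mol>/mapped_features/<feature>/<dataset>`) follows the sample file.

```python
import h5py

def load_features(path, molecules, features):
    """Load only the requested mapped features for the requested molecules.

    `molecules` and `features` are user-supplied name lists; entries missing
    from the file are silently skipped. Returns a dict keyed by
    (molecule, feature, dataset_name). This helper is a sketch, not the
    project's actual API.
    """
    out = {}
    with h5py.File(path, "r") as f:
        for mol in molecules:
            if mol not in f:
                continue
            grp = f[mol].get("mapped_features")
            if grp is None:
                continue
            for feat in features:
                if feat in grp:
                    # each feature group holds one dataset per atom type (e.g. 'C')
                    for name, dset in grp[feat].items():
                        out[(mol, feat, name)] = dset[()]
    return out
```

Reading only the requested datasets this way avoids pulling whole molecule groups into memory, though h5py still performs one I/O round trip per dataset.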

Which technologies are you using to store and access the data?