rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License

Add Support for Reading DIALS `refl` Files #78

Open kmdalton opened 3 years ago

kmdalton commented 3 years ago

As we discussed extensively on the DIALS Slack channel, it is now relatively easy to parse DIALS .refl files without cctbx/DIALS. Newer versions of DIALS encode reflection tables using msgpack, which seems a relatively innocuous dependency to add.

To this end @ndevenish has built a parser that decodes refl tables using numpy. It's nearly complete but may be missing column types. We can find a full list of types in this block. It should be easy to build this into the rs.io submodule as I've done here for example.

There remains the issue of DIALS reflection tables potentially containing some fairly exotic objects (shoeboxes, vectors, matrices). The safest (sadly slowest) thing to do for a first pass is to just default them to objects. We can think about clever solutions later.
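A minimal sketch of the "default them to objects" idea, assuming each column arrives as a type name, a row count, and a raw byte buffer. The `FIXED_WIDTH_TYPES` map and `decode_column` are hypothetical names for illustration, not the actual parser, and the type names and dtypes in the map are assumptions rather than a checked list:

```python
import numpy as np

# Hypothetical map: column type name -> (numpy dtype, values per row).
# The entries are illustrative assumptions, not a verified list of DIALS types.
FIXED_WIDTH_TYPES = {
    "double": (np.float64, 1),
    "int": (np.int32, 1),
    "vec3<double>": (np.float64, 3),
}


def decode_column(type_name, nrows, raw):
    """Decode one column from raw bytes.

    Scalar columns come back as a typed 1D array. Multi-value columns
    (vectors and the like) come back as a 1D object array holding one
    small array per row, which is slow but safe for a first pass.
    """
    dtype, width = FIXED_WIDTH_TYPES[type_name]
    flat = np.frombuffer(raw, dtype=dtype)
    if flat.size != nrows * width:
        raise ValueError(f"unexpected buffer size for column type {type_name}")
    if width == 1:
        return flat.copy()  # copy so the result does not alias the input bytes
    rows = flat.reshape(nrows, width)
    out = np.empty(nrows, dtype=object)
    for i in range(nrows):
        out[i] = rows[i].copy()
    return out


# Example: two rows of a 3-vector column end up as an object column.
raw_vec = np.arange(6, dtype=np.float64).tobytes()
vec_col = decode_column("vec3<double>", 2, raw_vec)
```

Anything not in the map (shoeboxes, strings) raises a `KeyError` here, which a real loader would presumably catch and skip.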

Parsing legacy pickle-based reflection tables is an open question. For the time being, I think we just can't support them. @ndevenish suggests looking here for clues though.

@JBGreisman, let's chat about this early next week and get it up and running. I think this is already mostly there!

kmdalton commented 3 years ago

@PrinceWalnut , you may want to help here too.

ndevenish commented 3 years ago

I've updated it to handle the missing types and added a basic pytest test (run `refl_loader.py --write-test` inside a cctbx environment to write its test file). std::string is also supposed to be handled by the msgpack writer but is rather broken (https://github.com/dials/dials/issues/1858), so I'm pretty sure it's not "in the wild" anywhere.

kmdalton commented 3 years ago

wacky. we'll certainly be skipping any string columns for now at least...

JBGreisman commented 3 years ago

This looks like a pretty solid start. There are a few columns that can be stored in a DataFrame/DataSet, but cannot be written to an MTZ file. I think a decision we will have to make is whether those columns should be skipped, or whether they should be parsed and included in the DataSet.
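To make the two options concrete, here is a rough sketch that splits columns into those that could plausibly go into an MTZ file and those that could not. It assumes, as a simplification, that "MTZ-writable" just means a numeric pandas dtype; `split_mtz_writable` is a hypothetical helper and not part of rs:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_mtz_writable(df):
    """Return (writable, other) column name lists.

    Simplifying assumption: only numeric columns can be written to MTZ.
    """
    writable = [c for c in df.columns if is_numeric_dtype(df[c])]
    other = [c for c in df.columns if c not in writable]
    return writable, other


# Toy table: two numeric columns and one column of per-row vectors (tuples).
df = pd.DataFrame(
    {
        "intensity": [10.5, 3.2],
        "panel": [0, 1],
        "s1": [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
    }
)
writable, other = split_mtz_writable(df)
```

With a split like this, the "skip" option amounts to loading only `writable`, while the "include" option keeps everything in the DataSet and drops `other` only when writing to MTZ.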

It is always possible to add new dtypes if that would improve the behavior of anything (for strings, pandas already has us covered: StringDtype). It is also possible to add multidimensional numpy arrays as columns, but in practice they end up behaving more like "lists of arrays".
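A small illustration of both points using plain pandas and numpy (the column names are made up for the example):

```python
import numpy as np
import pandas as pd

# Strings: pandas' StringDtype can be requested with dtype="string".
labels = pd.Series(["strong", "weak"], dtype="string")

# Per-row vectors: build a 1D object array so each cell holds its own array.
vectors = np.empty(2, dtype=object)
vectors[0] = np.array([0.0, 0.0, 1.0])
vectors[1] = np.array([0.0, 1.0, 0.0])

df = pd.DataFrame({"label": labels, "s1": vectors})

# The vector column has object dtype: effectively a list of arrays.
# Getting a real 2D array back requires stacking the cells explicitly.
stacked = np.stack(list(df["s1"]))
```

The need for the explicit `np.stack` is what "lists of arrays" means in practice: pandas stores the cells as independent Python objects, so vectorized operations on the column are not available without converting first.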