function for cif file IO and possible support for multi-dataset files

minhuanli commented 1 year ago

Making a record here so we don't forget about it later.

Yesterday I had a discussion with Kevin @kmdalton about how to read structure factor data in cif format. It turns out this can be done with the cif parser from gemmi (although it is not very well documented there).

Mostly Kevin's wisdom:

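import gemmi
import reciprocalspaceship as rs

f = gemmi.cif.read_file("7l84-sf.cif")  # parse the structure factor cif
refl_blocks = gemmi.as_refln_blocks(f)  # reflection blocks found in the file
refl_block = refl_blocks[0]  # take the first one
mtz = gemmi.CifToMtz().convert_block_to_mtz(refl_block)  # convert to a gemmi.Mtz
ds = rs.DataSet.from_gemmi(mtz)  # wrap as an rs.DataSet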

The above can be easily organized into a function like rs.read_cif().
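For the record, a minimal sketch of what such a function could look like, built only from the gemmi calls in the snippet above. The name `read_cif_sketch` and the `block_index` parameter are hypothetical, not an existing API, and the imports sit inside the function only so the sketch can be defined without the dependencies installed.

```python
def read_cif_sketch(cif_path, block_index=0):
    """Hypothetical helper: load one reflection block of a cif file as an rs.DataSet."""
    # Lazy imports (an assumption of this sketch, not a project convention).
    import gemmi
    import reciprocalspaceship as rs

    doc = gemmi.cif.read_file(cif_path)
    refl_blocks = gemmi.as_refln_blocks(doc)

    # Fail with a clear message instead of a bare indexing error.
    n_blocks = len(refl_blocks)
    if not 0 <= block_index < n_blocks:
        raise IndexError(
            f"block_index {block_index} out of range; "
            f"{cif_path} has {n_blocks} reflection block(s)"
        )

    # Same conversion path as the snippet: cif block -> gemmi.Mtz -> rs.DataSet.
    mtz = gemmi.CifToMtz().convert_block_to_mtz(refl_blocks[block_index])
    return rs.DataSet.from_gemmi(mtz)
```

With this shape, `read_cif_sketch("7l84-sf.cif")` would return the first dataset, and the multi-dataset question is reduced to choosing `block_index`.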

One complication: a cif file can hold multiple datasets, so to be general the return value has to represent more than one dataset. Maybe a generator, rs.read_cif(cif: str) -> generator? Or, going a step further, we could implement a new class like DataSetCollection with methods for handling multiple datasets. A lot of decisions to make here.
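To make the generator option concrete, here is a rough sketch. Everything named here (`iter_datasets`, `iter_cif_datasets`) is hypothetical, and using `rb.block.name` as the dataset label is an assumption about what a useful key would be. The file is parsed when the function is called, but each block is only converted when the caller asks for it.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

Block = TypeVar("Block")
Data = TypeVar("Data")


def iter_datasets(
    blocks: Iterable[Block],
    name_of: Callable[[Block], str],
    convert: Callable[[Block], Data],
) -> Iterator[Tuple[str, Data]]:
    """Lazily yield (name, dataset) pairs, converting one block at a time."""
    for block in blocks:
        yield name_of(block), convert(block)


def iter_cif_datasets(cif_path: str):
    """Hypothetical generator flavor of read_cif for multi-dataset files."""
    # Lazy imports so the sketch can be defined without the dependencies installed.
    import gemmi
    import reciprocalspaceship as rs

    doc = gemmi.cif.read_file(cif_path)
    converter = gemmi.CifToMtz()
    return iter_datasets(
        gemmi.as_refln_blocks(doc),
        # Assumption: label each dataset by the name of its cif block.
        name_of=lambda rb: rb.block.name,
        convert=lambda rb: rs.DataSet.from_gemmi(converter.convert_block_to_mtz(rb)),
    )
```

Something like `dict(iter_cif_datasets("7l84-sf.cif"))` would then collect everything into a plain dict keyed by block name, which might be a cheap stand-in while a DataSetCollection class is still undecided.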

JBGreisman commented 1 year ago

I like the concept -- I agree that cif format can and should be supported because it is already handled by gemmi. Regarding multi-dataset formats, a similar issue crops up with mtz files. Unmerged MTZs are currently handled using just a BATCH identifier, but technically the MTZ format supports things like individual unit cell parameters for different batches (particularly useful with serial data).

We do not currently handle this to the full extent that we could, and full support would likely require a new class as you said. I'm not yet sure how to do this cleanly while still maintaining the feel of pandas.

kmdalton commented 1 year ago

i think we can easily support cif just as @minhuanli demonstrated. honestly, writing the test is the hardest part of that PR.

the multi-dataset object is a tricky one. there's probably a decent enough way to implement it, but it is not immediately obvious to me.

JBGreisman commented 1 year ago

cif support was added in #217

rs-station / reciprocalspaceship

function for cif file IO and possible support for multi-dataset files #195