rs-station / careless

Merge X-ray diffraction data with Wilson's priors, variational inference, and metadata
MIT License

Metadata for CrystFEL stream files #141

Closed ElkeDeZitter closed 9 months ago

ElkeDeZitter commented 10 months ago

Hi,

I would like to try careless on a stream file from CrystFEL, but I cannot find the correct way to specify the metadata. Based on the thermolysis_xfel example and the columns for indexed crystals in the stream file, I tried the following:

careless mono \
  --spacegroups="P 21 21 21" \
  "dhkl,fs/px,ss/px" \
  my_protein.stream \
  careless_merge/my_protein

I tried this with and without the addition of --intensity-key="I", --uncertainty-key="sigma(I)" and --image-layers=2.

Using those arguments, I get errors like:

raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['dhkl', 'fs/px', 'ss/px'], dtype='object')] are in the [columns]"

KeyError: 'sigma(I)'

Using the stream2mtz.py script, I could convert the stream file to mtz, with which careless appears to run fine (metadata: "dHKL,XDET,YDET,ewald_offset,angular_ewald_offset"). However, if I understood correctly, careless could interpret CrystFEL streams directly without the need to convert first.

I am using careless version 0.3.9 installed without gpu-support on a Mac.

kmdalton commented 10 months ago

Hi @ElkeDeZitter,

We do have a test for stream file support, but I am afraid it is not very well documented. I haven't really had many users trying it so far.

Internally, careless just uses the reciprocalspaceship stream file parser. This will provide the following metadata keys:

[ins] In [3]: rs.read_crystfel("crystfel.stream").keys()
Out[3]: 
Index(['I', 'SigI', 'BATCH', 's1x', 's1y', 's1z', 'ewald_offset', 
'angular_ewald_offset', 'XDET', 'YDET'],  dtype='object')

Of these, I would recommend using the scattered beam wavevectors, s1x and s1y, in lieu of XDET and YDET, which are the coordinates within each detector panel. It should be harmless to supply both, but the s1 vectors are more generally useful. When processing serial stills, we typically provide both ewald_offset and angular_ewald_offset, which are the Cartesian distance and angular rotation between the predicted spot centroid and the Ewald sphere.

Additionally, careless will provide dHKL (case sensitive) and image_id (which should be used in lieu of BATCH if you have multiple stream files).
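
As a rough illustration of why the original metadata string failed, the keys can be checked against the available columns before running careless. This is only a sketch: `check_metadata_keys` and `AVAILABLE_KEYS` are hypothetical names, not part of careless or reciprocalspaceship, and the column list is copied by hand from the output above rather than read from a stream file.

```python
# Columns reported by rs.read_crystfel in this thread, plus the two keys
# that careless is said to add itself. Replace with your own file's columns.
AVAILABLE_KEYS = [
    "I", "SigI", "BATCH", "s1x", "s1y", "s1z",
    "ewald_offset", "angular_ewald_offset", "XDET", "YDET",
    "dHKL", "image_id",
]


def check_metadata_keys(requested, available=AVAILABLE_KEYS):
    """Hypothetical helper: split a comma-separated key string and
    return (found, missing), matching names case-sensitively."""
    keys = [k.strip() for k in requested.split(",") if k.strip()]
    found = [k for k in keys if k in available]
    missing = [k for k in keys if k not in available]
    return found, missing


# The keys from the original attempt are CrystFEL column names, so none match.
print(check_metadata_keys("dhkl,fs/px,ss/px"))
# The keys suggested in this reply all match.
print(check_metadata_keys("dHKL,s1x,s1y,ewald_offset,angular_ewald_offset"))
```

Because the match is case-sensitive, `dhkl` is reported as missing even though `dHKL` is available.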

I suggest the following command:

careless mono \
  --spacegroups="P 21 21 21"  \
  --intensity-key="I" \
  --uncertainty-key="SigI" \
  --image-layers=2 \
  "dHKL,s1x,s1y,ewald_offset,angular_ewald_offset"  \
  my_protein.stream \
  careless_merge/my_protein

I'm not sure whether you have stills or rotation images, but if you have stills, the Ewald offset metadata are really essential for good scaling.

Let me know if this solves your problem. I'll leave this issue open and try to add some more info into the CLI help next week.

ElkeDeZitter commented 10 months ago

Hi @kmdalton ,

Thank you for the response and further explanation, which is very helpful (how to get all metadata keys, the difference between s1x and s1y vs XDET and YDET). Now careless runs fine with my stream file, which contains still images (thus I provided ewald_offset,angular_ewald_offset).

I haven't tested the case of multiple stream files.

kmdalton commented 10 months ago

Great! Don't hesitate to reach out for tips. The best/worst thing about careless is that it has a lot of knobs you can tweak. Happy to offer some suggestions. For instance, we have found that using positional encoding on some of the metadata can help with merging serial synchrotron data. You can do this like:

--positional-encoding-keys="s1x,s1y" \
--positional-encoding-frequencies=5

This will add a lot of additional columns to the metadata behind the scenes. Because careless uses the width of the metadata matrix as the default for the width of the neural net layers, this can use a lot of memory if you don't pair it with the --mlp-width flag to override the default. For instance:


--positional-encoding-keys="s1x,s1y" \
--positional-encoding-frequencies=5 \
--mlp-width=10

10 is a pessimistic value. You can often get away with narrower nets if you're memory limited.
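
To give a feel for how quickly the column count grows, here is a minimal sketch of a generic NeRF-style sinusoidal encoding. It is an assumption that this resembles what careless does internally. The function below is illustrative, is not taken from the careless source, and uses random numbers as a stand-in for the s1x and s1y columns.

```python
import numpy as np


def positional_encoding(x, n_freq):
    """NeRF-style encoding: rescale each column to [-1, 1], then take
    sin and cos at frequencies pi * 2**k for k = 0 .. n_freq - 1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    p = 2.0 * (x - lo) / (hi - lo) - 1.0          # shape (n_refl, n_keys)
    freqs = np.pi * 2.0 ** np.arange(n_freq)      # shape (n_freq,)
    fp = (p[:, :, None] * freqs).reshape(len(p), -1)
    return np.concatenate([np.sin(fp), np.cos(fp)], axis=1)


rng = np.random.default_rng(0)
s1 = rng.normal(size=(1000, 2))   # stand-in for the s1x and s1y columns
encoded = positional_encoding(s1, n_freq=5)
print(encoded.shape)   # (1000, 20): 2 keys * 5 frequencies * (sin + cos)

# Weights in one square dense layer grow with the square of its width.
for width in (encoded.shape[1], 10):
    print(width, "->", width * width, "weights per layer")
```

Under these assumptions, two keys with five frequencies already give 20 encoded columns, and a square dense layer of width 20 holds four times as many weights as one of width 10, which is the kind of growth that --mlp-width keeps in check.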