rs-station / careless

Merge X-ray diffraction data with Wilson's priors, variational inference, and metadata
MIT License

Exceeding NotebookApp.iopub_data_rate_limit #135

Closed gyuhyeokcho closed 7 months ago

gyuhyeokcho commented 1 year ago

Hi, @kmdalton,

I've been working on scaling and merging multi-crystal diffraction data in Google Colab. However, I've encountered a particular issue. Here's the command I used:

!mkdir -p 1_DOF32
!careless mono \
    --studentt-likelihood-dof=32 \
    --disable-image-scales \
    --merge-half-datasets \
    --iterations=30_000 \
    --test-fraction 0.05 \
    --wilson-prior-b 80 \
    --dmin 2.100 \
    "BATCH,dHKL,Hobs,Kobs,Lobs,XCAL,YCAL,ZCAL,RLP,PEAK,CORR,MAXC,XOBS,YOBS,ZOBS,ALF0,BET0,ALF1,BET1,PSI,ISEG" \
    INTEGRATE_2298defa_hkl2mtz.mtz \
    INTEGRATE_2299beam_hkl2mtz.mtz \
    INTEGRATE_2302save1_hkl2mtz.mtz \
    INTEGRATE_2303defa_hkl2mtz.mtz \
    1_DOF32/DOF32

I received the following output:

Careless version 0.3.9
Metadata column "ISEG" with zero standard deviation will not be standardized.
/usr/local/lib/python3.10/site-packages/careless/io/formatter.py:29: UserWarning: Metadata column "ISEG" with zero standard deviation will not be standardized.
  warnings.warn(message)
IOPub data rate exceeded.
The notebook server will temporarily stop sending output to the client in order to avoid crashing it.
To change this limit, set the config variable --NotebookApp.iopub_data_rate_limit.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

I tried increasing the NotebookApp.iopub_data_rate_limit, but it seems that this value is fixed in Google Colab and cannot be easily changed. I've noticed that others who faced similar messages from Colab bypassed the issue by removing the print statement. Would it be possible to disable the print statement in Careless to resolve this issue?

kmdalton commented 1 year ago

Most of what careless prints to the screen is just the progress bar, which uses tqdm. tqdm actually sends a new line of text to the terminal every time the progress bar is updated. I'm guessing this is what Colab is complaining about. You can disable the bar with the --disable-progress-bar flag.
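If the notebook still hits the rate limit, a generic notebook-side workaround is to send everything the command prints to a log file and show only the last few lines. The sketch below is not a careless feature: `run_logged` is a hypothetical helper built on the standard library, and the demo command is a stand-in that you would replace with the careless argument list (for example `["careless", "mono", ...]`).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(cmd, log_path, tail_lines=20):
    """Run cmd with all output sent to log_path; return (exit code, last lines)."""
    log_path = Path(log_path)
    with log_path.open("w") as log:
        # Merge stderr into stdout so warnings and tracebacks end up in the same file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    lines = log_path.read_text(errors="replace").splitlines()
    return result.returncode, lines[-tail_lines:]


# Stand-in command that prints 10,000 lines; swap in the real argument list.
demo_cmd = [sys.executable, "-c", "for i in range(10000): print('line', i)"]
with tempfile.TemporaryDirectory() as tmp:
    code, tail = run_logged(demo_cmd, Path(tmp) / "run.log", tail_lines=3)

print(code)
print("\n".join(tail))
```

Because the full output stays on disk, a very long error message can be inspected afterwards without the notebook having to render it.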

Unrelated note: If the value of "ISEG" is the same for all your reflections, I would recommend removing it from your metadata kwargs.

gyuhyeokcho commented 1 year ago

I used the --disable-progress-bar flag when running the command and removed the "ISEG" column from the metadata keywords.

Here are the results:

Careless version 0.3.9
IOPub data rate exceeded.
The notebook server will temporarily stop sending output to the client in order to avoid crashing it.
To change this limit, set the config variable --NotebookApp.iopub_data_rate_limit.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

41), (0, -26, -27, 40), (0, -26, -27, 39), (0, -26, -27, 38), (0, -26, -27, 37), (0, -26, -27, 3), (0, 26, 27, 2), (0, 26, 27, 3), (0, 26, 27, 4), (0, 26, 27, 5), (0, 26, 27, 6), (0, 26, 27, 7), (0, 26, 27, 8), (0, -26, -28, 35), (0, -26, -28, 34), (0, -26, -28, 33), (0, -26, -28, 32), (0, -26, -28, ...(thousands of numbers)... 67), (0, -55, 11, 65), (0, -56, 10, 65), (0, -56, 10, 64), (0, -57, 9, 64), (0, -57, 11, 63), (0, -57, 11, 62), (0, -58, 10, 62), (0, -59, 9, 61), (0, -59, 11, 60), (0, -60, 10, 59), (0, -61, 9, 58), (0, -61, 9, 59), (0, -62, 10, 56), (0, -64, 10, 53)] not in index'

A similar "not in index" issue occurred with another multi-crystal dataset.

Careless version 0.3.9
Traceback (most recent call last):
  File "/usr/local/bin/careless", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/careless/careless.py", line 9, in main
    run_careless(parser)
  File "/usr/local/lib/python3.10/site-packages/careless/careless.py", line 30, in run_careless
    inputs,rac = df.format_files(parser.reflection_files)
  File "/usr/local/lib/python3.10/site-packages/careless/io/formatter.py", line 143, in format_files
    return self((load(f) for f in files))
  File "/usr/local/lib/python3.10/site-packages/careless/io/formatter.py", line 121, in __call__
    return self.finalize(data, rac)
  File "/usr/local/lib/python3.10/site-packages/careless/io/formatter.py", line 344, in finalize
    refl_id = rac.to_refl_id(
  File "/usr/local/lib/python3.10/site-packages/careless/io/asu.py", line 172, in to_refl_id
    return self.asu_and_miller_lookup_table.loc[idx, 'id'].to_numpy('int').flatten()
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1097, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1280, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 976, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1077, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1332, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1272, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1462, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2539, in _get_indexer_strict
    return super()._get_indexer_strict(key, axis_name)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2559, in _raise_if_missing
    return super()._raise_if_missing(key, indexer, axis_name)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5941, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: '[(0, -40, -28, 34), (0, -39, -17, 43), (0, -38, -22, 46), (0, -37, -22, 49), (0, -37, -18, 50), (0, -37, -17, 50), (0, -36, -36, 31), (0, -36, -30, 44), (0, -36, -18, 53), (0, -36, -17, 53), (0, -36, -13, 52), (0, -35, -34, 39), (0, -35, -13, 55), (0, -34, -14, 58), (0, -34, -8, 55), (0, -34, -5, 52), (0, -33, -12, 60), (0, -33, -4, 54), (0, -33, 0, 48), (0, -33, 1, 46), (0, -33, 6, 32), (0, -32, -35, 43), (0, -32, -31, 51), (0, ...(thousands of numbers)... (0, -22, -44, 14), (0, -24, -39, 41), (0, -24, -41, 33), (0, -24, -42, 28), (0, -24, -44, 13), (0, -25, -43, 21), (0, -25, -44, 12), (0, 26, -13, 43), (0, -26, -43, 20), (0, -27, 17, 14), (0, -27, -44, 9), (0, -28, 11, 41), (0, -28, -43, 17), (0, -28, -44, 6), (0, -29, 9, 43), (0, -29, -39, 36), (0, -29, -43, 15), (0, -30, -41, 26), (0, -31, 0, 55), (0, -32, -3, 56), (0, -34, -11, 57)] not in index'

kmdalton commented 1 year ago

Hi @gyuhyeokcho,

Can you tell me what the cell and spacegroup are for your inputs? You can get this from rs.mtzdump *.mtz. I think your issue is that your inputs are not isomorphous, and careless doesn't formally support that right now (see: https://github.com/rs-station/careless/issues/35). You can work around the issue by setting the cell and spacegroup to be the same for all the inputs. This is straightforward to do with reciprocalspaceship. Let me know if you need help!

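I think this is an area where careless needs to be improved. I would love feedback from users about what you would actually like to see happen in this scenario. As I see it, the options are:

  • Refuse to merge non-isomorphous cells and provide a helpful error message
  • Use the first cell in the list to determine the ASU for merging
  • Use the average cell to determine the ASU for merging

Perhaps for the final two options, it'd be a good idea to print a warning as well. Let me know what you think. This is a very longstanding issue, and I would love to fix it.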

DorisMai commented 1 year ago

Just throwing another idea in here: my current workaround is adding an argument "--unitcell" that lets me supply a set of cell parameters and ignore the ones from the input mtz files.

I think this is an area where careless needs to be improved. I would love feedback from users about what you would actually like to see happen in this scenario? As I see it the options are:

  • Refuse to merge non-isomorphous cells and provide a helpful error message
  • Use the first cell in the list to determine the ASU for merging
  • Use the average cell to determine the ASU for merging

Perhaps for the final two options, it'd be a good idea to print a warning as well. Let me know what you think. This is a very longstanding issue, and I would love to fix it.
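To make the three options concrete, here is a minimal sketch of how such a policy switch might look. It is only an illustration of the proposal, not how careless behaves: `choose_cell`, the policy names, and the 1% relative tolerance are all assumptions made for this example, and averaging cell parameters is only meaningful when every input shares the same space group and setting.

```python
import warnings
from statistics import mean


def choose_cell(cells, policy="refuse", rel_tol=0.01):
    """Pick one (a, b, c, alpha, beta, gamma) tuple for merging.

    policy is one of "refuse", "first", or "average" (hypothetical names).
    """
    first = cells[0]
    # Treat cells as isomorphous when every parameter is within rel_tol of the first cell.
    isomorphous = all(
        abs(x - y) <= rel_tol * abs(y)
        for cell in cells[1:]
        for x, y in zip(cell, first)
    )
    if policy == "refuse":
        if not isomorphous:
            raise ValueError("Input cells differ by more than the tolerance; refusing to merge.")
        return first
    if policy not in ("first", "average"):
        raise ValueError(f"Unknown policy: {policy!r}")
    if not isomorphous:
        # The last two options proceed anyway, so warn the user.
        warnings.warn(f"Input cells are not isomorphous; using the {policy} cell.")
    if policy == "first":
        return first
    return tuple(mean(values) for values in zip(*cells))


cells = [
    (100.0, 50.0, 80.0, 90.0, 95.0, 90.0),
    (102.0, 50.0, 80.0, 90.0, 95.0, 90.0),
]
print(choose_cell(cells, policy="average"))
```

A user-supplied cell, as in the "--unitcell" workaround mentioned above, would simply bypass this choice.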

kmdalton commented 1 year ago

@DorisMai, we do encourage pull requests around these parts :)

kmdalton commented 1 year ago

Related: #136

gyuhyeokcho commented 1 year ago

Thank you for your feedback! I've noticed that I passed datasets with different space groups to careless. Here are the rs.mtzdump results for these datasets. I'd like to replace the space group and unit cell parameters of the other datasets with those of the first one in each set. Can you advise on how to do this?

Datasets #1

Spacegroup: C2
Extended Hermann-Mauguin name: C 1 2 1
Unit cell dimensions: 166.247 95.476 198.074 90.000 90.394 90.000

Spacegroup: C2
Extended Hermann-Mauguin name: C 1 2 1
Unit cell dimensions: 166.158 95.321 198.066 90.000 90.420 90.000

Spacegroup: P1
Extended Hermann-Mauguin name: P 1
Unit cell dimensions: 96.022 96.358 199.240 89.300 89.946 60.186

Spacegroup: C2
Extended Hermann-Mauguin name: C 1 2 1
Unit cell dimensions: 167.024 95.944 199.729 90.000 90.779 90.000

Datasets #2

Spacegroup: P1
Extended Hermann-Mauguin name: P 1
Unit cell dimensions: 94.120 94.060 194.799 87.683 88.221 60.081

Spacegroup: P1
Extended Hermann-Mauguin name: P 1
Unit cell dimensions: 94.019 94.120 195.117 87.754 88.328 60.158
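The same comparison can be done programmatically. The sketch below uses the values from the "Datasets #1" dumps above with placeholder file names, and flags any input whose space group or cell differs from the first one. `split_by_reference`, `read_record`, and the 2% tolerance are assumptions made for this illustration; `read_record` additionally assumes the gemmi accessors `cell.parameters` and `spacegroup.xhm()` on the objects that reciprocalspaceship returns, and it is not called in the demo.

```python
def read_record(path):
    """Read (name, extended H-M symbol, cell parameters) from an MTZ file."""
    import reciprocalspaceship as rs  # third-party; only needed for real files

    ds = rs.read_mtz(path)
    return (path, ds.spacegroup.xhm(), tuple(ds.cell.parameters))


def split_by_reference(records, rel_tol=0.02):
    """Split record names into those matching the first record and those that do not."""
    _, ref_sg, ref_cell = records[0]
    compatible, incompatible = [], []
    for name, sg, cell in records:
        same_sg = sg == ref_sg
        close = all(abs(x - y) <= rel_tol * abs(y) for x, y in zip(cell, ref_cell))
        (compatible if same_sg and close else incompatible).append(name)
    return compatible, incompatible


# Values copied from the "Datasets #1" dumps; the names are placeholders.
datasets_1 = [
    ("file_1", "C 1 2 1", (166.247, 95.476, 198.074, 90.000, 90.394, 90.000)),
    ("file_2", "C 1 2 1", (166.158, 95.321, 198.066, 90.000, 90.420, 90.000)),
    ("file_3", "P 1", (96.022, 96.358, 199.240, 89.300, 89.946, 60.186)),
    ("file_4", "C 1 2 1", (167.024, 95.944, 199.729, 90.000, 90.779, 90.000)),
]

compatible, incompatible = split_by_reference(datasets_1)
print("compatible:", compatible)      # file_1, file_2, file_4
print("incompatible:", incompatible)  # file_3 (indexed in P 1)
```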

kmdalton commented 1 year ago

Hi @gyuhyeokcho, this is pretty easy to do in Python with reciprocalspaceship, depending on your level of programming experience. Here is a simple script which should do what you want:

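import reciprocalspaceship as rs

# The file whose cell and space group the other files should adopt
reference_mtz = 'dataset_1.mtz'

# The files to update
mtzs = [
    'dataset_2.mtz',
    'dataset_3.mtz',
    'dataset_4.mtz',
]

ds = rs.read_mtz(reference_mtz)
cell = ds.cell
spacegroup = ds.spacegroup

for mtz in mtzs:
    ds = rs.read_mtz(mtz)
    # Overwrite only the header values; the Miller indices are left unchanged
    ds.cell = cell
    ds.spacegroup = spacegroup
    # 'dataset_2.mtz' -> 'dataset_2_updated.mtz'
    output_mtz = mtz[:-4] + "_updated.mtz"
    ds.write_mtz(output_mtz)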

I will caution you that in some cases, reindexing operations need to be applied to the Miller indices in addition to just changing the cell. This is important if you are either changing the space group or have a space group where there are multiple indexing solutions (indexing ambiguity). Neither of these space groups has indexing ambiguities. However, you won't be able to merge a P1-indexed data set with a C2 one without doing some more work. I hesitate to offer more guidance without knowing more about your specific use case.
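For readers unfamiliar with the term, a reindexing operation is a linear map applied to every (h, k, l). The sketch below shows only the mechanics in plain Python. The operator used here (swap h and k, negate l) is purely hypothetical and is not the operator relating the P1 and C2 data in this thread; the correct one depends on how the two cells are related, and conventions differ on whether the matrix acts on rows or columns.

```python
def reindex(hkls, op):
    """Apply a 3x3 integer matrix (given as rows) to each Miller index: new = op @ hkl."""
    return [
        tuple(sum(row[j] * hkl[j] for j in range(3)) for row in op)
        for hkl in hkls
    ]


# Hypothetical operator for illustration only: (h, k, l) -> (k, h, -l)
SWAP_HK_NEGATE_L = (
    (0, 1, 0),
    (1, 0, 0),
    (0, 0, -1),
)

print(reindex([(1, 2, 3), (0, -26, 27)], SWAP_HK_NEGATE_L))
# -> [(2, 1, -3), (-26, 0, -27)]
```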

I hope this makes sense. Please let me know if this solution works for you.

gyuhyeokcho commented 12 months ago

It worked for me. Thank you for your help!