rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License
28 stars 12 forks source link

Allow custom suffix specification for `DataSet.stack_anomalous()` #102

Closed dennisbrookner closed 2 years ago

dennisbrookner commented 2 years ago

This change was pretty straightforward. You can now call, for example:

DataSet.stack_anomalous(suffixes=("+", "-"))

I did a little bit more rearranging of the tests in tests_dataset_anomalous.py than was strictly necessary, but I think what's there now will be more adaptable to future approaches. I'm happy to adapt to any style conventions I've broken. Ditto for any changes made to dataset.py itself.

Two remaining thoughts

Two odd behaviors that I did not create, but I also did not fix:

"Suffixes"

The documentation suggests that the suffixes argument looks for suffixes, which isn't actually strictly true; the code

plus_labels = [l for l in self.columns if suffixes[0] in l]

will also find columns that contain the "suffix" in the middle. I'm inclined to think this is fine; I could imagine, for example, you could end up with random white space at the end of a column, or that you could have a column name like I(+)_off. The question is whether to mention in the docstring that this is an option. I think I lean towards no, as it would be more confusing than useful, but I thought I'd mention it.

Explicitly provided custom names will not have their suffixes stripped in the output

As part of this PR, I changed

new_labels = [l.rstrip("(+)") for l in plus_labels]

to

new_labels = [l.rstrip(suffixes[0]) for l in plus_labels]

where your custom suffix will get stripped off of the column name (as desired).

However, it occurs to me that if your manually supplied column names had suffixes other than (+) / (-), say:

DataSet.stack_anomalous(plus_labels=["Iplus", "SIGIplus"], ["Iminus", "SIGIminus"])

then nothing would get stripped, because l.rstrip(suffixes[0]) is still looking for "(+)".

Even worse, if for whatever reason, your suffixes didn't match, say:

DataSet.stack_anomalous(plus_labels=["I(+)", "SIGIplus"], ["I(-)", "SIGIminus"])

then only the (+) suffixes get stripped.

To be clear, the new functionality

DataSet.stack_anomalous(suffixes=("plus", "minus"))

does work here, and strips off the custom suffix as expected.

Of course, this is a pretty random edge case, and it doesn't even break anything, it just gives you weird column names. I don't think there's a great way around this without breaking reverse compatibility, so it's probably fine.

codecov-commenter commented 2 years ago

Codecov Report

Merging #102 (b6cea4e) into main (6a6120b) will decrease coverage by 0.05%. The diff coverage is 100.00%.

:exclamation: Current head b6cea4e differs from pull request most recent head eb2958b. Consider uploading reports for the commit eb2958b to get more accurate results Impacted file tree graph

@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
- Coverage   98.97%   98.91%   -0.06%     
==========================================
  Files          41       41              
  Lines        1561     1567       +6     
==========================================
+ Hits         1545     1550       +5     
- Misses         16       17       +1     
Flag Coverage Δ
unittests 98.91% <100.00%> (-0.06%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
reciprocalspaceship/dataset.py 98.78% <100.00%> (-0.23%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5d7f876...eb2958b. Read the comment docs.

JBGreisman commented 2 years ago

Regarding your comments:

Suffixes

I think it makes sense to change the conditional to specifically look for suffixes. This can be done with something like the following:

plus_labels = [l for l in self.columns if l.endswith(suffixes[0])]

I think this change makes the most sense because of how the argument is named in the call signature. By enforcing the behavior to only work on suffixes, it also clarifies one type of the other corner cases you identified (namely, that friedel-identifiers won't get stripped if they aren't at the end).

Other edge cases

Thanks for pointing out the other use cases. In general, I think of such cases as "user errors", because they do result from explicitly provided arguments that produced an inconsistency.

However, one idea here is that all these cases ultimately result from providing explicit plus_labels and minus_labels for which the suffixes do not match. As such, it would be possible to flag such discrepancies by checking whether the suffixes are present in the respective label. We could raise an ValueError under such scenarios, because it does result from inconsistent arguments being provided to the function.