mosdef-hub / gmso

Flexible storage of chemical topology for molecular simulation
https://gmso.mosdef.org
MIT License

Pandas simplification v2 #826

Open CalCraven opened 1 month ago

CalCraven commented 1 month ago

This PR replaces #814 and carries over the features requested there. The merge history of #814 picked up divergent commits when the project switched to ruff linting.

CalCraven commented 1 month ago

From previous PR:

This PR improves how a topology is converted to a dataframe. The conversion currently lives as a method on the topology; it is now being moved to a convert_dataframe.py module. Several formats are available, each giving a convenient default view of a topology:

- publication: gives all the parameter values you would want in a table for publication. This also removes duplicates, so each parameter is listed only once.
- default: a set of default values that are nice to have.
- remove_duplicates: returns a smaller dataframe with duplicate rows removed.
- specific_columns: lets the user specify which columns they want in the dataframe.
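As a rough illustration of what the remove_duplicates format does, here is a minimal sketch in plain Python. It is not the code in this PR: `remove_duplicate_rows` is a hypothetical helper and the rows are illustrative values. On an actual dataframe, `pandas.DataFrame.drop_duplicates` would serve the same purpose.

```python
def remove_duplicate_rows(rows):
    """Return rows with exact duplicates dropped, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(rows))


# Illustrative (atom type name, charge) rows; many sites share one atom type.
rows = [
    ("opls_135", -0.18),
    ("opls_140", 0.06),
    ("opls_140", 0.06),
    ("opls_140", 0.06),
]

unique_rows = remove_duplicate_rows(rows)
```

Rows must be hashable (tuples here) for this approach to work.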

There is also an added function that allows you to generate dataframes that cover the parameters for a set of topologies.
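One way to picture a table covering several topologies is to merge the per-topology parameter rows and record which topologies use each one. The sketch below is only an assumption about the idea, not the function added in this PR; `merge_parameter_rows`, the topology names, and the row values are all hypothetical.

```python
def merge_parameter_rows(rows_by_topology):
    """Map each unique parameter row to the names of the topologies that use it."""
    merged = {}
    for topology_name, rows in rows_by_topology.items():
        for row in rows:
            users = merged.setdefault(row, [])
            if topology_name not in users:
                users.append(topology_name)
    return merged


# Illustrative (atom type name, charge) rows keyed by a made-up topology name.
rows_by_topology = {
    "ethane": [("opls_135", -0.18), ("opls_140", 0.06)],
    "methane": [("opls_138", -0.24), ("opls_140", 0.06)],
}

merged = merge_parameter_rows(rows_by_topology)
```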

Finally, there will be a function that prints the dataframes alongside RDKit molecules whose labels match the dataframe entries.

TODO Checklist:

CalCraven commented 1 month ago

From discussion with @Vtsoch, there are a few more general use cases that would be nice to have working in the arguments for the main function, to_dataframeDict.

dfDict = to_dataframeDict(ptop, parameters="sites", columns=["name", "atom_type.name", "atom_type.parameters", "charge", "molecule.name"], format="remove_duplicates")

This call should be included as an example, since the molecule and parameter info can be hard to find if you don't know how these attributes are parsed. I would even consider adding these attributes to the default format, since they're nice to know.

dfDict = to_dataframeDict(ptop, parameters=["sites", "bonds"], columns=["name"], format="specific_columns")

This currently fails. Instead, I think it should go through the requested columns and grab only the attributes that exist on each component type, skipping the others.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 91.66667% with 13 lines in your changes missing coverage. Please review.

Project coverage is 93.36%. Comparing base (fbac310) to head (b364e48).

| Files | Patch % | Lines |
|---|---|---|
| gmso/external/convert_dataframe.py | 91.55% | 13 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #826      +/-   ##
==========================================
- Coverage   94.11%   93.36%   -0.76%
==========================================
  Files          65       66       +1
  Lines        6870     7005     +135
==========================================
+ Hits         6466     6540      +74
- Misses        404      465      +61
```
