materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
https://pymatgen.org
Other
1.48k stars 850 forks source link

Read in POSCAR with `Structure.from_file` seems slow? #3793

Open DanielYang59 opened 4 months ago

DanielYang59 commented 4 months ago

Summary

Import a POSCAR with Structure.from_file() seems quite slow, a simple POSCAR would take around 1.5 seconds (not sure if I'm expecting too high though):

Test setup:

script:

from pymatgen.core.structure import Structure

poscar = Structure.from_file("./POSCAR")

POSCAR file:

Cubic BN
3.57
0.0 0.5 0.5
0.5 0.0 0.5
0.5 0.5 0.0
B N
1 1
Direct
0.00 0.00 0.00
0.25 0.25 0.25

Run time (10 times): 50.44s user 10.03s system 322% cpu 18.727 total

I would not expecting reading in such a simple POSCAR file to take almost 2 seconds, might worth further investigation.

Some preliminary investigation with line_profiler

Profiling log ```bash python3 -m line_profiler -rtmz profile_output.lprof Timer unit: 1e-06 s Total time: 1.21792 s File: /Users/yang/developer/pymatgen/pymatgen/core/structure.py Function: from_str at line 2860 Line # Hits Time Per Hit % Time Line Contents ============================================================== 2860 @classmethod 2861 @line_profiler.profile 2862 def from_str( # type: ignore[override] 2863 cls, 2864 input_string: str, 2865 fmt: FileFormats, 2866 primitive: bool = False, 2867 sort: bool = False, 2868 merge_tol: float = 0.0, 2869 **kwargs, 2870 ) -> Structure | IStructure: 2871 """Reads a structure from a string. 2872 2873 Args: 2874 input_string (str): String to parse. 2875 fmt (str): A file format specification. One of "cif", "poscar", "cssr", 2876 "json", "yaml", "yml", "xsf", "mcsqs", "res". 2877 primitive (bool): Whether to find a primitive cell. Defaults to 2878 False. 2879 sort (bool): Whether to sort the sites in accordance to the default 2880 ordering criteria, i.e., electronegativity. 2881 merge_tol (float): If this is some positive number, sites that 2882 are within merge_tol from each other will be merged. Usually 2883 0.01 should be enough to deal with common numerical issues. 2884 **kwargs: Passthrough to relevant parser. 2885 2886 Returns: 2887 IStructure | Structure 2888 """ 2889 1 5.0 5.0 0.0 fmt_low = fmt.lower() 2890 1 1.0 1.0 0.0 if fmt_low == "cif": 2891 from pymatgen.io.cif import CifParser 2892 2893 parser = CifParser.from_str(input_string, **kwargs) 2894 struct = parser.parse_structures(primitive=primitive)[0] 2895 1 0.0 0.0 0.0 elif fmt_low == "poscar": 2896 1 1217281.0 1e+06 99.9 from pymatgen.io.vasp import Poscar 2897 2898 1 606.0 606.0 0.0 struct = Poscar.from_str(input_string, default_names=None, read_velocities=False, **kwargs).structure 2899 elif fmt_low == "cssr": 2900 from pymatgen.io.cssr import Cssr 2901 2902 cssr = Cssr.from_str(input_string, **kwargs) 2903 struct = cssr.structure 2904 elif fmt_low == "json": 2905 dct = json.loads(input_string) 2906 struct = Structure.from_dict(dct) 2907 elif fmt_low in ("yaml", "yml"): 2908 yaml = YAML() 2909 dct = yaml.load(input_string) 2910 struct = Structure.from_dict(dct) 2911 elif fmt_low == "xsf": 2912 from pymatgen.io.xcrysden import XSF 2913 2914 struct = XSF.from_str(input_string, **kwargs).structure 2915 elif fmt_low == "mcsqs": 2916 from pymatgen.io.atat import Mcsqs 2917 2918 struct = Mcsqs.structure_from_str(input_string, **kwargs) 2919 # fleur support implemented in external namespace pkg https://github.com/JuDFTteam/pymatgen-io-fleur 2920 elif fmt == "fleur-inpgen": 2921 from pymatgen.io.fleur import FleurInput 2922 2923 struct = FleurInput.from_string(input_string, inpgen_input=True, **kwargs).structure 2924 elif fmt == "fleur": 2925 from pymatgen.io.fleur import FleurInput 2926 2927 struct = FleurInput.from_string(input_string, inpgen_input=False).structure 2928 elif fmt == "res": 2929 from pymatgen.io.res import ResIO 2930 2931 struct = ResIO.structure_from_str(input_string, **kwargs) 2932 elif fmt == "pwmat": 2933 from pymatgen.io.pwmat import AtomConfig 2934 2935 struct = AtomConfig.from_str(input_string, **kwargs).structure 2936 else: 2937 raise ValueError(f"Invalid {fmt=}, valid options are {get_args(FileFormats)}") 2938 2939 1 0.0 0.0 0.0 if sort: 2940 struct = struct.get_sorted_structure() 2941 1 1.0 1.0 0.0 if merge_tol: 2942 struct.merge_sites(merge_tol) 2943 1 23.0 23.0 0.0 return cls.from_sites(struct, properties=struct.properties) Total time: 1.22523 s File: /Users/yang/developer/pymatgen/test_import_speed/import.py Function: speed_test_func at line 6 Line # Hits Time Per Hit % Time Line Contents ============================================================== 6 @line_profiler.profile 7 def speed_test_func(poscar="POSCAR") -> None: 8 # Import the original POSCAR 9 1 1225234.0 1e+06 100.0 poscar = Structure.from_file(poscar) 1.22 seconds - /Users/yang/developer/pymatgen/pymatgen/core/structure.py:2860 - from_str 1.23 seconds - /Users/yang/developer/pymatgen/test_import_speed/import.py:6 - speed_test_func ```

Seems most of the runtime comes from from pymatgen.io.vasp import Poscar, not sure how to reduce import time (if possible)?

janosh commented 4 months ago

this is probably import overhead from loading pandas and maybe other big packages. if you start an interactive python session first, wait for it to load and then run poscar = Structure.from_file("./POSCAR") on its own, i expect it will be faster

DanielYang59 commented 4 months ago

Thanks for the input Janosh! I'm not quite sure as I didn't have time to do a thorough investigation (and I don't have extensive experience of import time profiling either). I would look into this later (input from people with more knowledge on this would be hugely appreciated of course).

I think for DFT calculations, input file import should usually not be the bottleneck, but almost two second to read such a simple file sound too much for me.

janosh commented 4 months ago

have a look at https://github.com/materialsproject/pymatgen/issues/3563 which tried to reduce startup time by lazy-importing pandas only in code that actually needs it. that seems to have back-fired though based on this comment https://github.com/materialsproject/pymatgen/issues/3563#issuecomment-1907031585. this takes careful testing than i did in https://github.com/materialsproject/pymatgen/pull/3568 and https://github.com/materialsvirtuallab/monty/pull/604 to make sure any changes really improve startup time