libAtoms / ExtXYZ.jl

Extended XYZ read/write support for Julia
MIT License

Reading in frames from an IOBuffer #46

Closed joegilkes closed 12 months ago

joegilkes commented 12 months ago

I have some code that generates XYZ geometries, and I need these to be converted into ExtXYZ frames. I can write these geometries to temporary files and read them back in as frames from those files, but this becomes incredibly taxing on disk IO when passing around thousands of XYZs, which I unfortunately have to do quite regularly.

I can write each geometry to a string, but I can't find a way to then read this string back in as a frame, since read_frames only works with a string for the path to the XYZ file, or with an IOStream to a file.

Is there some way of extending the current read_frames implementation to work with IOBuffers? That way I could write the string-form XYZ directly to an IOBuffer and read back in as a frame entirely within memory, which would be much faster.

jameskermode commented 12 months ago

The I/O is actually all done in C code in libextxyz, using POSIX file descriptors with functions like fprintf(), fgets(), etc. If it were possible to get a valid file descriptor from a Julia IOBuffer then we could pass this along, similar to the way I allow Julia IOStream objects to be passed in via this cfopen() method:

https://github.com/libAtoms/ExtXYZ.jl/blob/master/src/fileio.jl#L16

From a quick look, however, there is no fd(::IOBuffer) method defined. If you can find/suggest a way to do this I'd be happy to add it.
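This is quick to confirm from the REPL (a minimal check; `fd` here is `Base.fd`, which is defined for file-backed streams only):

```julia
# Base.fd is defined for IOStream (real files) but not for IOBuffer,
# which is purely in-memory and has no OS-level file descriptor.
println(hasmethod(fd, Tuple{IOStream}))   # true: files have descriptors
println(hasmethod(fd, Tuple{IOBuffer}))   # false: no fd for in-memory buffers
```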

jameskermode commented 12 months ago

As a workaround you could create a ramdisk and do the I/O there: that might not be an option on machines where you are not root, however.
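On most Linux systems a RAM-backed filesystem is already available without root: /dev/shm is typically a world-writable tmpfs. A minimal sketch of this workaround (assuming Linux with /dev/shm mounted; `xyz_str` stands in for the in-memory XYZ string):

```julia
# Write the in-memory XYZ string to a file on tmpfs (RAM-backed, no disk I/O),
# read it back with the normal ExtXYZ.jl API, then clean up.
xyz_str = "2\ncomment line\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
path, io = mktemp("/dev/shm")   # temporary file on the tmpfs mount
write(io, xyz_str)
close(io)
# frames = read_frames(path)    # unchanged ExtXYZ.jl call, now RAM-backed
rm(path)
```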

jameskermode commented 12 months ago

Alternatively there is a pure Python implementation of read_frame() here (pass use_cextxyz=False to avoid using the C extension). This version could be fairly easily modified to read from in-memory strings rather than doing I/O from disk. I would guess that this would still be slower than the fast C code, but I could be wrong about that.

A pure Julia parser based on the ExtXYZ Lark grammar would be possible using Lerche.jl, but this is a much bigger undertaking.

joegilkes commented 12 months ago

As far as I can tell, there's no way of associating a file descriptor with an IOBuffer; they're just not set up in a way where that makes sense. I could be wrong, though; I'll ask around a bit.

A ramdisk would be usable, but yeah this is mostly problematic on networked nodes where I don't have root access. The performance hit is manageable on my laptop where all IO is done on an NVMe drive, but anything slower is a bit painful.

The Python implementation could be workable, I can have a look at that.

joegilkes commented 12 months ago

On second thoughts, the Python version would require a bit more modification, since it doesn't currently appear to have an option to return the frames as Dicts, instead converting them straight to ASE Atoms objects.

jameskermode commented 12 months ago

From memory I think the dict version is there too, just one level deeper: read_frame_dicts() perhaps.

joegilkes commented 12 months ago

Ah yep, got it. I'd seen that function but didn't realise it was extracting the data in the same way. It would require a bit of reorganisation on my end to get the arrays key behaving the same way, but otherwise everything maps over nicely.

joegilkes commented 12 months ago

Something like this does the trick:

using PyCall

pyextxyz = pyimport("extxyz")

function xyz_to_frame(xyz_str::String)
    # Round-trip the string through an IOBuffer to keep everything in memory,
    # then hand the individual lines to the pure-Python parser.
    iob = IOBuffer(xyz_str)
    na, info, arrays, _ = pyextxyz.extxyz.read_frame_dicts(split(String(take!(iob)), '\n'), use_regex=false)
    close(iob)

    # Convert the Python info dict into a Julia Dict{String, Any}.
    info = Dict{String, Any}(key => val for (key, val) in info)
    if na == 1
        # Monoatomic frames come back as 0-D NumPy arrays; .item() unwraps them.
        arrays = Dict{String, Any}("species" => [arrays.item()[1]], "pos" => reduce(hcat, [arrays.item()[2]]))
    else
        arrays = Dict{String, Any}("species" => [a[1] for a in arrays], "pos" => reduce(hcat, [a[2] for a in arrays]))
    end
    frame = Dict{String, Any}("N_atoms" => na, "info" => info, "arrays" => arrays)
    return frame
end

There were a few tricky hurdles. For my input string-form XYZs (generated by OpenBabel) I had to use use_regex=false; I'm not sure why, but I got errors with it enabled. A clause for monoatomic XYZs also had to be added, since they end up yielding 0-D NumPy arrays, which have to be handled differently. This works well enough that the following holds:

# ExtXYZ.jl implementation
frame = read_frame(xyz_file)

# Python extxyz extension
my_frame = xyz_to_frame(xyz_str)

@assert frame == my_frame

I haven't done any formal benchmarking to see the performance hit/gain yet, but anecdotally loading in a large chemical reaction network's worth of XYZs was much speedier than before.

I have a lot of points in my workflow that require geometries to be passed back and forth between ExtXYZ form and OpenBabel (through PyCall), so I also implemented the reverse case (molecules loaded in as ExtXYZ frames with read_frames -> OpenBabel Molecules without writing to disk):

function frame_to_xyz(frame::Dict{String, Any})
    na = frame["N_atoms"]
    s = "$na\n"
    # NB: info values containing spaces or quotes would need escaping
    # to form a valid extended XYZ comment line.
    comment = join(["$key=$value" for (key, value) in frame["info"]], " ")
    s *= "$comment\n"
    # One line per atom: species followed by x, y, z coordinates.
    for i in 1:na
        s *= "$(frame["arrays"]["species"][i]) $(frame["arrays"]["pos"][1, i]) $(frame["arrays"]["pos"][2, i]) $(frame["arrays"]["pos"][3, i])\n"
    end
    return s
end

pbmol = pybel.readstring("xyz", frame_to_xyz(my_frame))

This may also be contributing a lot to the speedup. I'll mark this as closed here, but if you're interested I can have a go at benchmarking the difference more thoroughly later in the week.

Thanks!