joegilkes closed this issue 12 months ago
The I/O is actually all done in C code in libextxyz, using POSIX file descriptors with functions like fprintf(), fgets(), etc. If it were possible to get a valid file descriptor from a Julia IOBuffer then we could pass this along, similar to the way I allow Julia IOStream objects to be passed in via this cfopen() method:
https://github.com/libAtoms/ExtXYZ.jl/blob/master/src/fileio.jl#L16
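Roughly, the pattern there is to grab the IOStream's underlying POSIX descriptor with fd() and wrap it in a C FILE* via fdopen(). A simplified sketch of that idea (hypothetical helper name, not the actual fileio.jl code, POSIX only):
function with_cfile(f::Function, io::IOStream, mode::AbstractString="r")
    # fd(io) gives the POSIX descriptor behind the IOStream; fdopen() wraps it in
    # a FILE* that C code built around fprintf()/fgets() can operate on directly.
    fp = ccall(:fdopen, Ptr{Cvoid}, (Cint, Cstring), fd(io), mode)
    fp == C_NULL && error("fdopen failed")
    try
        return f(fp)
    finally
        # flush rather than fclose(), since fclose() would also close io's descriptor
        ccall(:fflush, Cint, (Ptr{Cvoid},), fp)
    end
end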
From a quick look, however, there is no fd(::IOBuffer) method defined. If you can find/suggest a way to do this I'd be happy to add it.
As a workaround you could create a ramdisk and do the I/O there, though that might not be an option on machines where you are not root.
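On most Linux systems /dev/shm is already a tmpfs mount, so a sketch of that workaround (assuming /dev/shm exists and is writable) could look like:
using ExtXYZ

# Write the string to a RAM-backed temporary file under /dev/shm (tmpfs on most
# Linux distributions), read it with the normal C-backed reader, then delete it.
function frame_from_string_via_ramdisk(xyz_str::AbstractString)
    path, io = mktemp("/dev/shm")
    try
        write(io, xyz_str)
        close(io)
        return read_frame(path)
    finally
        rm(path; force=true)
    end
end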
Alternatively, there is a pure Python implementation of read_frame() here (pass use_cextxyz=False to avoid using the C extension). This version could be fairly easily modified to read from in-memory strings rather than doing I/O from disk. I would guess that this would still be slower than the fast C code, but I could be wrong about that.
A pure Julia parser based on the ExtXYZ Lark grammar would be possible using Lerche.jl, but this is a much bigger undertaking.
As far as I can tell, there's no way of associating a file descriptor with an IOBuffer; they're just not set up in a way where that makes sense. I could be wrong, though; I'll ask around a bit.
A ramdisk would be usable, but yeah, this is mostly problematic on networked nodes where I don't have root access. The performance hit is manageable on my laptop, where all I/O is done on an NVMe drive, but anything slower is a bit painful.
The Python implementation could be workable, I can have a look at that.
On second thoughts, the Python version would require a bit more modification, since it doesn't currently appear to have an option to return the frames as Dicts, instead converting them straight to ASE Atoms objects.
From memory I think the dict version is there too, just one level deeper: read_frame_dicts() perhaps.
Ah yep, got it. I'd seen that function but didn't realise it was extracting the data in the same way. It would require a bit of reorganisation on my end to get the arrays key behaving the same way, but otherwise everything maps over nicely.
Something like this does the trick:
using PyCall
pyextxyz = pyimport("extxyz")

function xyz_to_frame(xyz_str::String)
    # read_frame_dicts expects an iterable of lines, so split the string on newlines
    iob = IOBuffer(xyz_str)
    na, info, arrays, _ = pyextxyz.extxyz.read_frame_dicts(split(String(take!(iob)), '\n'), use_regex=false)
    close(iob)
    # convert the Python info dict into a Julia Dict
    info = Dict{String, Any}(key => val for (key, val) in info)
    if na == 1
        # monoatomic frames come back as 0-D NumPy arrays, so unwrap them with item()
        arrays = Dict{String, Any}("species" => [arrays.item()[1]],
                                   "pos" => reduce(hcat, [arrays.item()[2]]))
    else
        arrays = Dict{String, Any}("species" => [a[1] for a in arrays],
                                   "pos" => reduce(hcat, [a[2] for a in arrays]))
    end
    frame = Dict{String, Any}("N_atoms" => na, "info" => info, "arrays" => arrays)
    return frame
end
There were a few tricky hurdles: for my input string-form XYZs (generated by OpenBabel) I had to pass use_regex=false, since I got errors with it enabled (not sure why), and a clause for monoatomic XYZs had to be added, since they end up yielding 0-D NumPy arrays which have to be handled differently. This works well enough that the following holds:
# ExtXYZ.jl implementation
frame = read_frame(xyz_file)
# Python extxyz extension
my_frame = xyz_to_frame(xyz_str)
@assert frame == my_frame
I haven't done any formal benchmarking to see the performance hit/gain yet, but anecdotally loading in a large chemical reaction network's worth of XYZs was much speedier than before.
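If I do get round to it, a first pass would probably just be a BenchmarkTools comparison on a single frame, something like this (assuming xyz_file and xyz_str hold the same geometry and xyz_to_frame from above is defined):
using BenchmarkTools, ExtXYZ

# Compare the C-backed file reader against the in-memory Python route.
# Note the first timing includes disk/page-cache I/O, the second doesn't.
@btime read_frame($xyz_file)
@btime xyz_to_frame($xyz_str)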
I have a lot of points in my workflow that require geometries to be passed back and forth between ExtXYZ form and OpenBabel (through PyCall), so I also implemented the reverse case (molecules loaded in as ExtXYZ frames with read_frames -> OpenBabel Molecules, without writing to disk):
function frame_to_xyz(frame::Dict{String, Any})
    na = frame["N_atoms"]
    s = "$na\n"
    # rebuild the comment line from the frame's info dict as key=value pairs
    comment = join(["$key=$value" for (key, value) in frame["info"]], " ")
    s *= "$comment\n"
    # one "species x y z" line per atom
    for i in 1:na
        s *= "$(frame["arrays"]["species"][i]) $(frame["arrays"]["pos"][1, i]) $(frame["arrays"]["pos"][2, i]) $(frame["arrays"]["pos"][3, i])\n"
    end
    return s
end
pybel = pyimport("openbabel.pybel")  # assuming Open Babel 3.x, where pybel lives at openbabel.pybel
pbmol = pybel.readstring("xyz", frame_to_xyz(my_frame))
This may also be contributing a lot to the speedup. I'll mark this as closed here, but if you're interested I can have a go at benchmarking the difference more thoroughly later in the week.
Thanks!
I have some code that generates XYZ geometries, and I need these to be converted into ExtXYZ frames. I can write these geometries to temporary files and read them back in as frames from those files, but this becomes incredibly taxing on disk IO when passing around thousands of XYZs, which I unfortunately have to do quite regularly.
I can write each geometry to a string, but I can't find a way to then read this string back in as a frame, since read_frames only works with a string for the path to the XYZ file, or with an IOStream to a file. Is there some way of extending the current read_frames implementation to work with IOBuffers? That way I could write the string-form XYZ directly to an IOBuffer and read back in as a frame entirely within memory, which would be much faster.
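For reference, the temporary-file round trip I'm currently doing looks roughly like this sketch:
using ExtXYZ

# Current workaround: dump the string-form XYZ to a temporary file, read it back
# with read_frames, and let mktemp clean the file up afterwards; this is the disk
# I/O I'd like to avoid.
function frames_from_string(xyz_str::AbstractString)
    mktemp() do path, io
        write(io, xyz_str)
        close(io)
        read_frames(path)
    end
end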