jamesrhester / CrystalInfoFramework.jl

Julia tools for reading Crystallographic Information Framework (CIF) files and dictionaries
GNU General Public License v3.0

Performance issue with large files (~200 MB) #9

Open Guillawme opened 3 years ago

Guillawme commented 3 years ago

Hello,

As mentioned in #8, trying to read a STAR file about 200 MB in size hung "forever" until I canceled the command (I waited a bit more than an hour). The Julia process ate up to 14 GB of RAM (out of 16) and was still slowly claiming more when I decided to cancel. This happened when I tried the commands below in a freshly opened Julia session (maybe I should have read a small file with a similar structure first, to get compilation out of the way before trying to read the large file?).

julia> using CrystalInfoFramework, DataFrames, FilePaths
julia> test = Cif(p"particles.star")
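
Something like the following would presumably rule compilation out (small.star being a hypothetical small file with the same column structure):

julia> using CrystalInfoFramework, DataFrames, FilePaths
julia> Cif(p"small.star")                    # first call pays the compilation cost
julia> @time test = Cif(p"particles.star")   # now times only the actual read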

Here is this particles.star file (link valid for 5 days): https://drop.chapril.org/download/311b4a22f7b03565/#X9xYmmEtcD4A4WZQKjbxvg

I can share even larger star files (up to ~800 MB) if you want to really stress test the package.

jamesrhester commented 3 years ago

I think the issue here is the very large memory usage while the parse tree is being constructed. One fix would be to extract information as soon as a syntactic item is matched, so that only the interesting items are preserved and the rest thrown away, instead of building a massive parse tree first. Lerche, the package that does the parsing, doesn't allow this yet; I'll raise an issue on that package.
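
To sketch the idea (this is not Lerche's actual interface, just an illustration): instead of accumulating every matched item into one big tree, the parser would hand each matched item to a callback straight away, so the caller keeps only the data names it cares about and memory stays bounded.

# Illustration only; not Lerche's real API.
# Tree-building style: every matched item is retained until parsing finishes,
# so memory grows with the size of the file.
build_all(items) = collect(items)

# Streaming style: each matched (name => value) pair goes straight to a
# callback, so memory is bounded by whatever the caller decides to keep.
function stream_parse(items, on_item::Function)
    for item in items
        on_item(item)
    end
    return nothing
end

# Tiny made-up example: keep only two (hypothetical RELION-style) columns.
items  = ["_rlnCoordinateX" => "1021.3", "_rlnImageName" => "mc/img001.mrcs"]
wanted = Set(["_rlnCoordinateX", "_rlnCoordinateY"])
kept   = Pair{String,String}[]
stream_parse(items, p -> first(p) in wanted && push!(kept, p))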

Guillawme commented 3 years ago

In case this helps, I noticed after reporting this issue that the BioStructures.jl package can also read mmCIF (and simple STAR) files, and that its readmultimmcif function only takes a few seconds to read the same large file. I have no idea how different or similar their parser is, though.

jamesrhester commented 3 years ago

Interesting. That parser works by splitting the file into whitespace-separated tokens (handling quoted strings), then working through the tokens to allocate them to data blocks and data names. That's a different paradigm from the general one used here, and clearly much faster.
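
Very roughly, that approach looks something like the sketch below (simplified, not the actual BioStructures code; quoted strings with embedded whitespace and semicolon text fields are ignored for brevity):

# Token-based reader sketch: split each line on whitespace, then assign
# tokens to data blocks, loop_ columns and single data items.
function read_star_sketch(io::IO)
    blocks = Dict{String,Dict{String,Vector{String}}}()
    block = blocks[""] = Dict{String,Vector{String}}()   # values per data name
    loopnames = String[]    # column names of the current loop_
    inheader = false        # still reading the loop_ header?
    pending = ""            # non-loop data name waiting for its value
    col = 0
    for line in eachline(io)
        for tok in split(line)
            startswith(tok, "#") && break              # rest of line is a comment
            if startswith(tok, "data_")
                block = blocks[tok] = Dict{String,Vector{String}}()
            elseif tok == "loop_"
                loopnames = String[]; inheader = true; col = 0
            elseif startswith(tok, "_")
                if inheader
                    push!(loopnames, tok); block[tok] = String[]
                else
                    pending = tok
                end
            elseif pending != ""
                block[pending] = [tok]; pending = ""   # single data item
            else
                inheader = false                       # now in the loop body
                col = col % length(loopnames) + 1
                push!(block[loopnames[col]], tok)
            end
        end
    end
    return blocks
end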