Open Guillawme opened 3 years ago
I think the issue here is very large memory usage while the parse tree is being constructed. One fix would be to allow extracting information as soon as a syntactic item is matched, so that only the interesting items are kept and the rest is discarded, instead of building a massive parse tree first. Lerche, the package that does the parsing, doesn't allow this yet. I'll raise an issue on that package.
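To illustrate the idea of extracting items as they are matched, here is a minimal Python sketch. It is not Lerche's API and not how this package works; `stream_rows` and its parameters are hypothetical names, and it assumes a simplified STAR layout in which each loop row sits on one line and values contain no quoted whitespace.

```python
import io

def stream_rows(lines, wanted):
    """Yield one dict per loop row, keeping only the columns in `wanted`.

    Assumption: one row per line, no quoted values containing spaces.
    Nothing is stored beyond the current loop's column names.
    """
    names = []         # data names of the loop currently being read
    in_header = False  # True while reading the names that follow loop_
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line or comment line
        head = fields[0]
        if head.lower() == "loop_":
            names, in_header = [], True
        elif head.lower().startswith("data_"):
            names, in_header = [], False
        elif head.startswith("_"):
            if in_header:
                names.append(head)  # trailing "#1"-style comments are ignored
            else:
                names = []  # plain name/value pair: skipped in this sketch
        elif names:
            in_header = False
            row = dict(zip(names, fields))
            yield {key: row[key] for key in wanted if key in row}

if __name__ == "__main__":
    demo = io.StringIO("data_demo\nloop_\n_x #1\n_y #2\n1 2\n3 4\n")
    for row in stream_rows(demo, ["_x"]):
        print(row)
```

Because the function is a generator, memory use stays proportional to one row rather than to the whole file, which is the property the proposed fix is after.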
In case this helps, I noticed after reporting this issue that the BioStructures.jl package can also read mmCIF (and simple STAR) files, and that its readmultimmcif function only takes a few seconds to read the same large file. I have no idea how different or similar their parser is, though.
Interesting. That parser works by splitting the file into whitespace-separated tokens (handling quoted strings), then working through these tokens to allocate them to data blocks and data names. It is a different paradigm from the general one used here, and clearly much faster.
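The token-based approach described above can be sketched roughly as follows. This is a simplified Python illustration, not the BioStructures.jl code: the names `tokenize`, `kind` and `parse` are hypothetical, quoting is reduced to plain single or double quotes, and semicolon-delimited text fields and save frames are not handled.

```python
import re

# A token is a quoted string or a run of non-whitespace characters.
TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(\S+)""")

def tokenize(text):
    """Yield (token, was_quoted) pairs, dropping '#' comments."""
    for line in text.splitlines():
        for match in TOKEN.finditer(line):
            single, double, bare = match.groups()
            if bare is not None:
                if bare.startswith("#"):
                    break  # rest of the line is a comment
                yield bare, False
            else:
                yield (single if single is not None else double), True

def kind(token):
    """Classify a token as 'block', 'loop', 'name' or 'value'."""
    text, quoted = token
    if quoted:
        return "value"
    if text.lower().startswith("data_"):
        return "block"
    if text.lower() == "loop_":
        return "loop"
    return "name" if text.startswith("_") else "value"

def parse(text):
    """Return {block name: {data name: value or list of values}}."""
    blocks, current = {}, None
    tokens = list(tokenize(text))  # a real reader might avoid this list
    i = 0
    while i < len(tokens):
        k = kind(tokens[i])
        if k == "block":
            current = blocks.setdefault(tokens[i][0][5:], {})
            i += 1
            continue
        if current is None:
            raise ValueError("content before the first data_ block")
        if k == "loop":
            i += 1
            names = []
            while i < len(tokens) and kind(tokens[i]) == "name":
                names.append(tokens[i][0])
                i += 1
            if not names:
                raise ValueError("loop_ without data names")
            columns = [[] for _ in names]
            count = 0
            # Deal values out to the columns in round-robin order.
            while i < len(tokens) and kind(tokens[i]) == "value":
                columns[count % len(names)].append(tokens[i][0])
                count += 1
                i += 1
            current.update(zip(names, columns))
        elif k == "name":
            if i + 1 >= len(tokens) or kind(tokens[i + 1]) != "value":
                raise ValueError(f"no value for {tokens[i][0]}")
            current[tokens[i][0]] = tokens[i + 1][0]
            i += 2
        else:
            raise ValueError(f"unexpected value {tokens[i][0]!r}")
    return blocks
```

The work per token is a single classification and an append, with no intermediate tree, which is consistent with the speed difference reported above.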
Hello,
As mentioned in #8, trying to read a STAR file about 200 MB in size hung "forever" until I canceled the command (I waited a bit more than one hour). The Julia process doing this used up to 14 GB of RAM (out of 16), and was still slowly claiming more when I decided to cancel the command. This happened when I tried the commands below in a freshly opened Julia session (maybe I should have read a small file with a similar structure first, to get compilation out of the way before trying to read the large file?).
Here is this particles.star file (link valid for 5 days): https://drop.chapril.org/download/311b4a22f7b03565/#X9xYmmEtcD4A4WZQKjbxvg
I can share even larger STAR files (up to ~800 MB) if you want to really stress test the package.