jamesrhester / CrystalInfoFramework.jl

Julia tools for reading Crystallographic Information Framework (CIF) files and dictionaries
GNU General Public License v3.0
12 stars 3 forks source link

Naming and architecture suggestions #8

Open Guillawme opened 3 years ago

Guillawme commented 3 years ago

Hello,

Since the README says

if you see ways to improve the naming or architecture, now is the time to raise an issue.

I suggest checking how much of the STAR format specification is supported by the package, and maybe renaming it to STAR.jl if enough of the specification is supported already and/or if complete (or "complete enough for most use cases") support is within the scope of this package. It seems to me that it is already compatible with STAR files following these conventions from RELION (I tried reading small files, it worked; trying to read a ~200 MB file with the native parser hanged for more than one hour until I cancelled the command; I can open a dedicated issue for that and provide the large file, if you want).

One other nice thing to have, which may not even need changing the architecture, would be a high-level function with which one could do something like read_star("/path/to/file.star") and get out a list of DataFrames (one per table present in the file) with properly typed columns (most of the time String, Int64 or Float64; not sure anything else is ever present in these files). And also the complementary write_star(df::DataFrame, "/path/to/file.star") (single table) and write_star(dfs::Vector{DataFrame}, "/path/to/file.star") (multiple tables).

jamesrhester commented 3 years ago

I've deliberately focused on CIF as that is widely used and recognised in both chemical crystallography and macromolecular crystallography. The package also includes a lot of CIF dictionary infrastructure. The RELION conventions look like they would be completely compatible with CIF, so a read_star method would be easy enough to implement - it would just be the same as reading a CIF.

Can you provide a link to the 200MB file. Currently native speed is about 1MB/2s so a 200Mb file should be 400s, or 7 minutes. There may be a bottleneck in parsing that I can fix quickly. It may also be a memory consumption issue.

Columns can only be typed if there is some knowledge of what their contents should be. CIF uses CIF dictionaries, so if you do associate a CIF dictionary to a CIF data block, correctly-typed columns are returned. In the case of STAR, there is no foolproof way of determining column contents (strings don't have to be delimited, I think) so you probably as a user need to indicate what types you want for columns and so it's probably just as simple to use DataFrame methods for this.

Guillawme commented 3 years ago

I opened a dedicated issue and posted the large file with more info there. See #9.

I mentioned typed columns because this is something starfile.py does automatically, and it is very convenient, but it's possible that it relies on pandas for this (just like reading a CSV file into a pandas dataframe also gives typed columns; but I indeed don't know how it can determine these types automatically). So, it's possible that DataFrames.jl can also guess column types (Base.parse() can do exactly this, it seems).

jamesrhester commented 3 years ago

Suggested enhancements: