project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0

crash with very large mmCIF files (and a large number of data blocks) in gemmi grep #298

Closed: CV-GPhL closed this issue 4 months ago

CV-GPhL commented 4 months ago

Looking at SF mmCIF files for e.g. 5sds (and also 5smj 5smk) using e.g.

wget https://files.rcsb.org/download/5SDS-sf.cif.gz
gemmi grep "_symmetry.*" -t 5SDS-sf.cif.gz

causes a complete crash (and reboot) of an Ubuntu 22.04 Linux box (64 GB memory) - most likely because it first tries to load the whole file, with its 319 data blocks, into memory before doing the grep?

Would it be possible to change this so that it reads and processes each data block in turn? After all, each data block is completely independent of the others and is basically processed as if it were an independent file anyway, right?
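To illustrate what I mean, here is a sketch with the Python API (assuming I'm reading the gemmi.cif bindings right); note that gemmi.cif.read() already parses the whole multi-block document up front, which is exactly the memory spike I'd like to avoid:

```python
import gemmi

# What I'd like gemmi grep to do incrementally, one block at a time.
# As written, read() still pulls the entire document into memory first.
doc = gemmi.cif.read('5SDS-sf.cif.gz')  # ungzips transparently
for block in doc:
    for item in block:
        if item.pair is not None and item.pair[0].startswith('_symmetry.'):
            print(block.name, item.pair[0], item.pair[1])
```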

wojdyr commented 4 months ago

I tried this command and it ran successfully for me, using 3GB of memory:

Maximum resident set size (kbytes): 2940204

The only thing loaded into memory here was the uncompressed content of the gz file (2936157 kB).

What version/build of gemmi do you use?

How much swap space do you have?

To test it without running out of memory, set ulimit -m to some safe value.
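If you are scripting this in Python, you can also set the cap per process; a minimal sketch (the 8 GB value is arbitrary, and it uses RLIMIT_AS, i.e. ulimit -v, the address-space limit that Linux enforces):

```python
import resource
import subprocess

def run_capped(cmd, max_bytes=8 * 1024**3):
    # Cap the child's address space so an oversized parse fails with an
    # allocation error instead of pushing the machine into swap.
    def cap():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=cap)

run_capped(['gemmi', 'grep', '_symmetry.*', '-t', '5SDS-sf.cif.gz'])
```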

CV-GPhL commented 4 months ago

Hi Marcin,

On Tue, Feb 27, 2024 at 01:16:17AM -0800, Marcin Wojdyr wrote:

> I tried this command and it ran successfully for me, using 3GB of memory:
>
> Maximum resident set size (kbytes): 2940204
>
> The only thing loaded into memory here was the uncompressed content of the gz file (2936157 kB).
>
> What version/build of gemmi do you use?

(gemmi 0.6.5-dev (v0.6.4-66-gb2de392))

I'm trying to run multiple similar jobs in parallel, since most of them are for very small, single-datablock examples (and I don't want to run one file at a time when going e.g. through the whole PDB). When it then hits one of those very large files, it crashes my machine.
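Roughly this kind of driver, with a hypothetical directory layout:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Simplified version of what I run: gemmi grep over many SF files in
# parallel; a single huge multi-block file among them takes the box down.
files = sorted(Path('sf-files').glob('*.cif.gz'))

def grep_symmetry(path):
    result = subprocess.run(['gemmi', 'grep', '_symmetry.*', '-t', str(path)],
                            capture_output=True, text=True)
    return result.stdout

with ThreadPoolExecutor(max_workers=8) as pool:
    for out in pool.map(grep_symmetry, files):
        print(out, end='')
```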

> How much swap space do you have?
>
> To test it without running out of memory, set ulimit -m to some safe value.

I could do various things on my side to avoid that kind of problem, yes.

Could you on your side maybe also have a look if handling a multi-datablock mmCIF file (within the gemmi grep tool) could be improved by working on one block at a time? Maybe as an option?

Cheers

Clemens

wojdyr commented 4 months ago

> I'm trying to run multiple similar jobs in parallel, since most of them are for very small, single-datablock examples (and I don't want to run one file at a time when going e.g. through the whole PDB). When it then hits one of those very large files, it crashes my machine.

Is it also using 3GB of memory on your computer?

CV-GPhL commented 4 months ago

Hi Marcin,

On Tue, Feb 27, 2024 at 02:58:29AM -0800, Marcin Wojdyr wrote:

> > I'm trying to run multiple similar jobs in parallel, since most of them are for very small, single-datablock examples (and I don't want to run one file at a time when going e.g. through the whole PDB). When it then hits one of those very large files, it crashes my machine.
>
> Is it also using 3GB of memory on your computer?

It looks like it - with about 60GB of VIRT/RES memory used (looking at "top").

Please note that running

gemmi cif2mtz -B ...

is also quite slow (does that also read the whole CIF file before converting the requested block?). I'm currently looking at 5sbf, and converting the first 18 data blocks takes about 40 min for me here ...

Of course, there is also the question of whether pushing hundreds of reflection data blocks into an mmCIF (for a single PDB deposition) is a good idea in the first place ... but that is more something for the depositor and/or the wwPDB ;-)

Cheers

Clemens

wojdyr commented 4 months ago

> > Is it also using 3GB of memory on your computer?
>
> It looks like it - with about 60GB of VIRT/RES memory used (looking at "top").

I don't understand. Does that command use 3GB or 60GB of memory on your computer? If 3GB is too much, you can uncompress the file, then it will use only a few MBs.
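For example (a sketch using the file from the first post):

```python
import gzip
import shutil
import subprocess

# Decompress to disk first, then grep the plain-text file; gemmi grep
# then doesn't need to hold the uncompressed content in memory.
with gzip.open('5SDS-sf.cif.gz', 'rb') as src:
    with open('5SDS-sf.cif', 'wb') as dst:
        shutil.copyfileobj(src, dst)
subprocess.run(['gemmi', 'grep', '_symmetry.*', '-t', '5SDS-sf.cif'])
```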

CV-GPhL commented 4 months ago

Hi Marcin,

On Tue, Feb 27, 2024 at 04:40:24AM -0800, Marcin Wojdyr wrote:

> > > Is it also using 3GB of memory on your computer?
> >
> > It looks like it - with about 60GB of VIRT/RES memory used (looking at "top").
>
> I don't understand. Does that command use 3GB or 60GB of memory on your computer?

I'm not an expert in memory management/diagnostics ... so I might have checked at the wrong time.

So maybe the "gemmi grep" command is fine and it is the next "cif2mtz" that is using too much memory:

gemmi cif2mtz -B 1 r5sdssf.ent.gz r5sdssf.mtz

takes 83 secs using 33GB of memory, while

gunzip -c r5sdssf.ent.gz | awk '/^data_/{n++;if(n==2)exit}{print}' > tmp.ent && gzip -v tmp.ent && gemmi cif2mtz -B 1 tmp.ent.gz tmp.mtz

takes 0.5 sec using 146MB.

Does that make sense?

Cheers

Clemens

wojdyr commented 4 months ago

I've been using the /bin/time -v command to check peak memory usage, although I don't know how reliable it is.

If you want to convert the first block only with cif2mtz, don't specify -B 1, just the filenames. It will be much faster, because this case is optimized – but it works only if the first block is up to 1GB.

CV-GPhL commented 4 months ago

Hi Marcin,

On Tue, Feb 27, 2024 at 07:29:52AM -0800, Marcin Wojdyr wrote:

> If you want to convert the first block only with cif2mtz, don't specify -B 1, just the filenames. It will be much faster, because this case is optimized – but it works only if the first block is up to 1GB.

Could it be similarly "optimized" for any other block? With some of these very large mmCIF files with 100s of blocks, it is nice that the first one might be done faster, but it is the other (N-1) blocks that slow things down in total ;-)

Of course, I could first split the mmCIF into separate files and then run cif2mtz over each one in turn.
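Something like this rough sketch (the same line-scanning trick as the awk above; naive, since a line starting with data_ inside a ;-delimited text field would fool it):

```python
import gzip

# Split a multi-block SF file: one output .cif per data block,
# named after the block, then run cif2mtz on each piece.
out = None
with gzip.open('r5sdssf.ent.gz', 'rt') as f:
    for line in f:
        if line.startswith('data_'):
            if out:
                out.close()
            out = open(line[5:].rstrip() + '.cif', 'w')
        if out:
            out.write(line)
if out:
    out.close()
```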

Cheers

Clemens

wojdyr commented 4 months ago

Hi Clemens,

if you want to convert all blocks to MTZ files, the fastest way is to do it in one go:

gemmi cif2mtz [options] CIF_FILE --dir=DIRECTORY

Additionally, you can make it faster by compiling gemmi with zlib-ng (if you haven't yet).

It's not possible to optimize reading the nth block in the same way as reading the first block, because we know in advance only where the first block starts. Also, in some files only the first block has a unit cell and wavelength – in such cases the following blocks inherit these values. Reading the nth block could be optimized to some extent, but it'd make the program more complicated. Usually, one wants either the first block or all blocks, so the converter is optimized for these two cases.

CV-GPhL commented 4 months ago

Hi Marcin,

On Wed, Feb 28, 2024 at 02:19:53AM -0800, Marcin Wojdyr wrote:

> if you want to convert all blocks to MTZ files, the fastest way is to do it in one go:
>
> gemmi cif2mtz [options] CIF_FILE --dir=DIRECTORY

Maybe - but I don't want all blocks, nor just the first: only those of a given "type" (e.g. judged by the presence or absence of data items). Maybe I need to first generate all 100s of MTZ files and then delete the ones I don't want ...

Cheers

Clemens

wojdyr commented 4 months ago

Then all the blocks need to be read anyway, to check what items they have. You could, if it's worth it, write a Python script that converts only selected blocks; see: https://gemmi.readthedocs.io/en/latest/hkl.html#mtz-mmcif
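For example, along these lines (a sketch; the column label tested is just a placeholder for whatever defines the right "type" of block, and naming outputs after the block name is an assumption):

```python
import gemmi

doc = gemmi.cif.read('r5sdssf.ent.gz')
cif2mtz = gemmi.CifToMtz()
for rblock in gemmi.as_refln_blocks(doc):
    # Select blocks by the items they contain; 'F_meas_au' is only an
    # example label, adjust it to whatever marks the wanted blocks.
    if 'F_meas_au' not in rblock.column_labels():
        continue
    mtz = cif2mtz.convert_block_to_mtz(rblock)
    mtz.write_to_file(rblock.block.name + '.mtz')
```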