Closed MarDiehl closed 1 year ago
No, there are no implied limitations; you merely run out of memory.
If you're using the document object model (i.e. loading a document), your memory usage depends on the number of nodes your file contains.
Another option is streaming event manipulation, which consumes almost no memory, but you have to process your data on the fly.
I'm curious what top displays when you load this huge file.
The error I was reporting before was related to a bug on my side. I use libfyaml from Fortran to convert to flow style, which is easy to parse.
I've now written a small test program in pure C (given below) which I run on a system with 1 TB of main memory. top
reports at most about 6% memory usage, but the test crashes with
<memory-@0x1457428e4010-0x145aac3054a9>:39979158:1: error: flow sequence without a closing bracket
- [1.0000053790139103, -1.0902280641739
^
Command terminated by signal 11
Command being timed: "./a.out"
User time (seconds): 95.30
System time (seconds): 35.25
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:10.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 66927676
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 25025753
Voluntary context switches: 8
Involuntary context switches: 12395
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
(I ran it with /usr/bin/time -v.)
The YAML chunk shown in the error is not aligned with any limit I can think of; in the file (close to 2 GB in size), it sits at about 80% of the lines according to vim.
The test code is
/* Unix */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strlen */
#include <libfyaml.h>

void to_flow(char **flow, long *length_flow, const char *mixed) {
    struct fy_document *fyd = NULL;
    enum fy_emitter_cfg_flags emit_flags = FYECF_MODE_FLOW_ONELINE |
        FYECF_STRIP_LABELS | FYECF_STRIP_TAGS | FYECF_STRIP_DOC;

    fyd = fy_document_build_from_string(NULL, mixed, -1);
    if (!fyd) {
        *length_flow = -1;
        return;
    }
    int err = fy_document_resolve(fyd);
    if (err) {
        fy_document_destroy(fyd);
        *length_flow = -1;
        return;
    }
    *flow = fy_emit_document_to_string(fyd, emit_flags);
    *length_flow = (long) strlen(*flow);
    fy_document_destroy(fyd);
}

int main(void) {
    char *mixed = 0;
    char *flow = 0;
    long length_mixed;
    long length_flow;
    FILE *f;

    f = fopen("mixed.yaml", "rb");
    if (f) {
        fseek(f, 0, SEEK_END);
        length_mixed = ftell(f);
        fseek(f, 0, SEEK_SET);
        mixed = malloc(length_mixed);
        if (mixed)
            fread(mixed, 1, length_mixed, f);
        fclose(f);
    }
    if (mixed) {
        printf("len(mixed) %ld\n", length_mixed);
        to_flow(&flow, &length_flow, mixed);
        printf("len(flow) %ld\n", length_flow);
        f = fopen("flow.yaml", "w+");
        fputs(flow, f);
        fclose(f);
    }
}
Lol, 1TB of memory...
OK, I'll take a look...
Do you mind also posting the output of uname -a?
uname -a
is Linux maws05 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux, and gcc is gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.04).
Note: The test code needed an update; the string was not terminated. Probably this caused the issue.
Note 2: I edited this post and removed wrong statements.
I have now a working test setup: Two YAML files:
2147483182 Feb 14 23:49 2Gb-.yaml
2147483990 Feb 14 22:10 2Gb+.yaml
The first one (slightly below the 2 GB limit) works; the log from running the test with /usr/bin/time -v
is:
len(mixed) 2147483182
len(flow) 2023268549
Command being timed: "./a.out"
User time (seconds): 202.59
System time (seconds): 27.11
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:07.95
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 68544944
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 17135849
Voluntary context switches: 47357
Involuntary context switches: 21979
Swaps: 0
File system inputs: 0
File system outputs: 3951712
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
top reports about 6.5% memory usage on a 1 TB machine. This is consistent with the amount of memory reported by /usr/bin/time.
With the second file, the code stalls; top reports 0.2% memory usage.
The test code is a fixed version of the code given above:
/* Unix */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strlen */
#include <libfyaml.h>

void to_flow(char **flow, long *length_flow, const char *mixed) {
    struct fy_document *fyd = NULL;
    enum fy_emitter_cfg_flags emit_flags = FYECF_MODE_FLOW_ONELINE |
        FYECF_STRIP_LABELS | FYECF_STRIP_TAGS | FYECF_STRIP_DOC;

    fyd = fy_document_build_from_string(NULL, mixed, -1);
    if (!fyd) {
        *length_flow = -1;
        return;
    }
    int err = fy_document_resolve(fyd);
    if (err) {
        fy_document_destroy(fyd);
        *length_flow = -1;
        return;
    }
    *flow = fy_emit_document_to_string(fyd, emit_flags);
    *length_flow = (long) strlen(*flow);
    fy_document_destroy(fyd);
}

int main(void) {
    char *mixed = 0;
    char *flow = 0;
    long length_mixed;
    long length_flow;
    FILE *f;

    f = fopen("mixed.yaml", "rb");
    if (f) {
        fseek(f, 0, SEEK_END);
        length_mixed = ftell(f);
        fseek(f, 0, SEEK_SET);
        mixed = malloc(length_mixed + 1);
        if (mixed) {
            fread(mixed, 1, length_mixed, f);
            mixed[length_mixed] = '\0'; /* terminate the string */
        }
        fclose(f);
    }
    if (mixed) {
        printf("len(mixed) %ld\n", length_mixed);
        to_flow(&flow, &length_flow, mixed);
        printf("len(flow) %ld\n", length_flow);
        f = fopen("flow.yaml", "w+");
        fputs(flow, f);
        fclose(f);
    }
}
I did some basic "debugging by print" and figured out that fy_document_builder_load_document
in fy-docbuilder.c
gets stuck in its while loop for large files.
I added a print statement:
while (!fy_document_builder_is_document_complete(fydb) &&
       (fyep = fy_parse_private(fyp)) != NULL) {
    rc = fy_document_builder_process_event(fydb, fyep);
    printf("rc %d\n", rc);
    fy_parse_eventp_recycle(fyp, fyep);
    if (rc < 0) {
        fyp->stream_error = true;
        return NULL;
    }
}
For files < 2 GB, the screen fills with output (rc appears to be zero the whole time); for files > 2 GB, there is only one print, and the loop never terminates.
Some more information: in the second iteration, fy_parse_internal
in fy_parse_private
does not return.
The loop that never exits is in fy_reader_skip_space.
Possibly addressing this issue.
Please test and report back.
Also note that you don't have to read the full file into memory: if you use the file methods for loading the document, the file will just be memory-mapped and demand-paged in.
It's easy to overlook those things with 1 TB of memory, I guess :)
Works, many thanks for the quick fix!
A few observations from testing this that could be helpful: with icx
(the new, LLVM-based compiler) I get a SIGILL error. I will try different compilation options for icc, icx, and gcc to see if I can get more information. I certainly will consider a more memory-efficient alternative. So far, the memory consumption of the configuration file was not an issue, but for the specific problem I'm working on it has become the bottleneck. It needs a few non-trivial adjustments in my code, and I need to figure out how much it actually saves in comparison to the overall memory required to handle the data.
You are essentially using the convenience interfaces, so these use more memory.
When you are talking about a 14 GB string, is that the whole YAML file as a string? Or is it a single string that takes 14 GB?
There are a number of interfaces that allow zero re-allocation of memory and can even use memory-mapped files as input, namely the fy_token_iter*
family of methods.
These allow you to operate without any extra memory allocations at all, and without having to bring the whole file into memory.
In a couple of weeks I will push a new API that is schema-driven, which in theory would allow very efficient operation (as long as you can describe your data using a schema).
It is a file of 14 GiB that I read into a string to convert it to flow style. Basically, I'm misusing libfyaml as a preprocessor to convert a YAML file with arbitrary syntax into a single-line flow-style string in which all references are resolved. Our Fortran parser lacks the capabilities for complex YAML features but happily accepts this flow-style string.
As far as I understand, the other interfaces save memory by avoiding reading the string into memory in the first place. But if my understanding is correct, that would save 14 GiB out of 450 GiB. Or are there other savings that I miss?
Below is a file that is exemplary for my use case: the root is a dictionary that contains two dictionaries (phase
and homogenization
) and one list (material
). Both dictionaries remain small, but the list can have many entries: 33 million in the case of the 14 GiB string.
The schema-driven approach sounds good because the layout in material
is relatively fixed.
---
homogenization:
  SX:
    N_constituents: 1
    mechanical: {type: pass}
phase:
  Aluminum:
    lattice: cF
    mechanical:
      output: [F, P, F_e, F_p, L_p, O]
      elastic: {type: Hooke, C_11: 106.75e+9, C_12: 60.41e+9, C_44: 28.34e+9}
      plastic:
        type: phenopowerlaw
        N_sl: [12]
        a_sl: 2.25
        atol_xi: 1.0
        dot_gamma_0_sl: 0.001
        h_0_sl-sl: 75.e+6
        h_sl-sl: [1, 1, 1.4, 1.4, 1.4, 1.4, 1.4]
        n_sl: 20
        output: [xi_sl]
        xi_0_sl: [31.e+6]
        xi_inf_sl: [63.e+6]
material:
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [1.0, 0.0, 0.0, 0.0]
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [0.7936696712125002, -0.28765777461664166, -0.3436487135089419, 0.4113964260949434]
        V_e: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [0.3986143167493579, -0.7014883552495493, 0.2154871765709027, 0.5500781677772945]
        V_e: [[0.999, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [0.28645844315788244, -0.022571491243423537, -0.467933059311115, -0.8357456192708106]
        V_e: [[0.999, 0.0, 0.0], [0.0, 0.999, 0.0], [0.0, 1.0, 0.0]]
I've tested with gcc, icc, and icx, with and without debug flags, and with two files: just below and just above 4 GiB. The results are the same:
The sanitizer of gcc reports a memory issue (gcc debug build). Since it seems that there is enough free memory, I would consider it rather a bug or limitation of the memory sanitizer. The icx debug build terminates with SIGKILL. This could also be a compiler limitation. All other jobs are fine.
Build script and log files are attached:
https://github.com/pantoniou/libfyaml/commit/f7493107ee2bec6cca1ca300a3dcffc37ce3cb07
This is a patch that supports streaming alias resolution.
Can you give it a try to resolve one of your files with:
$ fy-tool --dump --streaming --resolve <input.yaml> > <output.yaml>
It should have dramatically smaller memory requirements.
I tried with b045020816676c5148e03babb8350a1ec3c72a8a and a 7.5 GB file.
fy-tool --dump --streaming --resolve
results in a maximum resident set size of 162 GB, in comparison to 311 GB for the non-streaming version (fy-tool --dump --resolve).
The full log is
maws05 ➜ libfyaml_test /usr/bin/time -v fy-tool --dump --streaming --resolve material.yaml > material_out.yaml
Command being timed: "fy-tool --dump --streaming --resolve material.yaml"
User time (seconds): 1049.68
System time (seconds): 54.69
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 18:25.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 170267880
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 40746895
Voluntary context switches: 4
Involuntary context switches: 107087
Swaps: 0
File system inputs: 0
File system outputs: 18647328
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
maws05 ➜ libfyaml_test /usr/bin/time -v fy-tool --dump --resolve material.yaml > material_out2.yaml
Command being timed: "fy-tool --dump --resolve material.yaml"
User time (seconds): 1266.27
System time (seconds): 282.59
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 25:50.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 326518260
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 159496931
Voluntary context switches: 5
Involuntary context switches: 149336
Swaps: 0
File system inputs: 0
File system outputs: 18647328
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
maws05 ➜ libfyaml_test ls -lah
total 26G
drwxr-xr-x 2 m.diehl ma 4.0K Mar 26 18:03 .
drwxr-xr-x 72 m.diehl msu 4.0K Mar 26 18:04 ..
-rw-r--r-- 1 m.diehl ma 7.5G Mar 9 02:11 material.yaml
-rw-r--r-- 1 m.diehl ma 8.9G Mar 26 16:22 material_out.yaml
-rw-r--r-- 1 m.diehl ma 8.9G Mar 26 16:48 material_out2.yaml
A 'quick' comparison to Python (libyaml, I assume): 493 GB and a runtime of almost 4 h (compared to < 20 min for streaming):
maws05 ➜ libfyaml_test /usr/bin/time -v ./test.py
Command being timed: "./test.py"
User time (seconds): 13481.90
System time (seconds): 681.39
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:57:44
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 517126628
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 354181061
Voluntary context switches: 176889
Involuntary context switches: 1362762
Swaps: 0
File system inputs: 0
File system outputs: 17377800
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
#!/usr/bin/env python3
import yaml

x = yaml.load(open('material.yaml','r'), Loader=yaml.CSafeLoader)
with open('material_out3.yaml','w') as f:
    f.write(yaml.dump(x, Dumper=yaml.CSafeDumper))
Well, it's much better, but I guess the vmstats are not quite right. What you're seeing is an artifact of how the malloc implementation prefers to allocate fresh memory instead of trying to re-use it; you are never actually using that much memory.
Can you try using the --null-output option and report back?
It should then report only the parser overhead, not the emitter's.
Something did not sit right with me, so I ran a more thorough debugging session.
It turns out (some) tokens were not released to the correct recycling list when emitting.
Please try streaming mode with https://github.com/pantoniou/libfyaml/commit/378c84f16018baf63c1675e21f53f24b09c5d082
Things should be dramatically better (as in almost no memory consumption over what the input file requires).
P.S. I do enjoy our back and forth, makes for a better YAML processor.
Impressive!
39f774503b4bd84ea595b5bd98e47a35736627b6 gives
maws05 ➜ /tmp /usr/bin/time -v fy-tool --dump --streaming --resolve material.yaml > material_out.yaml
Command being timed: "fy-tool --dump --streaming --resolve material.yaml"
User time (seconds): 776.26
System time (seconds): 16.19
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 13:14.20
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7760872
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 160
Minor (reclaiming a frame) page faults: 181394
Voluntary context switches: 304
Involuntary context switches: 77422
Swaps: 0
File system inputs: 15357568
File system outputs: 18647344
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
maws05 ➜ /tmp ls -lah materi*
-rw-r--r-- 1 m.diehl ma 7.5G Apr 6 01:03 material.yaml
-rw-r--r-- 1 m.diehl ma 8.9G Apr 13 20:12 material_out.yaml
which means 7.4 GB for a 7.5 GB file.
I'm using YAML for the configuration of scientific simulations, and in exceptional cases the file size can exceed 2 GB. When using such a large file, I get
[ERR]: fy_parse_load_document() failed
Could this be related to using
int
for some length-related operations, or is it most likely caused by another limitation? If it is only an integer overflow: would an MR changing
int
to size_t
where needed be welcome? Or is the higher memory consumption not acceptable (having in mind that I'm probably the only person with ridiculously large YAML files)?