pantoniou / libfyaml

Fully feature complete YAML parser and emitter, supporting the latest YAML spec and passing the full YAML testsuite.
MIT License

support for huge files #70

Closed MarDiehl closed 1 year ago

MarDiehl commented 1 year ago

I'm using YAML for configuration of scientific simulations and in exceptional cases, the file size can exceed 2GB. When using such a large file, I get [ERR]: fy_parse_load_document() failed.

Could this be related to using int for some length-related operations or is it most likely caused by another limitation?

If it is only an integer overflow: would an MR changing int to size_t where needed be welcome? Or is the higher memory consumption not acceptable (keeping in mind that I'm probably the only person with ridiculously large YAML files)?
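
The overflow suspected here is easy to reproduce outside libfyaml. The following is a minimal sketch, not libfyaml code: `length_to_int` is a hypothetical helper that refuses to narrow a `size_t` byte count that does not fit into an `int`. On common 64-bit Linux systems, where `int` is 32 bits wide, that limit is just below 2 GiB.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: narrow a byte count to int only when it fits.
 * Returns false (and leaves *out untouched) if len exceeds INT_MAX. */
bool length_to_int(size_t len, int *out)
{
    if (len > (size_t)INT_MAX)
        return false;
    *out = (int)len;
    return true;
}

/* Self-check: a small length converts, INT_MAX + 1 is rejected. */
void length_to_int_demo(void)
{
    int n = 0;
    assert(length_to_int(1024, &n) && n == 1024);
    assert(!length_to_int((size_t)INT_MAX + 1, &n));
    assert(n == 1024); /* unchanged after the rejected call */
}
```

Code that stores lengths or offsets in `int` without such a check can misbehave once an input crosses that boundary, which is why a file just above 2GB is a natural suspect.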

pantoniou commented 1 year ago

No, there are no implied limitations; you are merely running out of memory.

If you're using the document object model (i.e. loading a document) your memory usage is dependent on the number of nodes your file contains.

Another option is streaming event processing, which needs hardly any memory, but you have to process your data on the fly.

I'm curious what top displays when you load this huge file.

MarDiehl commented 1 year ago

The error I was reporting before was related to a bug on my side. I use libfyaml in Fortran to convert to flow style which is easy to parse.

I've now written a small test program in pure C (given below) which I run on a system with 1TB of main memory. top reports at most about 6% memory usage, but the test crashes with

<memory-@0x1457428e4010-0x145aac3054a9>:39979158:1: error: flow sequence without a closing bracket  
    - [1.0000053790139103, -1.0902280641739                                                         
                                           ^
                                           ^                                                        
Command terminated by signal 11                                                                     
  Command being timed: "./a.out"                                                                    
  User time (seconds): 95.30                                                                        
  System time (seconds): 35.25                                                                      
  Percent of CPU this job got: 99%                                                                  
  Elapsed (wall clock) time (h:mm:ss or m:ss): 2:10.90                                              
  Average shared text size (kbytes): 0                                                              
  Average unshared data size (kbytes): 0                                                            
  Average stack size (kbytes): 0                                                                    
  Average total size (kbytes): 0                                                                    
  Maximum resident set size (kbytes): 66927676                                                      
  Average resident set size (kbytes): 0                                                             
  Major (requiring I/O) page faults: 0                                                              
  Minor (reclaiming a frame) page faults: 25025753                                                  
  Voluntary context switches: 8                                                                     
  Involuntary context switches: 12395                                                               
  Swaps: 0                                                                                          
  File system inputs: 0                                                                             
  File system outputs: 8                                                                            
  Socket messages sent: 0                                                                           
  Socket messages received: 0                                                                       
  Signals delivered: 0                                                                              
  Page size (bytes): 4096                                                                           
  Exit status: 0 

(I ran it with /usr/bin/time -v.)

This YAML chunk is not aligned with any limit I can think of. In a file limited to 2GB, it is at 80% of the lines according to vim.

The test code is

/* Unix */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  /* strlen */
#include <libfyaml.h>

void to_flow(char **flow, long* length_flow, const char *mixed){
  struct fy_document *fyd = NULL;
  enum fy_emitter_cfg_flags emit_flags = FYECF_MODE_FLOW_ONELINE | FYECF_STRIP_LABELS | FYECF_STRIP_TAGS |FYECF_STRIP_DOC;

  fyd = fy_document_build_from_string(NULL, mixed, -1);
  if (!fyd) {
    *length_flow = -1;
    return;
  }
  int err = fy_document_resolve(fyd);
  if (err) {
    *length_flow = -1;
    return;
  }

  *flow = fy_emit_document_to_string(fyd,emit_flags);
  *length_flow = (long) strlen(*flow);

  fy_document_destroy(fyd);
}

int main(void) {

  char *mixed = 0;
  char *flow = 0;
  long length_mixed;
  long length_flow;
  FILE *f;

  f = fopen("mixed.yaml", "rb");
  if (f)
  {
    fseek (f, 0, SEEK_END);
    length_mixed = ftell (f);
    fseek (f, 0, SEEK_SET);
    /* BUG (see the follow-up below): no room for, and no write of, a terminating '\0' */
    mixed = malloc (length_mixed);
    if (mixed)
    {
      fread (mixed, 1, length_mixed, f);
    }
    fclose (f);
  }

  if (mixed)
  {
    printf ("len(mixed) %ld\n",length_mixed);
    to_flow (&flow,&length_flow,mixed);
    printf ("len(flow) %ld\n",length_flow);
    f = fopen ("flow.yaml","w+");
    fputs (flow,f);
    fclose (f);
  }

}
pantoniou commented 1 year ago

Lol, 1TB of memory...

OK, I'll take a look...

Do you mind also doing a uname -a too?

MarDiehl commented 1 year ago

uname -a is Linux maws05 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux, gcc is gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.04).

Note: The test code needed an update: the string was not NUL-terminated. This probably caused the issue.

Note2: I edited this post and removed wrong statements.

MarDiehl commented 1 year ago

I now have a working test setup with two YAML files.

The first one (slightly below the 2GB limit) works, the log from running the test with /usr/bin/time -v is:

len(mixed) 2147483182                                                                               
len(flow) 2023268549                                                                                
  Command being timed: "./a.out"                                                                    
  User time (seconds): 202.59                                                                       
  System time (seconds): 27.11                                                                      
  Percent of CPU this job got: 92%                                                                  
  Elapsed (wall clock) time (h:mm:ss or m:ss): 4:07.95                                              
  Average shared text size (kbytes): 0                                                              
  Average unshared data size (kbytes): 0                                                            
  Average stack size (kbytes): 0                                                                    
  Average total size (kbytes): 0                                                                    
  Maximum resident set size (kbytes): 68544944                                                      
  Average resident set size (kbytes): 0                                                             
  Major (requiring I/O) page faults: 0                                                              
  Minor (reclaiming a frame) page faults: 17135849                                                  
  Voluntary context switches: 47357                                                                 
  Involuntary context switches: 21979                                                               
  Swaps: 0                                                                                          
  File system inputs: 0                                                                             
  File system outputs: 3951712                                                                      
  Socket messages sent: 0                                                                           
  Socket messages received: 0                                                                       
  Signals delivered: 0                                                                              
  Page size (bytes): 4096                                                                           
  Exit status: 0 

top reports about 6.5% memory on a 1TB machine. This is consistent with the amount of memory reported by /usr/bin/time.

With the second file, the code stalls. top reports 0.2% memory usage.

The test code is a fixed version of the code given above:

/* Unix */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libfyaml.h>

/* Convert a YAML string to single-line flow style with all aliases resolved.
 * On failure, *flow is NULL and *length_flow is -1. The caller frees *flow. */
void to_flow(char **flow, long *length_flow, const char *mixed){
  struct fy_document *fyd = NULL;
  enum fy_emitter_cfg_flags emit_flags = FYECF_MODE_FLOW_ONELINE | FYECF_STRIP_LABELS | FYECF_STRIP_TAGS | FYECF_STRIP_DOC;

  *flow = NULL;
  *length_flow = -1;

  fyd = fy_document_build_from_string(NULL, mixed, FY_NT);  /* FY_NT: input is NUL-terminated */
  if (!fyd)
    return;

  if (fy_document_resolve(fyd) == 0) {
    *flow = fy_emit_document_to_string(fyd, emit_flags);
    if (*flow)
      *length_flow = (long) strlen(*flow);
  }

  fy_document_destroy(fyd);  /* also reached when resolving fails */
}

int main(void) {

  char *mixed = NULL;
  char *flow = NULL;
  long length_mixed = 0;
  long length_flow = -1;
  FILE *f;

  f = fopen("mixed.yaml", "rb");
  if (f)
  {
    fseek (f, 0, SEEK_END);
    length_mixed = ftell (f);
    fseek (f, 0, SEEK_SET);
    if (length_mixed >= 0)
      mixed = malloc ((size_t)length_mixed + 1);
    if (mixed)
    {
      /* terminate after the bytes actually read */
      size_t n = fread (mixed, 1, (size_t)length_mixed, f);
      mixed[n] = '\0';
    }
    fclose (f);
  }

  if (!mixed)
    return EXIT_FAILURE;

  printf ("len(mixed) %ld\n", length_mixed);
  to_flow (&flow, &length_flow, mixed);
  printf ("len(flow) %ld\n", length_flow);
  if (flow)
  {
    f = fopen ("flow.yaml", "w");
    if (f)
    {
      fputs (flow, f);
      fclose (f);
    }
  }

  free (flow);
  free (mixed);
  return length_flow >= 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
MarDiehl commented 1 year ago

I did some basic "debugging by print" and figured out that fy_document_builder_load_document in fy-docbuilder.c gets stuck in its while loop for large files.

I added a print statement:

  while (!fy_document_builder_is_document_complete(fydb) &&
    (fyep = fy_parse_private(fyp)) != NULL) {
    rc = fy_document_builder_process_event(fydb, fyep);
    printf("rc %d\n",rc);
    fy_parse_eventp_recycle(fyp, fyep);
    if (rc < 0) {
      fyp->stream_error = true;
      return NULL;
    }
  }

For files < 2GB, the screen fills with output (rc seems to be zero all the time); for files > 2GB, there is only one print and the loop does not terminate.

MarDiehl commented 1 year ago

Some more information: in the second iteration, fy_parse_internal (called from fy_parse_private) does not return.

MarDiehl commented 1 year ago

The loop that never exits is in fy_reader_skip_space.

pantoniou commented 1 year ago

Possibly addressing this issue.

Please test and report back.

Also note that you don't have to read the full file into memory: if you use the file methods for loading the document, the file is just memory mapped and its pages are brought in on demand.

It's easy to overlook those things with 1TB of memory I guess :)
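
To illustrate what "memory mapped and brought in on demand" means in general, here is a minimal POSIX sketch. It is not libfyaml's implementation; `count_lines_mapped` and the scratch file name are made up for this example. The file is mapped read-only and the kernel loads pages only as the loop touches them, so no read buffer the size of the file is allocated.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Counts '\n' bytes in a file through a read-only mapping.
 * Returns -1 on error. */
long long count_lines_mapped(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long long lines = 0;
    for (size_t i = 0; i < len; i++)
        if (p[i] == '\n')
            lines++;

    munmap((void *)p, len);
    return lines;
}

/* Self-check on a small scratch file (the name is arbitrary). */
void count_lines_mapped_demo(void)
{
    const char *path = "mapped_demo.txt";
    FILE *f = fopen(path, "w");
    assert(f != NULL);
    fputs("a: 1\nb: 2\nc: 3\n", f);
    fclose(f);
    assert(count_lines_mapped(path) == 3);
    remove(path);
}
```

In libfyaml itself the corresponding entry point would presumably be one of the file-based loaders such as fy_document_build_from_file, used in place of fy_document_build_from_string.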

MarDiehl commented 1 year ago

works, many thanks for the quick fix!

A few observations from testing this that could be helpful:

  1. It works for a 14GB string, so besides main memory there should be no (practical) limits.
  2. Memory consumption is high: Converting the string requires 450 GiB. The string is stored twice, which amounts to 'only' 27 GiB.
  3. When using Intel icx (the new, LLVM-based compiler) I get a SIGILL error. I will try different compilation options for icc, icx, and gcc to see if I can get more information.

I certainly will consider a more memory-efficient alternative. So far, the memory consumption of the configuration file was not an issue, but for the specific problem I'm working on it became the bottleneck. It needs a few non-trivial adjustments in my code and I need to figure out how much it actually saves in comparison to the overall memory required to handle the data.

pantoniou commented 1 year ago

You are essentially using the convenience interfaces, and these use more memory.

When you are talking about a 14GB string, is that the whole YAML file as a string? Or is it a single string that's taking 14GB?

There are a number of interfaces that allow zero re-allocation of memory and can even use memory mapped files as input, namely the fy_token_iter* family of methods.

These allow you to operate without any extra memory allocations at all, and without having to bring the whole file into memory.

In a couple of weeks I will push a new API that will be schema driven, which in theory would allow very efficient operation (as long as you can describe your data using a schema).

MarDiehl commented 1 year ago

It is a file of 14GiB that I read into a string to convert it to flow style. Basically, I'm misusing libfyaml as a preprocessor to convert a YAML file with arbitrary syntax into a single-line flow style string in which all references are resolved. Our Fortran parser lacks the capabilities for complex YAML features but happily accepts this flow style string.

As far as I understand, the other interfaces save memory by not reading the string into memory in the first place. But if my understanding is correct, that would save 14GiB out of 450GiB. Or are there other savings that I'm missing?

Below is a file that is exemplary for my use case: The root is a dictionary that contains 2 dictionaries (phase and homogenization) and one list (material). Both dictionaries remain small, but the list can have many entries, 33 million in the case of the 14GiB string.

The schema driven approach sounds good because the layout in material is relatively fixed.

---
homogenization:
  SX:
    N_constituents: 1
    mechanical: {type: pass}

phase:
  Aluminum:
    lattice: cF
    mechanical:
      output: [F, P, F_e, F_p, L_p, O]
      elastic: {type: Hooke, C_11: 106.75e+9, C_12: 60.41e+9, C_44: 28.34e+9}
      plastic:
        type: phenopowerlaw
        N_sl: [12]
        a_sl: 2.25
        atol_xi: 1.0
        dot_gamma_0_sl: 0.001
        h_0_sl-sl: 75.e+6
        h_sl-sl: [1, 1, 1.4, 1.4, 1.4, 1.4, 1.4]
        n_sl: 20
        output: [xi_sl]
        xi_0_sl: [31.e+6]
        xi_inf_sl: [63.e+6]

material:
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [1.0, 0.0, 0.0, 0.0]
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [0.7936696712125002, -0.28765777461664166, -0.3436487135089419, 0.4113964260949434]
        V_e: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [0.3986143167493579, -0.7014883552495493, 0.2154871765709027, 0.5500781677772945]
        V_e: [[0.999, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
  - homogenization: SX
    constituents:
      - phase: Aluminum
        v: 1.0
        O: [0.28645844315788244, -0.022571491243423537, -0.467933059311115, -0.8357456192708106]
        V_e: [[0.999, 0.0, 0.0], [0.0, 0.999, 0.0], [0.0, 1.0, 0.0]]
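
To put the question above in numbers, here is a small back-of-the-envelope sketch. It takes the rounded figures from this thread at face value (14 GiB of input, 450 GiB peak memory, 33 million material entries); the function names are made up for this example.

```c
#include <assert.h>

#define BYTES_PER_GIB 1073741824.0

/* Average number of bytes per list entry for a total given in GiB. */
double bytes_per_entry(double total_gib, double n_entries)
{
    return total_gib * BYTES_PER_GIB / n_entries;
}

/* Fraction of the peak memory that the raw input text accounts for. */
double input_share(double input_gib, double peak_gib)
{
    return input_gib / peak_gib;
}

/* Self-check with the figures quoted in this thread. */
void memory_estimate_demo(void)
{
    double text = bytes_per_entry(14.0, 33e6);   /* roughly 455 bytes of YAML text */
    double mem  = bytes_per_entry(450.0, 33e6);  /* roughly 14.6 kB of process memory */
    assert(text > 450.0 && text < 460.0);
    assert(mem > 14600.0 && mem < 14700.0);
    assert(input_share(14.0, 450.0) < 0.04);     /* the input is about 3% of the peak */
}
```

By this estimate each entry of a few hundred bytes of text costs on the order of 14 kB once loaded, so avoiding the input copy alone recovers only about 3% of the peak; the rest is spent on the in-memory representation.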
MarDiehl commented 1 year ago

I've tested with gcc, icc, and icx with and without debug flags and with two files: Just below and just above 4GiB. The results are the same: The sanitizer of gcc reports a memory issue (gcc debug build). Since it seems that there is enough free memory, I would consider it rather a bug or limitation of the memory sanitizer. The icx debug build terminates with SIGKILL. This could also be a compiler limitation. All other jobs are fine.

Build script and log files are attached:

run_sh.txt 4GiB+.log 4GiB-.log

pantoniou commented 1 year ago

https://github.com/pantoniou/libfyaml/commit/f7493107ee2bec6cca1ca300a3dcffc37ce3cb07

This is a patch that supports streaming alias resolution.

Can you give it a try to resolve one of your files with:

$ fy-tool --dump --streaming --resolve <input.yaml> > <output.yaml>

It should have dramatically smaller memory requirements.

MarDiehl commented 1 year ago

I tried with b045020816676c5148e03babb8350a1ec3c72a8a and a 7.5 GB file. fy-tool --dump --streaming --resolve results in a maximum resident set size of 162 GB in comparison to 311 GB for the non-streaming version (fy-tool --dump --resolve).

The full log is


maws05 ➜  libfyaml_test /usr/bin/time -v fy-tool --dump --streaming --resolve material.yaml > material_out.yaml
    Command being timed: "fy-tool --dump --streaming --resolve material.yaml"
    User time (seconds): 1049.68
    System time (seconds): 54.69
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 18:25.06
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 170267880
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 40746895
    Voluntary context switches: 4
    Involuntary context switches: 107087
    Swaps: 0
    File system inputs: 0
    File system outputs: 18647328
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
maws05 ➜  libfyaml_test /usr/bin/time -v fy-tool --dump --resolve material.yaml > material_out2.yaml 
    Command being timed: "fy-tool --dump --resolve material.yaml"
    User time (seconds): 1266.27
    System time (seconds): 282.59
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 25:50.06
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 326518260
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 159496931
    Voluntary context switches: 5
    Involuntary context switches: 149336
    Swaps: 0
    File system inputs: 0
    File system outputs: 18647328
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
maws05 ➜  libfyaml_test ls -lah
total 26G
drwxr-xr-x  2 m.diehl ma  4.0K Mar 26 18:03 .
drwxr-xr-x 72 m.diehl msu 4.0K Mar 26 18:04 ..
-rw-r--r--  1 m.diehl ma  7.5G Mar  9 02:11 material.yaml
-rw-r--r--  1 m.diehl ma  8.9G Mar 26 16:22 material_out.yaml
-rw-r--r--  1 m.diehl ma  8.9G Mar 26 16:48 material_out2.yaml
MarDiehl commented 1 year ago

a 'quick' comparison to Python (libyaml I assume): 493 GB and a runtime of almost 4h (compared to <20min for streaming):

maws05 ➜  libfyaml_test /usr/bin/time -v ./test.py 
    Command being timed: "./test.py"
    User time (seconds): 13481.90
    System time (seconds): 681.39
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:57:44
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 517126628
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 354181061
    Voluntary context switches: 176889
    Involuntary context switches: 1362762
    Swaps: 0
    File system inputs: 0
    File system outputs: 17377800
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
#!/usr/bin/env python3

import yaml

# load the whole document with the libyaml-backed loader
with open('material.yaml','r') as f:
    x = yaml.load(f, Loader=yaml.CSafeLoader)
with open('material_out3.yaml','w') as f:
    f.write(yaml.dump(x,Dumper=yaml.CSafeDumper))
pantoniou commented 1 year ago

I tried with b045020 and a 7.5 GB file. fy-tool --dump --streaming --resolve results in a maximum resident set size of 162 GB in comparison to 311 GB for the non-streaming version (fy-tool --dump --resolve).

Well, it's much better, but I guess the vmstats are not quite right. What you're seeing is an artifact of how the malloc implementation prefers to allocate memory instead of trying to re-use it. You are never using that much memory.

Can you try using the --null-output option and reporting back?

It should only report the parser overhead then, and not the emitter.

pantoniou commented 1 year ago

Something did not sit right with me, so I ran a more thorough debugging session.

Turns out (some) tokens were not released to the correct recycling list when emitting.

Please try streaming mode with https://github.com/pantoniou/libfyaml/commit/378c84f16018baf63c1675e21f53f24b09c5d082

Things should be dramatically better (as in almost no memory consumption over what the input file requires).

P.S. I do enjoy our back and forth, makes for a better YAML processor.
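
The effect of releasing tokens to the wrong recycling list can be sketched with a toy free list. This is only an illustration of the general pattern, not libfyaml's actual data structures; `struct token`, `struct token_pool` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct token {
    struct token *next;
    int payload;
};

struct token_pool {
    struct token *free_list;  /* released tokens waiting for reuse */
    size_t heap_allocs;       /* how often malloc() was really called */
};

/* Take a token from the pool, falling back to malloc() when it is empty. */
struct token *pool_get(struct token_pool *pool)
{
    struct token *t = pool->free_list;
    if (t) {
        pool->free_list = t->next;
    } else {
        t = malloc(sizeof(*t));
        if (!t)
            return NULL;
        pool->heap_allocs++;
    }
    t->next = NULL;
    t->payload = 0;
    return t;
}

/* Hand a token back so that a later pool_get() can reuse it. */
void pool_put(struct token_pool *pool, struct token *t)
{
    t->next = pool->free_list;
    pool->free_list = t;
}

void pool_destroy(struct token_pool *pool)
{
    while (pool->free_list) {
        struct token *t = pool->free_list;
        pool->free_list = t->next;
        free(t);
    }
}

/* Processes n tokens one at a time and returns the number of real
 * allocations. With release_to_correct_list == 0 every token ends up
 * on a list that pool_get() never looks at. */
size_t count_heap_allocs(size_t n, int release_to_correct_list)
{
    struct token_pool pool = {0}, wrong = {0};
    for (size_t i = 0; i < n; i++) {
        struct token *t = pool_get(&pool);
        if (!t)
            break;
        pool_put(release_to_correct_list ? &pool : &wrong, t);
    }
    size_t allocs = pool.heap_allocs;
    pool_destroy(&pool);
    pool_destroy(&wrong);
    return allocs;
}

void token_pool_demo(void)
{
    assert(count_heap_allocs(1000, 1) == 1);     /* one token, reused */
    assert(count_heap_allocs(1000, 0) == 1000);  /* grows with the input */
}
```

In the correct case memory use stays constant no matter how many tokens pass through; in the faulty case nothing is leaked in the strict sense, but the footprint grows with the size of the input.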

MarDiehl commented 1 year ago

Impressive!

39f774503b4bd84ea595b5bd98e47a35736627b6 gives

maws05 ➜  /tmp /usr/bin/time -v fy-tool --dump --streaming --resolve material.yaml > material_out.yaml   
        Command being timed: "fy-tool --dump --streaming --resolve material.yaml"
        User time (seconds): 776.26
        System time (seconds): 16.19
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 13:14.20
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 7760872
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 160
        Minor (reclaiming a frame) page faults: 181394
        Voluntary context switches: 304
        Involuntary context switches: 77422
        Swaps: 0
        File system inputs: 15357568
        File system outputs: 18647344
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
maws05 ➜  /tmp ls -lah materi*
-rw-r--r-- 1 m.diehl ma 7.5G Apr  6 01:03 material.yaml
-rw-r--r-- 1 m.diehl ma 8.9G Apr 13 20:12 material_out.yaml

which means 7.4 GB for a 7.5 GB file.
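
For reference, the "Maximum resident set size (kbytes)" values quoted in this thread can be converted as in the following sketch. It assumes that the "kbytes" printed by /usr/bin/time -v are units of 1024 bytes, as is the case for the resident set size on Linux; `kib_to_gib` is a made-up helper name.

```c
#include <assert.h>

/* Convert a size in KiB (1024 bytes) to GiB. */
double kib_to_gib(unsigned long long kib)
{
    return (double)kib / (1024.0 * 1024.0);
}

/* Self-check against the peak values reported in this thread. */
void kib_to_gib_demo(void)
{
    double streaming_fixed = kib_to_gib(7760872ULL);    /* streaming, after the recycling fix */
    double streaming_first = kib_to_gib(170267880ULL);  /* streaming, first attempt */
    double non_streaming   = kib_to_gib(326518260ULL);  /* fy-tool --dump --resolve */
    double python_libyaml  = kib_to_gib(517126628ULL);  /* test.py */

    assert(streaming_fixed > 7.40  && streaming_fixed < 7.41);
    assert(streaming_first > 162.3 && streaming_first < 162.5);
    assert(non_streaming   > 311.3 && non_streaming   < 311.5);
    assert(python_libyaml  > 493.1 && python_libyaml  < 493.3);
}
```

Under that assumption the figures quoted as "GB" above are, strictly speaking, GiB.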