pantoniou / libfyaml

Fully feature complete YAML parser and emitter, supporting the latest YAML spec and passing the full YAML testsuite.
MIT License

Lifetime of `fy_token_get_text` return value #106

Closed hoehrmann closed 1 month ago

hoehrmann commented 3 months ago

https://pantoniou.github.io/libfyaml/libfyaml.html#fy-token-get-text-description says »That means that the pointer is not guaranteed to be valid after the parser is destroyed.«. Shouldn't this refer to the lifetime of the token rather than the parser? As I understand the documentation, I am not supposed to free the return value, but if the pointer is valid so long as the parser is alive, that would mean, in the worst case, that pretty much the entire input is kept in memory, even when using the event API.

pantoniou commented 3 months ago

Yes, this is correct.

By design, libfyaml avoids allocating memory by pointing directly back into the input stream, so the parser object has to be kept around for the lifetime of the 'tokens'.

And yes, with the event API the returned objects are, by default, valid for as long as the parser object is around.

The tokens, however, are reference counted (and, through them, so is the input stream). If you take a reference to a token, you keep the input alive for as long as that reference is held. This is not a publicly exported interface (on purpose) because it is incredibly error-prone: keep a single token reference and all the input stays in memory forever.

FWIW, if you want to keep token contents around after the parser is destroyed, just malloc (or use a linear allocator) and memcpy the contents in the event API loop.
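A minimal sketch of the malloc-and-memcpy approach described above. `copy_text` is a hypothetical helper, not part of libfyaml; it only assumes you have a pointer and a length, which is what `fy_token_get_text()` hands back. It appends a NUL terminator because, as far as I can tell from the documentation, the returned text is not guaranteed to be NUL-terminated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a libfyaml function): duplicate `len` bytes
 * of `text` into freshly malloc'ed storage and append a NUL terminator.
 * Returns NULL on allocation failure. The caller owns the result and
 * releases it with free(). */
static char *copy_text(const char *text, size_t len)
{
    char *copy;

    /* A NULL pointer is only acceptable for an empty text. */
    assert(text != NULL || len == 0);

    /* Guard against overflow in len + 1. */
    if (len == SIZE_MAX)
        return NULL;

    copy = malloc(len + 1);
    if (!copy)
        return NULL;

    if (len > 0)
        memcpy(copy, text, len);
    copy[len] = '\0';

    return copy;
}
```

In the event loop you would call something like `copy_text(text, len)` on the pointer and length obtained from `fy_token_get_text()` before handing the event back with `fy_parser_event_free()`. The copy is then independent of both the token and the parser.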

hoehrmann commented 3 months ago

@pantoniou Thanks for the quick response. Well, I want to process YAML documents that do not fit into memory, so I want to free everything as quickly as possible. The quoted part of the documentation makes me doubt whether fy_parser_event_free is enough to release the allocation that might be caused by a call to fy_token_get_text on a token referenced by the event. If that is enough, then I think the documentation is confusing; if it is not enough, then I am unclear on what needs to be done on top of that.

pantoniou commented 3 months ago

fy_parser_event_free() will free everything that is contained in the event, including its tokens. If a token has been operated upon via fy_token_get_text() and has allocated storage, that storage will be freed too.

The source input which the parser operates on will stay around until the destruction of the parser.

I also want to clarify something: when I say the source input is available for the duration of the parser's lifecycle, it does not necessarily mean that the input has been read into allocated memory.

For regular files, this means that the file is mmap'ed, so you won't run out of memory.
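To illustrate what a file-backed mapping looks like in general, here is a small POSIX sketch. It is not libfyaml's actual code, and `count_newlines_mmap` is a hypothetical name. It scans a regular file through a read-only `mmap()` without ever copying the contents into heap memory; the kernel pages the data in on demand and can drop those pages again under memory pressure.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a regular file through a
 * read-only, file-backed mapping. Returns -1 on error. */
static long count_newlines_mmap(const char *path)
{
    struct stat st;
    const char *data;
    long count = 0;
    off_t i;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }

    /* mmap() rejects zero-length mappings, so handle empty files here. */
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (data == MAP_FAILED)
        return -1;

    for (i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            count++;

    munmap((void *)data, (size_t)st.st_size);
    return count;
}
```

Pointers into `data` play the same role as token text that points back into the input: they cost no extra allocation, but they are only valid until the mapping is torn down.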

Would you care to share some details about your use case? I.e., how large the file is, and what kind of operations you're performing on the data.

Unfortunately, the standard libfyaml document model (which is sort of an evolution of the libyaml document model) is not cutting it nowadays for large files. I am in the middle of a big rewrite that addresses very large file storage and manipulation, but it's not ready. Having some details about what you're trying to do might help with some decisions I have to make shortly.