Introspection in Fortran for generic file I/O libraries (TOML, JSON, NPZ, etc.)

certik commented 6 months ago

Originally discussed at

The idea is to use the simple and compiler-enforced syntax of namelist (or equivalent), but the compiler would call a user library that implements other formats, such as TOML, JSON, or custom binary array formats (say npy/npz, GGUF, safetensors, etc.).

The exact design of how this would work is still to be figured out.

As an example of how Rust approaches this problem: the toml library there allows you to just create a struct and decorate it:

#[derive(Deserialize)]
struct Config {
   ip: String,
   port: Option<u16>,
   keys: Keys,
}

Then call it like this:

let config: Config = toml::from_str(r#"
   ip = '127.0.0.1'

   [keys]
   github = 'xxxxxxxxxxxxxxxxx'
   travis = 'yyyyyyyyyyyyyyyyy'
"#).unwrap();

and it will just work.

A similar feature in Fortran might look like:

type(toml_file) :: toml 
type(mytype) :: t
call toml%load(t, 'file.toml')
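For comparison, here is a minimal sketch of what a loader has to look like today without introspection. It assumes a flat `key = value` subset of TOML with no tables, arrays or escapes, and the names `config_t`, `load_config` and `strip_quotes` are made up for illustration. The point is the `select case` block, where every component has to be listed by hand.

```fortran
module config_loader_m
   implicit none
   private
   public :: config_t, load_config

   ! Hypothetical example type; not part of any existing library.
   type :: config_t
      character(:), allocatable :: ip
      integer :: port = 0
   end type

contains

   ! Fill cfg from lines of the form "key = value".
   subroutine load_config(cfg, lines)
      type(config_t), intent(inout) :: cfg
      character(*), intent(in) :: lines(:)
      character(:), allocatable :: key, val
      integer :: i, eq, ios

      do i = 1, size(lines)
         eq = index(lines(i), '=')
         if (eq == 0) cycle                 ! skip lines without a key
         key = trim(adjustl(lines(i)(:eq-1)))
         val = trim(adjustl(lines(i)(eq+1:)))
         ! Every component is named by hand here; this is the
         ! boilerplate that introspection could generate.
         select case (key)
         case ('ip')
            cfg%ip = strip_quotes(val)
         case ('port')
            read(val, *, iostat=ios) cfg%port
         end select
      end do
   end subroutine

   ! Remove one pair of surrounding single quotes, if present.
   pure function strip_quotes(s) result(r)
      character(*), intent(in) :: s
      character(:), allocatable :: r
      r = s
      if (len(s) >= 2) then
         if (s(1:1) == "'" .and. s(len(s):len(s)) == "'") r = s(2:len(s)-1)
      end if
   end function

end module

program demo
   use config_loader_m
   implicit none
   type(config_t) :: cfg
   character(len=32) :: text(2)

   text(1) = "ip = '127.0.0.1'"
   text(2) = "port = 8080"
   call load_config(cfg, text)
   print '(a,1x,i0)', cfg%ip, cfg%port
end program
```

Adding a component to `config_t` means adding another `case` branch, which is exactly the step the Rust `derive` attribute removes.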

Or using Fortran's namelist-like syntax:

namelist / t / A, B, C
open(newunit=u, file="file.toml", status="old", custom_reader=toml_file)
read(u, t)

And this allows you to implement a user derived type toml_file that implements all the necessary capability to read a custom format, and then the line read(u, t) makes the compiler call your function/type bound procedures to actually handle the read.
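The closest thing standard Fortran offers today is defined derived-type I/O (Fortran 2003), where a `write` or `read` on an object is dispatched by the compiler to a type-bound procedure. The sketch below shows only that dispatch, with a hypothetical `point_t` and a hand-written JSON-like output; it does not implement the proposed `custom_reader=` specifier, which does not exist. Note that the component names still have to be spelled out inside the procedure.

```fortran
module json_point_m
   implicit none
   private
   public :: point_t

   ! Hypothetical example type.
   type :: point_t
      integer :: x = 0
      integer :: y = 0
   contains
      procedure :: write_json
      generic :: write(formatted) => write_json
   end type

contains

   ! Defined output procedure; the argument list is fixed by the standard.
   subroutine write_json(dtv, unit, iotype, v_list, iostat, iomsg)
      class(point_t), intent(in)    :: dtv
      integer,        intent(in)    :: unit
      character(*),   intent(in)    :: iotype   ! e.g. 'DT', 'LISTDIRECTED', 'NAMELIST'
      integer,        intent(in)    :: v_list(:)
      integer,        intent(out)   :: iostat
      character(*),   intent(inout) :: iomsg

      ! The components are listed by hand; no introspection is involved.
      write(unit, '(a,i0,a,i0,a)', iostat=iostat, iomsg=iomsg) &
         '{"x": ', dtv%x, ', "y": ', dtv%y, '}'
   end subroutine

end module

program demo
   use json_point_m
   implicit none
   type(point_t) :: p

   p%x = 1
   p%y = 2
   ! The DT edit descriptor makes the compiler call write_json.
   write(*, '(dt)') p
end program
```

The difference from the proposal is where the format lives: here it is tied to the type, whereas a `custom_reader` would tie it to the unit, so one type could be read from several formats.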

davidpfister commented 6 months ago

It would be a nice feature, especially to really make use of UDDTIO and point to various serializers. C# and VB have built-in reflection, so this comes naturally. While in these languages you can also use reflection to call methods in compiled libraries that you don't own, serialization/deserialization is definitely the most useful application. For Fortran, I gave it some thought recently, i.e. how to mimic reflection (or at least introspection) in Fortran. I investigated different solutions involving a lot of c_loc, c_f_pointer, storage_size and transfer. I was even ready to somehow compile the ASR generated with LFortran into derived types and dynamically create dictionaries with component names as keys and pointers to components as values. But in the end I faced the problem that components in derived types can be reordered in memory by the compiler. As such, the approach would be restricted to derived types declared with the sequence attribute. Preprocessing the derived types to generate this dictionary would also be possible:

type(dict) :: mytype_dict
type mytype
    integer:: A, B, C
end type
...
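subroutine generate_dict(this)
   class(mytype), target :: this

   ! dict and its set procedure are hypothetical; each entry maps a
   ! component name to a pointer to that component
   call mytype_dict%set('A', this%A)
   call mytype_dict%set('B', this%B)
   call mytype_dict%set('C', this%C)
end subroutine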

So if there is a proposal to do it intrinsically, I would support it.
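As a rough illustration of the dictionary idea, the sketch below replaces the dictionary with a plain array of name/pointer entries and uses unlimited polymorphic pointers, so it does not depend on the memory layout of the type. The names `field_t`, `register_fields` and `set_integer` are assumptions made for this example, and the table is still written by hand; a preprocessor or the compiler would have to generate it.

```fortran
module field_registry_m
   implicit none
   private
   public :: field_t, mytype, register_fields, set_integer

   ! Hypothetical name -> component entry; not an existing library type.
   type :: field_t
      character(:), allocatable :: name
      class(*), pointer :: ref => null()
   end type

   type :: mytype
      integer :: A = 0, B = 0, C = 0
   end type

contains

   ! Hand-written table that a compiler or preprocessor could generate.
   subroutine register_fields(this, fields)
      type(mytype), intent(inout), target :: this
      type(field_t), allocatable, intent(out) :: fields(:)

      allocate(fields(3))
      fields(1)%name = 'A';  fields(1)%ref => this%A
      fields(2)%name = 'B';  fields(2)%ref => this%B
      fields(3)%name = 'C';  fields(3)%ref => this%C
   end subroutine

   ! Assign an integer component by name; ok stays false if nothing matches.
   subroutine set_integer(fields, name, value, ok)
      type(field_t), intent(inout) :: fields(:)
      character(*), intent(in) :: name
      integer, intent(in) :: value
      logical, intent(out) :: ok
      integer :: i

      ok = .false.
      do i = 1, size(fields)
         if (fields(i)%name /= name) cycle
         select type (p => fields(i)%ref)
         type is (integer)
            p = value
            ok = .true.
         end select
      end do
   end subroutine

end module

program demo
   use field_registry_m
   implicit none
   type(mytype), target :: t      ! target, so the stored pointers stay valid
   type(field_t), allocatable :: fields(:)
   logical :: ok

   call register_fields(t, fields)
   call set_integer(fields, 'B', 42, ok)
   print '(l1,3(1x,i0))', ok, t%A, t%B, t%C
end program
```

A reader for TOML or JSON could then look up each parsed key in `fields` instead of hard-coding the component names, at the cost of a `select type` branch per supported component type.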

certik commented 6 months ago

@davidpfister if the compiler is free to reorder the members of a derived type in memory, then this has to be only allowed for the restricted subset with an attribute, as you said. Thanks for playing with ASR and LFortran. I think reflection can be done in Fortran cleanly, all at compile time (so no runtime overhead). I think it could be very powerful and useful, if we can design it well.

davidpfister commented 6 months ago

I must admit that I never used namelist before yesterday. I played around and it seems that a lot can already be done with the current capabilities of the language. Here is what I came up with, using a bit of preprocessing to mimic generics:

module point_m
   enum, bind(C)
      enumerator :: RED
      enumerator :: BLUE
      enumerator :: GREEN  
   end enum

   type, abstract :: object
   end type

   type :: coord_t
      real :: x = 0.0
      real :: y = 0.0
   end type

   type, extends(object) :: point_t
      type(coord_t) :: coord
      integer :: color = RED
   contains
      procedure, pass(this), public :: serialize => serialize_t
      procedure, nopass, public :: deserialize => deserialize_t
   end type

   contains

#define T point_t  
#include <serializable.txt>
#undef T

end module

In the include file you get

subroutine serialize_t(this, str)
    class(T), intent(in), target   :: this
    character(:), allocatable, intent(out) :: str
    !private
    type(T), pointer :: obj => null()
    namelist / ser / obj
    allocate(character(100) :: str)

    obj => this

    write(str, nml=ser)

    str = trim(str)
    nullify(obj)
end subroutine

subroutine deserialize_t(that, str)
    type(T), allocatable, intent(out)   :: that
    character(*), intent(in) :: str
    !private
    type(T) :: obj
    namelist / ser / obj

    read(str, nml=ser)
    allocate(that, source=obj)
end subroutine

and the main program ends up being

program main
   use point_m

   type(point_t), allocatable :: point
   character(:), allocatable :: stream_data

   allocate(point)

   point%coord%x = 1.0d0
   point%coord%y = 2.0d0
   point%color = 1

   call point%serialize(stream_data)
   write(*,*) stream_data
   point%coord%x = 0.0d0
   point%coord%y = 0.0d0
   point%color = 0

   call point%deserialize(point, stream_data)
   write(*,*) point
end program

output

 &SER OBJ%COORD%X=   1.000000    ,OBJ%COORD%Y=   2.000000    ,OBJ%COLOR=
    1/

    1.000000       2.000000               1

So formatting to various output formats would mean parsing the namelist stream_data (splitting on ',' and '%') and adding <>, {} or whatever format-specific characters are needed. One can easily add a procedure argument to the serialize/deserialize procedures that would transform/back-transform the string content. From what I see, something pretty neat could be obtained with generics by simply extending the derived type from a generic serializable_t that would contain the serialize/deserialize procedures, rather than using preprocessing.
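A naive version of that splitting step might look like the sketch below. It assumes the whole namelist group is available as a single string and contains only scalar numeric values: no character values with embedded commas, no arrays, no repeat counts and no line wrapping, all of which a real namelist parser has to handle. The names `strip_group` and `pop_pair` are made up for this example.

```fortran
module nml_split_m
   implicit none
   private
   public :: strip_group, pop_pair

contains

   ! Return the text between "&NAME " and the closing "/".
   function strip_group(nml) result(body)
      character(*), intent(in) :: nml
      character(:), allocatable :: body, s
      integer :: first, last

      s = trim(adjustl(nml))
      first = index(s, ' ') + 1
      last  = index(s, '/', back=.true.) - 1
      body  = s(first:last)
   end function

   ! Remove the first "key=value" item from body and return its parts.
   subroutine pop_pair(body, key, val)
      character(:), allocatable, intent(inout) :: body
      character(:), allocatable, intent(out)   :: key, val
      character(:), allocatable :: item
      integer :: comma, eq, i

      comma = index(body, ',')
      if (comma == 0) then
         item = body
         body = ''
      else
         item = body(:comma-1)
         body = body(comma+1:)
      end if
      eq  = index(item, '=')
      key = trim(adjustl(item(:eq-1)))
      val = trim(adjustl(item(eq+1:)))
      ! Map the namelist component separator "%" to a dotted path.
      do i = 1, len(key)
         if (key(i:i) == '%') key(i:i) = '.'
      end do
   end subroutine

end module

program demo
   use nml_split_m
   implicit none
   character(:), allocatable :: body, key, val

   body = strip_group('&SER OBJ%COORD%X=   1.000000    ,OBJ%COORD%Y=   2.000000    ,OBJ%COLOR=1/')
   do while (len(body) > 0)
      call pop_pair(body, key, val)
      print '(a," = ",a)', key, val
   end do
end program
```

Each resulting key/value pair could then be handed to a format-specific writer. The exact text a compiler emits for a namelist is processor dependent, so the splitting rules above are tied to the output shown earlier.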

That just gave me some cool ideas for a side project 😄

certik commented 6 months ago

@davidpfister this seems to implement custom serialization for any user type, but the format on disk is a namelist format. That's one part of the problem. The other part is to have custom binary formats on disk as well.

jacobwilliams commented 6 months ago

so formatting to various output format would mean parsing the namelist stream_data

You have no idea what a can of worms it is to do that! :)

davidpfister commented 6 months ago

Actually I implemented something similar (to some extent) in C# not so long ago to flatten dictionaries and output them to various backends (JSON, XML, SQLite). The big difference is that the .NET environment comes with a huge toolbox to create tokenizers and lexers. But I agree, doing it in Fortran for the namelist format is certainly a hell of a job. I did not even start adding pointers, allocatables and complex inheritance. On top of this, depending on the desired backend, some characters need to be escaped (', &, >, < in XML for instance). So, if the namelist format parser would be a can of worms, what should we say about the 'text formatter' for the different backends? 😄 And this is a lot easier since, @jacobwilliams, you did it already (at least for a subset), right? But if I were to start a project on this topic I would certainly create a parser for the namelist (something similar to f90nml in Python) and then output the dictionary to different formats. Looks like fun!

davidpfister commented 6 months ago

Well, I got my answer. My approach would not work as soon as you have allocatable or pointer components. @certik, I am afraid that without adding support for allocatables and pointers to namelist (or to any kind of read/write, since the same limitation applies to unformatted I/O), that functionality would have a very limited scope. But this is probably material for another proposal.
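For reference, the standard does let a type with allocatable or pointer components take part in I/O, but only if the type supplies its own defined I/O procedure, so the component handling is again written by hand. A minimal sketch, assuming a hypothetical `series_t` with one allocatable integer array and a JSON-like output chosen only for illustration:

```fortran
module series_m
   implicit none
   private
   public :: series_t

   ! Hypothetical type with an allocatable component.
   type :: series_t
      integer, allocatable :: values(:)
   contains
      procedure :: write_series
      generic :: write(formatted) => write_series
   end type

contains

   ! Defined output: required because of the allocatable component.
   subroutine write_series(dtv, unit, iotype, v_list, iostat, iomsg)
      class(series_t), intent(in)    :: dtv
      integer,         intent(in)    :: unit
      character(*),    intent(in)    :: iotype
      integer,         intent(in)    :: v_list(:)
      integer,         intent(out)   :: iostat
      character(*),    intent(inout) :: iomsg

      if (allocated(dtv%values)) then
         ! Comma-separated elements; the colon stops before a trailing ", ".
         write(unit, '(a,*(i0,:,", "))', iostat=iostat, iomsg=iomsg) '[', dtv%values
         if (iostat == 0) write(unit, '(a)', iostat=iostat, iomsg=iomsg) ']'
      else
         write(unit, '(a)', iostat=iostat, iomsg=iomsg) 'null'
      end if
   end subroutine

end module

program demo
   use series_m
   implicit none
   type(series_t) :: s

   s%values = [1, 2, 3]
   write(*, '(dt)') s
end program
```

This covers one type at a time; a generic mechanism would still need the compiler to expose the component list, which is the subject of this proposal.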

certik commented 3 months ago

j3-fortran / fortran_proposals

Introspection in Fortran for generic file I/O libraries (TOML, JSON, NPZ, etc.) #331