matloff / partools

Tools to aid coding in the R 'parallel' package.

Idea: Store metadata with files #8

Open clarkfitzg opened 7 years ago

clarkfitzg commented 7 years ago

It might be nice if filesave stored metadata along with each data file describing column types, number of rows, presence of NA's, etc. This information could potentially be used for more efficient reads.

matloff commented 7 years ago

Nice idea. From the point of view of Software Alchemy, it would be very important to record whether the file records can be considered randomly arranged vs. sorted; in the latter case, on what field(s)? But I have reservations.

Where would the metadata be stored? Presumably in the file itself, in which case we would need a sentinel to indicate where the real data begins. That in turn becomes a problem if the real data includes the sentinel! There are ways around this, e.g. having 2 sentinels stand for 1 real one, but these are things to contend with.

Second, in typical cases the data to be written by filesave() comes from an original large file, without the metadata. There would now be an inconsistency, e.g. not being able to simply cat the individual files together to get the original one.

How about this alternate approach? We add an option to filesave() that wraps R's save(), with the metadata stored as attributes on the data frame before it is fed into save(). We'd add a corresponding option to fileread().
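A minimal sketch of that idea, assuming the attribute-based approach described above; the function names, argument names, and metadata fields here are illustrative only, not part of the current partools API:

```r
# Hypothetical sketch: record metadata as attributes, then use save().
filesaveRData <- function(d, fname, sorted_by = NULL) {
  attr(d, "nrows") <- nrow(d)
  attr(d, "colClasses") <- sapply(d, class)
  attr(d, "NA_exist") <- anyNA(d)
  attr(d, "sorted_by") <- sorted_by  # NULL means "treat as randomly arranged"
  save(d, file = fname)
}

filereadRData <- function(fname) {
  load(fname)  # load() restores the object, attributes included
  d
}

# usage
d <- data.frame(height = c(1.7, 1.8), age = c(30L, 40L))
filesaveRData(d, "x1.RData")
attributes(filereadRData("x1.RData"))$NA_exist  # FALSE
```

Since the attributes live inside the .RData serialization rather than in the data region of a text file, no sentinel is needed.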

Norm


clarkfitzg commented 7 years ago

Where would the metadata be stored?

I was thinking more of a separate metadata file, named after the data file. Something like the following, which could be called x.json:

{
    "nchunks": 17,
    "random_order": true,
    "nrows": 175,
    "colNames": ["height", "age"],
    "colClasses": ["numeric", "integer"],
    "NA_exist": false,
    "format": "text",
    "delimiter": "|",
    "chunks": {
        "1": {
            "nrows": 10,
            "filename": "x1.txt"
        },
        "2": {
            "nrows": 10,
            "filename": "x2.txt"
        },
        ....
        "17": {
            "nrows": 5,
            "filename": "x17.txt"
        }
    }
}

This leaves you free to cat the files back together.

Using save() also sounds like a good idea, since it avoids having to parse the text twice. Then one would have several chunks in .RData files. Going beyond that, you could even let the user choose the serialization format for the chunks, e.g. feather.

With any of these I think that having metadata stored separately as above would be useful, since we can very cheaply read the metadata and get some notion of how to efficiently perform the computation.
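A sketch of how a reader could exploit such a sidecar file, assuming the x.json layout shown above; jsonlite is one possible JSON parser, and read_chunk() is a hypothetical helper, not an existing partools function:

```r
# Hypothetical sketch: use the sidecar metadata to speed up reading one chunk.
library(jsonlite)

read_chunk <- function(metafile, chunk_id) {
  meta <- fromJSON(metafile)
  chunk <- meta$chunks[[as.character(chunk_id)]]
  # Passing colClasses and nrows lets read.table skip type inference
  # and preallocate storage, which is where the efficiency gain comes from.
  read.table(chunk$filename,
             sep = meta$delimiter,
             col.names = meta$colNames,
             colClasses = meta$colClasses,
             nrows = chunk$nrows)
}
```

Reading the small JSON file is cheap relative to scanning the chunks themselves, so a scheduler could inspect sizes and types before deciding how to distribute the work.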

clarkfitzg commented 7 years ago

Long term I'm thinking about making computations lazy, so that one can analyze the R code together with the data sizes and come up with a potentially more efficient execution. This is along the lines of other systems like Spark and dask, which don't do anything until one calls compute().

But this is more ambitious, and could even be a different project that uses partools as a dependency.

matloff commented 7 years ago

That would be fine. We could have just one file for that, by the way, non-distributed.

I got an error message from Travis, no time to look into it now.

Norm


clarkfitzg commented 7 years ago

We could have just one file for that, by the way, non-distributed.

Yes, that's what I had in mind also.

The error message from Travis is because it doesn't pass R CMD check. I'll look at it now.