glotzerlab / signac

Manage large and heterogeneous data spaces on the file system.
https://signac.io/
BSD 3-Clause "New" or "Revised" License

Collection files and JSON #104

Closed csadorf closed 5 years ago

csadorf commented 5 years ago

Original report by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


Currently collections don't output valid JSON files, just "very close to valid JSON files." Collections use a text file of JSON dictionaries separated by newlines, while a valid JSON file would only need about len(collection)+2 more bytes for square brackets and commas.

I'm not sure why this design decision was made, but it seems that there would be a clear advantage to using a valid JSON file (so long as there is no serious performance penalty).

It would be nice if there were a better way to use Collections with a JSON backend. Currently I have a workaround to save a JSON file: json.dump([i for i in collection.find()], open('valid.json', 'w'))

csadorf commented 5 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


Yes, that is a deliberate design decision.

  1. The JSON format is inefficient for very large files, because it is impossible to parse only sections of the file using standard parsers. That means we are forced to load the whole file into memory when parsing, which can be problematic for very large files.
  2. Conceptually, a collection is a collection of JSON files, not one big JSON file.
  3. A newline-delimited concatenation of JSON documents is the standard dump format of mongoexport. That means it is possible to import a collection file directly into a MongoDB database collection.
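The streaming advantage described in point 1 can be sketched with the standard library (a minimal illustration, not signac's actual parser; the filename and record contents are hypothetical):

```python
import json

# Write records as newline-delimited JSON (one document per line),
# mirroring the mongoexport dump format described above.
records = [{"_id": 1, "a": 10}, {"_id": 2, "a": 20}]
with open("collection.txt", "w") as f:
    for doc in records:
        f.write(json.dumps(doc) + "\n")

def iter_docs(path):
    # Read back lazily, one document at a time, without ever
    # loading the whole file into memory.
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

total = sum(doc["a"] for doc in iter_docs("collection.txt"))  # 30
```

A single valid JSON array, by contrast, must be parsed in full before any document is available, which is the memory concern raised above.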
csadorf commented 5 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


How about we add a Collection.to_json() method to mitigate this?

csadorf commented 5 years ago

Original comment by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


I think that Collection.to_json() is a good idea. Should it return a string, or should it write to a provided filename or file_object argument? It should probably have a corresponding from_json method that works sort of like open.

Thanks for explaining that design decision. It makes more sense now.

csadorf commented 5 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


The Collection.to_json() method should return a string. The corresponding Collection.from_json() should be a classmethod, accept a string argument, and return an instance of Collection.

I don't think providing a write() method or similar that writes directly to a file object makes much sense, because we will need to create the string anyway. It is one of the weaknesses of the JSON format that we can't really stream it well.
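The string-based API described here could look roughly like the following (a minimal sketch; this Collection class is a hypothetical stand-in for signac's, used only to show the method signatures under discussion):

```python
import json

class Collection:
    # Hypothetical minimal stand-in for signac's Collection,
    # holding an in-memory list of JSON-serializable documents.
    def __init__(self, docs):
        self._docs = list(docs)

    def find(self):
        return iter(self._docs)

    def to_json(self):
        # Serialize all documents into one valid JSON array string.
        return json.dumps(list(self.find()))

    @classmethod
    def from_json(cls, string):
        # Parse a JSON array string and return a new Collection.
        return cls(json.loads(string))

c = Collection([{"a": 1}, {"b": 2}])
s = c.to_json()
```

A round trip then recovers the original documents: Collection.from_json(s) yields the same two dictionaries.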

csadorf commented 5 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


I've reduced the priority of this enhancement proposal.

csadorf commented 5 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


@bdice Do you want to take this on or do you want me to find somebody else to implement this?

csadorf commented 5 years ago

Original comment by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


@csadorf I may take it on soonish, for my needs in the photonics project. I would strongly prefer the ability to load/save directly to/from files, mirroring the style of pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
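The pandas convention being proposed here (return a string when no target is given, otherwise write to a path or file object) can be sketched with the standard library; dump_json is a hypothetical helper illustrating the dispatch pattern, not pandas itself:

```python
import io
import json

def dump_json(records, path_or_buf=None):
    # Mirror pandas' DataFrame.to_json convention:
    # no target   -> return the JSON string,
    # str target  -> treat it as a file path,
    # otherwise   -> treat it as a writable file object.
    s = json.dumps(records)
    if path_or_buf is None:
        return s
    if isinstance(path_or_buf, str):
        with open(path_or_buf, "w") as f:
            f.write(s)
    else:
        path_or_buf.write(s)

buf = io.StringIO()
dump_json([{"a": 1}], buf)       # writes to the buffer
text = dump_json([{"a": 1}])     # returns the string instead
```

This keeps a single entry point for all three use cases, which is what makes the pandas I/O API convenient to mirror.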

csadorf commented 5 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


Ok, with that precedent I agree. Let's mirror the DataFrame API. Please feel free to assign this issue to yourself once you want to work on it.

csadorf commented 5 years ago

There has been some debate on how exactly to implement this. The latest decision was to replicate the pandas.DataFrame API with respect to JSON export and import as closely as possible. I think that's a good idea, but please feel free to suggest a deviation if you think you have a compelling reason.

vyasr commented 5 years ago

@bdice are you still interested in taking this on?

xadams commented 5 years ago

Sorry for the delay, I have been busy with prelim prep and wrapping things up before Heather's departure. I am planning to spend this Thursday working on it, but will hand it over if the need is more urgent.


bdice commented 5 years ago

@xadams That would be awesome - I did some reading on the pandas DataFrame API and this is what we would shoot for (this is approximate / pseudocode):

import six

from .core import json

# In the Collection class, add these methods.

def to_json(self, file=None):
    """Return a JSON string, or write it to a file path or file object."""
    json_string = json.dumps(list(self.find()))
    if file is None:
        # No target given: return the JSON string directly.
        return json_string
    if isinstance(file, six.string_types):
        # A string target is treated as a file path.
        with open(file, 'w') as json_file:
            json_file.write(json_string)
    else:
        # Otherwise assume a writable file object.
        file.write(json_string)

@classmethod
def read_json(cls, file=None):
    """Construct a Collection from a file path or file object."""
    if isinstance(file, six.string_types):
        # A string target is treated as a file path.
        with open(file, 'r') as json_file:
            json_data = json.load(json_file)
    else:
        json_data = json.load(file)
    # Instantiate and return a Collection from the parsed documents.
    return cls(json_data)
csadorf commented 5 years ago

Resolved with merge of PR #200.