markgw / pimlico

The Pimlico Processing Toolkit
http://pimlico.readthedocs.org/
GNU Lesser General Public License v3.0

Redesign datatype functionality #1

Closed markgw closed 6 years ago

markgw commented 7 years ago

See below for the full plan for this enhancement. It incorporates all relevant information from earlier comments.


Major long-term enhancement to restructure how datasets are handled.

Currently, a datatype is defined as a subclass of PimlicoDatatype. It is instantiated within a pipeline and the resulting object represents a particular dataset, providing the necessary methods to read the data, iterate over a corpus, read metadata, etc.

Using types (classes) to represent datatypes comes with a load of problems. It is, for example, difficult to create slight variants on a datatype, providing slightly different reading functionality. This is solved currently by dynamically creating subclasses, but this is limited and unsatisfactory.

Really a datatype should be instantiated, allowing instances to have slightly different behaviour where necessary. Then the behaviour can also be modified at the datatype level (rather than dataset) by modifying this object.

Then, having a separate type for datasets (distinct from datatypes) would be a lot easier to understand. The datatype would still define the dataset's behaviour, which would be made available on the dataset object by giving it all the methods, etc. that the datatype sets out for it. We could even maintain very similar syntax to the current system by giving datatype instances a __call__ method that creates a dataset instance.

markgw commented 7 years ago

I have now thought this through more and am considering the following approach.

Datatypes are represented by classes which, as now, extend the base PimlicoDatatype. Unlike now, these can be instantiated, potentially with options passed in as kwargs which affect the datatype, and they are used in this form (i.e. datatype instance, not class) for typechecking, etc.

Typechecking is done in almost exactly the same way as currently. The default behaviour is to use the type hierarchy of the datatype classes for inheritance. We will still have the current ways of overriding this, e.g. specifying special datatype requirements that can impose other typechecking constraints, or specifying an explicit datatype to use for typechecking that is different from the datatype's class.
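
As a rough sketch, instance-based typechecking with the class hierarchy as the default might look like this (check_type and the corpus classes are illustrative, not final API):

class PimlicoDatatype:
    def check_type(self, supplied_type):
        # Default behaviour: fall back on the class hierarchy. The supplied
        # datatype instance matches if its class is this class or a subclass.
        return isinstance(supplied_type, type(self))


class TextCorpus(PimlicoDatatype):
    pass


class TokenizedCorpus(TextCorpus):
    pass


# A requirement of TextCorpus() is satisfied by a TokenizedCorpus instance,
# because the datatype classes still form the type hierarchy
required = TextCorpus()
supplied = TokenizedCorpus()
assert required.check_type(supplied)
assert not TokenizedCorpus().check_type(TextCorpus())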

Each datatype class will have a nested class Reader. This is where all reading behaviour is defined and it is this class that will be used for the dataset, or reader (corresponding to the instantiated datatype class in the old system).

The base PimlicoDatatype defines a Reader, which is the base class for all readers:

class PimlicoDatatype:
    class Reader:
        def __init__(self, base_dir, datatype):
            ...  # etc.

Other datatypes' Readers do not need to explicitly override this Reader class: this will be handled by Pimlico internally (somewhat like Django Meta classes). A Reader's parent class will always be the Reader of the datatype's parent class.

class MyDatatype(PimlicoDatatype):
    class Reader:
        ...  # reading functionality
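
One way this internal wiring could work, spelled out as a self-contained sketch (build_reader_cls is a hypothetical helper, not the real mechanism):

class PimlicoDatatype:
    class Reader:
        def __init__(self, base_dir, datatype):
            self.base_dir = base_dir
            self.datatype = datatype


def build_reader_cls(datatype_cls):
    # Collect the Reader classes explicitly defined on each datatype class
    # in the hierarchy (skipping classes that just inherit their parent's)
    readers = [
        cls.__dict__["Reader"] for cls in datatype_cls.__mro__
        if "Reader" in cls.__dict__
    ]
    # Build a concrete reader class whose bases mirror the datatype
    # hierarchy, so MyDatatype's Reader ends up behaving as a subclass of
    # PimlicoDatatype's Reader without declaring that explicitly
    return type(datatype_cls.__name__ + "Reader", tuple(readers), {})


class MyDatatype(PimlicoDatatype):
    class Reader:
        def read(self):
            return "reading from " + self.base_dir


reader_cls = build_reader_cls(MyDatatype)
reader = reader_cls("/path/to/data", MyDatatype())
print(reader.read())   # the base Reader's __init__ set base_dir for us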

The old data_ready() method now becomes a classmethod of the Reader. As before, it should call the super class' data_ready() method, which it can do using super() (even though it doesn't look like it has a super class!). This will be called before the reader is instantiated, so the __init__() of a reader may assume that the data is ready to be read, and do any preparatory reading tasks that it wants. Also, data_ready() now takes a path as an argument and checks whether the data is available at that path. The function may then be called multiple times to check data availability at multiple possible paths.
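
A sketch of how that could look (the base Reader is subclassed explicitly here only to keep the example self-contained, and the file being checked for is invented):

import os


class PimlicoDatatype:
    class Reader:
        @classmethod
        def data_ready(cls, path):
            # Base check: the data directory must exist at this path
            return os.path.isdir(path)

        def __init__(self, path, datatype):
            # data_ready() has already been confirmed for this path, so
            # preparatory reading can safely be done here
            self.path = path
            self.datatype = datatype


class MyDatatype(PimlicoDatatype):
    class Reader(PimlicoDatatype.Reader):
        @classmethod
        def data_ready(cls, path):
            # Call the superclass' check first, then add our own
            return super().data_ready(path) and \
                os.path.exists(os.path.join(path, "data.txt"))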

The datatype will have a __call__() method, which will prepare the reader class and instantiate it. This means that the use of a datatype instance is rather similar to that of a datatype class previously (though not identical), since calling the datatype instance with the data path as an argument (which looks rather like instantiating a class) creates the datatype's reader. The reader is then instantiated with the data path and the datatype instance, which will then be available from within the reader in the datatype attribute.
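
Roughly, the __call__() behaviour might look like this (get_reader_cls is a hypothetical stand-in for whatever prepares the reader class):

class PimlicoDatatype:
    class Reader:
        def __init__(self, path, datatype):
            self.path = path
            self.datatype = datatype

    def get_reader_cls(self):
        # Stand-in for the real preparation step, which would wire up the
        # Reader inheritance across the datatype hierarchy
        return self.Reader

    def __call__(self, path):
        # Calling the datatype instance with a data path creates a reader,
        # which gets the datatype itself in its datatype attribute
        return self.get_reader_cls()(path, self)


datatype = PimlicoDatatype()
reader = datatype("/abs/path/to/data")   # looks a lot like instantiating a class
assert reader.datatype is datatype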

The path passed into the reader when instantiating it should be the absolute path to the data. Readers do not handle the path ambiguity previously built into datatypes. Rather, this is handled by the pipeline, which knows about the various locations where data might live. By the time the reader is created, we have already established the relevant path (or paths) for the data, checked whether the data is available there (using data_ready()) and chosen one of the paths (if multiple).

The rough pattern of usage of datatypes then becomes:

datatype_cls = load_datatype("some python path, e.g. from the pipeline config")
datatype = datatype_cls(pipeline)    # Possibly some kwargs here?

# There may be various possible paths where the data might be located
# This will be taken care of by the pipeline in general
for path in possible_paths:
    if datatype.data_ready(path):
        # Data is available at this path: instantiate the reader
        reader = datatype(path)
        break

How writers work is something I'm still not sure about, but I think it will be essentially the same as readers. This leads to a much tighter coupling between datatypes and writers than previously. It would mean that each datatype class has a nested Writer class, with inheritance working in the same way. Sometimes the writer would be None, as some datatypes don't provide writing functionality and we don't want to force them to.
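
If writers do go that way, the shape might be something like this sketch (the context-manager style and metadata file are assumptions, not settled design):

import json
import os


class PimlicoDatatype:
    class Writer:
        def __init__(self, path, datatype):
            self.path = path
            self.datatype = datatype

        def __enter__(self):
            os.makedirs(self.path, exist_ok=True)
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            # Mark the dataset as complete if writing finished cleanly
            with open(os.path.join(self.path, "metadata.json"), "w") as f:
                json.dump({"ready": exc_type is None}, f)


class UnwritableDatatype(PimlicoDatatype):
    # Datatypes with no writing functionality simply set Writer to None
    Writer = None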

I propose to remove the "additional datatype" functionality altogether. (a) I've almost never used it and it unnecessarily complicates data reading. (b) Some of the use cases will be now covered by readers and others can be implemented with convertor filter modules, which is easier to understand. There may be a case for reintroducing it later, but it's not used much currently, so it won't hurt to drop it for now.

Sometime soon, I will create a branch for implementing this. The change will be backwards compatible with old pipeline configs and stored data, but not with old code. It will therefore be merged in the next version of Pimlico (0.7).

markgw commented 7 years ago

Note: see comment below for further clarity on how this should be done

I will also need to work out exactly how datatype options work in the new system. Are they actually datatype options (available on a datatype instance) or are they dataset options (available once the data is available)?

My current best proposal is to have both: datatype options, specified when a datatype is instantiated and available for typechecking, etc.; and dataset options, available once you're reading the data. A datatype class then specifies two dictionaries defining these different types of options. When a datatype is used to define an input module, options from both sets may be specified as parameters, so the option names in the two sets must not overlap.
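
To illustrate, declaring the two option sets could look something like this (the dictionary format and option names are invented for the sketch):

class PimlicoDatatype:
    # Options fixed when the datatype is instantiated, usable in typechecking
    datatype_options = {}
    # Options that only come into play once the data is being read
    dataset_options = {}


class RawTextCorpus(PimlicoDatatype):
    datatype_options = {
        "encoding": {"default": "utf-8", "help": "Character encoding of the text"},
    }
    dataset_options = {
        "archive": {"default": False, "help": "Whether the files are stored in an archive"},
    }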

How exactly the dataset options work needs a little further thought. I have a feeling that they're only needed when instantiating a dataset as an input type, rather than when using the path option to point to a Pimlico dataset directory. I.e. they provide an alternative way of instantiating a dataset at the start of a pipeline, and nothing else. (The current implementation of this functionality is a bit mixed up and ad hoc, so this would be a great improvement.) Not all datatypes will permit this, so they should specify explicitly when they do (for pipeline checking and documentation).

markgw commented 7 years ago

Regarding input data and dataset options

Dataset options, discussed above, may not be needed at all. Under the current (old) system, they are only used when reading in data -- i.e. using the datatype as an input module.

I now propose the following change. As before, a datatype can be specified as a special case of a module type, to create an input module that reads a dataset from a Pimlico data directory. The only option required for this is the path to the data directory. Everything else is encapsulated in the dataset's metadata, exactly as if the dataset had come from the output of another module.

Up to now, there has been another use of this special module type, not clearly distinguished from the above. You could create an input module that reads data from an arbitrary location in an arbitrary format. The reading was controlled by input module options. I propose that this feature be removed. Instead, we define input module types that read data in from whatever format they want, controlled now by the module options, and produce the appropriate datatype(s) as their output(s).

The advantage of this is that we clearly distinguish the datatype, with its ways of presenting the data, reading it and potentially writing it within a Pimlico pipeline, from the external storage type. There can be many of the latter for a single datatype: e.g. a raw text corpus can be stored in many different ways (outside Pimlico) and each one can be catered for by providing a different input module type, without creating a meaningless distinction in the (Pimlico) datatype of the data that's read in.

This will mean dropping the input_module_options from datatypes and narrowing the special case of a module type as a datatype to the first use above (with a dir option, and potentially also datatype options, but not dataset options). It will also mean creating a whole load of new built-in module types to provide ways to read in data from common formats. These will all correspond to functionality previously provided by special datatypes, which generally were only ever intended for use as input types (e.g. the XML reader).

markgw commented 7 years ago

6cb7aac

Following the previous comment, I've added a factory to make it easy to create input reader modules for iterable corpora. I've added a first one of these: pimlico.modules.input.text.raw_text_files.

Something roughly like this is how all data should be read in at the start of a pipeline in future. Then input_module_options, and the corresponding use of datatypes as a module type in config files, can be dropped altogether, in the new datatype branch.

markgw commented 6 years ago

Regarding iterable corpora

Essentially, the current iterable corpus type and its commonly used subtype tarred corpus work well. (Though I'd like to rename TarredCorpus to e.g. GroupedCorpus.)

There's a problem with inheritance, though, in the document types (i.e. the embedded types). Typechecking ensures that the provided corpus is a subtype of that required (e.g. IterableCorpus) and further that the provided document type is a subtype of that (or one of those) required. This implies that, if the provided document type is a subtype of that required, we'll be able to use it, as with normal OO inheritance. However, this is not true. Often the subtypes alter the internal data structure that's used, meaning that the supertype's data structure is incompatible.

E.g. tokenized text is a subtype of text, which makes sense, as it provides all the text, but also something more specific. However, the data structure it provides to a module is a list of tokens, so if it is provided where a textual corpus is expected, we get an unexpected data structure, even though in principle all the information we need is there.
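
A sketch of what this looks like in terms of the typechecking just described (class and attribute names are illustrative):

class DataPointType:
    """Base class for document types embedded in a corpus (sketch)."""


class RawTextDocumentType(DataPointType):
    pass


class TokenizedDocumentType(RawTextDocumentType):
    pass


class IterableCorpus:
    def __init__(self, data_point_type):
        self.data_point_type = data_point_type

    def check_type(self, supplied):
        # Two-level check: the corpus type itself, then the document type
        return isinstance(supplied, type(self)) and \
            isinstance(supplied.data_point_type, type(self.data_point_type))


required = IterableCorpus(RawTextDocumentType())
supplied = IterableCorpus(TokenizedDocumentType())
# This passes typechecking, but under the current system the tokenized
# documents hand the module a list of tokens rather than the plain text
# structure it expects
assert required.check_type(supplied)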

A solution

One way to solve this is to stop using arbitrary data structures to represent the document data internally. Instead, use a scheme identical to the corpus typing described above. Document datatypes are Python instances, and a document itself is an object constructed from the datatype, given the document's data. This is done by a method like the current process_document(), but the result will always be a data structure in the same hierarchy, meaning that if the document type inherits from something, it will provide all the same methods and attributes.

The data should then always be accessed by the document's methods and attributes. A subtype then might change the way that data is stored, or add new annotations, etc, but still guarantees to provide everything that the supertype provides.

E.g. a text document might provide the text in an attribute text. A tokenized document can now provide the tokenized text in another attribute, but still needs to provide the text attribute, which it will do simply by joining the tokens. Now something that expects a text document can be satisfied if it receives a tokenized document, because with this proper inheritance everything that it expects is fulfilled.
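
A minimal sketch of that text/tokenized-text example (attribute names are illustrative only):

class TextDocument:
    def __init__(self, text):
        self.text = text


class TokenizedDocument(TextDocument):
    def __init__(self, sentences):
        # Store the more specific structure: a list of sentences, each a
        # list of token strings
        self.sentences = sentences

    @property
    def text(self):
        # Still provide everything a TextDocument provides: reconstruct
        # the plain text by joining the tokens
        return "\n".join(" ".join(sentence) for sentence in self.sentences)


doc = TokenizedDocument([["A", "tokenized", "sentence", "."]])
print(doc.text)        # usable wherever a text document is expected
print(doc.sentences)   # the more specific structure is also available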

markgw commented 6 years ago

This describes a major long-term enhancement to restructure how datasets are handled. This comment constitutes a plan for the enhancement and incorporates all of the relevant parts of the above comments.

Outline

Currently, a datatype is defined as a subclass of PimlicoDatatype. It is instantiated within a pipeline and the resulting object represents a particular dataset, providing the necessary methods to read the data, iterate over a corpus, read metadata, etc.

Using types (classes) to represent datatypes comes with a load of problems. It is, for example, difficult to create slight variants on a datatype, providing slightly different reading functionality. This is solved currently by dynamically creating subclasses, but this is limited and unsatisfactory.

Really a datatype should be instantiated, allowing instances to have slightly different behaviour where necessary. Then the behaviour can also be modified at the datatype level (rather than dataset) by modifying this object.

Main changes

Datatypes are represented by classes which, as now, extend the base PimlicoDatatype. Unlike now, these can be instantiated, potentially with options passed in as kwargs which affect the datatype, and they are used in this form (i.e. datatype instance, not class) for typechecking, etc. We may wish to restrict the kwargs in some way to strings, or to simple JSON types, so that they can be specified in config files.

Typechecking is done in almost exactly the same way as currently. The default behaviour is to use the type hierarchy of the datatype classes for inheritance. We will still have the current ways of overriding this, e.g. specifying special datatype requirements that can impose other typechecking constraints, or specifying an explicit datatype to use for typechecking that is different from the datatype's class.

Each datatype class will have a nested class Reader. This is where all reading behaviour is defined and it is this class that will be used for the dataset, or reader (corresponding to the instantiated datatype class in the old system).

The base PimlicoDatatype defines a Reader, which is the base class for all readers:

class PimlicoDatatype:
    class Reader:
        def __init__(self, base_dir, datatype):
            ...  # etc.

Other datatypes' Readers do not need to explicitly override this Reader class: this will be handled by Pimlico internally (somewhat like Django Meta classes). A Reader's parent class will always be the Reader of the datatype's parent class.

class MyDatatype(PimlicoDatatype):
    class Reader:
        ...  # reading functionality

The old data_ready() method now becomes a method of the datatype, since the Reader is only used once the data is ready. As before, it should call the super class' data_ready() method, which it can do using super(). This will be called before the reader is instantiated, so the __init__() of a reader may assume that the data is ready to be read, and do any preparatory reading tasks that it wants. Also, data_ready() now takes a path as an argument and checks whether the data is available at that path. The function may then be called multiple times to check data availability at multiple possible paths.
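
In sketch form (the specific file checked for is just an example):

import os


class PimlicoDatatype:
    def data_ready(self, path):
        # Base check: the data must at least exist at this path
        return os.path.isdir(path)


class MyDatatype(PimlicoDatatype):
    def data_ready(self, path):
        # Call the superclass' check, then add a datatype-specific one
        return super().data_ready(path) and \
            os.path.exists(os.path.join(path, "data.bin"))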

The datatype will have a __call__() method, which will prepare the reader class and instantiate it. This means that the use of a datatype instance is rather similar to that of a datatype class previously (though not identical), since calling the datatype instance with the data path as an argument (which looks rather like instantiating a class) creates the datatype's reader. The reader is then instantiated with the data path and the datatype instance, which will then be available from within the reader in the datatype attribute.

The path passed into the reader when instantiating it should be the absolute path to the data. Readers do not handle the path ambiguity previously built into datatypes. Rather, this is handled by the pipeline, which knows about the various locations where data might live. By the time the reader is created, we have already established the relevant path (or paths) for the data, checked whether the data is available there (using data_ready()) and chosen one of the paths (if multiple).

The rough pattern of usage of datatypes then becomes:

datatype_cls = load_datatype("some python path, e.g. from the pipeline config")
datatype = datatype_cls(pipeline)    # Possibly some kwargs here?

# There may be various possible paths where the data might be located
# This will be taken care of by the pipeline in general
for path in possible_paths:
    if datatype.data_ready(path):
        # Data is available at this path: instantiate the reader
        reader = datatype(path)
        break

Writers

Writers will work in essentially the same way as readers. This leads to a much tighter coupling between datatypes and writers than previously. It means that each datatype class has a nested Writer class, with inheritance working in the same way. Sometimes the writer will be None, as some datatypes don't provide writing functionality and we don't want to force them to.

Input data and dataset options

Dataset options, discussed above, may not be needed at all. Under the old system, they are only used when reading in data -- i.e. using the datatype as an input module. This will change as follows. As before, a datatype can be specified as a special case of a module type, to create an input module that reads a dataset from a Pimlico data directory. The only option required for this is the path to the data directory. Everything else is encapsulated in the dataset's metadata, exactly as if the dataset had come from the output of another module.

Up to now, there has been another use of this special module type, not clearly distinguished from the above. You could create an input module that reads data from an arbitrary location in an arbitrary format. The reading was controlled by input module options. This feature will now be removed. Instead, we define explicit input module types that read data in from whatever format they want, controlled now by the module options, and produce the appropriate datatype(s) as their output(s).

We now clearly distinguish the datatype, with its ways of presenting the data, reading it and potentially writing it within a Pimlico pipeline, from the external storage type. There can be many of the latter for a single datatype: e.g. a raw text corpus can be stored in many different ways (outside Pimlico) and each one can be catered for by providing a different input module type, without creating a meaningless distinction in the (Pimlico) datatype of the data that's read in.

This will mean dropping the input_module_options from datatypes and narrowing the special case of a module type as a datatype to the first use above (with a dir option, and potentially also datatype options, but not dataset options). It means creating a load of new built-in module types to provide ways to read in data from common formats. These will all correspond to functionality previously provided by special datatypes, which generally were only ever intended for use as input types (e.g. the XML reader). This phasing out has already begun and input datatypes are considered deprecated. Many have already been replaced by input modules, and this branch will complete the process, removing the functionality that allows special input modules to be created from datatypes in this way.

6cb7aac adds a factory to make it easy to create input reader modules for iterable corpora. I've added a first one of these: pimlico.modules.input.text.raw_text_files. Something roughly like this is how all data should be read in at the start of a pipeline in future.
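
For illustration, the two styles might look roughly like this in a pipeline config (the datatype path and the input module's option names are illustrative, not a reference to the final interface):

# An existing Pimlico dataset, read straight from its data directory:
# the datatype acts as the module type and only needs dir
[stored_corpus]
type=pimlico.datatypes.SomeCorpusType
dir=/path/to/pimlico/dataset

# External data in some arbitrary format, read by an explicit input
# module type that outputs the appropriate datatype
[raw_text]
type=pimlico.modules.input.text.raw_text_files
files=/data/my_corpus/*.txt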

To-do: I still need to work out how datatype options will be specified in a config file.

Typing of iterable corpora

A problem

Essentially, the current iterable corpus type and its commonly used subtype tarred corpus work well. (Though I'd like to rename TarredCorpus to e.g. GroupedCorpus.) There's a problem with inheritance, though, in the document types (i.e. the embedded types).

Typechecking ensures that the provided corpus is a subtype of that required (e.g. IterableCorpus) and further that the provided document type is a subtype of that (or one of those) required. This implies that, if the provided document type is a subtype of that required, we'll be able to use it, as with normal OO inheritance. However, this is not true. Often the subtypes alter the internal data structure that's used, meaning that the supertype's data structure is incompatible.

E.g. tokenized text is a subtype of text, which makes sense, as it provides all the text, but also something more specific. However, the data structure it provides to a module is a list of tokens, so if it is provided where a textual corpus is expected, we get an unexpected data structure, even though in principle all the information we need is there.

The solution

We will solve this by no longer using arbitrary data structures to represent the document data internally. Instead, we use a scheme identical to the corpus typing described above. Document datatypes are Python instances, and a document itself is an object constructed from the datatype, given the document's data. This is done by a method like the current process_document(), but the result will always be a data structure in the same hierarchy, meaning that if the document type inherits from something, it will provide all the same methods and attributes.

The data should then always be accessed by the document's methods and attributes. A subtype then might change the way that data is stored, or add new annotations, etc, but still guarantees to provide everything that the supertype provides.

E.g. a text document might provide the text in an attribute text. A tokenized document can now provide the tokenized text in another attribute, but still needs to provide the text attribute, which it will do simply by joining the tokens. Now something that expects a text document can be satisfied if it receives a tokenized document, because with this proper inheritance everything that it expects is fulfilled.

Additional datatypes removed

We will remove the "additional datatype" functionality altogether. (a) I've almost never used it and it unnecessarily complicates data reading. (b) Some of the use cases will be now covered by readers and others can be implemented with convertor filter modules, which is easier to understand. There may be a case for reintroducing it later, but it's not used much currently, so it won't hurt to drop it for now.

Implementation

I will create a branch for implementing this, called datatypes. The change will be backwards compatible with old pipeline configs and stored data, but not with old code. It will therefore be merged in the next version of Pimlico (0.9).

An exception is that we will probably not maintain backwards compatibility in the case of input datatypes, since these are to be removed and replaced by input modules. They are already (to some degree at least) considered deprecated and are being phased out.

markgw commented 6 years ago

Created branch datatypes

markgw commented 6 years ago

The core functionality has been complete for some time now. There are a lot of modules and datatypes still to update to the new system, but that's covered by other issues.

The datatypes branch was merged into master, so master now incorporates the new datatypes system. This means that any modules not yet updated need to be updated before they can be used. It also means that all new Pimlico projects are forced to use the new (and much better) system.