jad-b / matchwell

Classify hierarchies of labeled text
Mozilla Public License 2.0

Standardize interface for pulling data #1

Open jad-b opened 8 years ago

jad-b commented 8 years ago

Background

Since we care about dynamic datasets, we need a way of pulling in changes. With Gmail as the only source this is manageable, but we should experiment with the required interface before we add web scraping.

Proposal

Standardize the data retrieval from a source behind an interface class.

import abc


class Sourcerer(abc.ABC):
    """Sourcerers can pull data from a data source."""

    #: A one-word descriptive name for this source of data.
    name = None

    @abc.abstractmethod
    def pull(self, df=None, **kwargs):
        """Retrieve updates from the data source.

        Args:
            df (:class:`pandas.DataFrame`): DataFrame to use as a base of reference.
                Not providing one indicates that a full copy from the source is
                to be performed, if possible.

        Returns:
            A new :class:`pandas.DataFrame`, of only this source's type.
        """
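One consequence of building on abc.ABC is that a subclass which forgets to implement pull cannot be instantiated. A minimal sketch of that behaviour follows; the names Source, BrokenSource, StaticSource and can_instantiate are hypothetical and not part of matchwell, and a plain list stands in for the DataFrame.

```python
import abc


class Source(abc.ABC):
    """Stand-in for the proposed interface (hypothetical name)."""

    name = None

    @abc.abstractmethod
    def pull(self, df=None, **kwargs):
        """Retrieve updates from the data source."""


class BrokenSource(Source):
    """Forgets to implement pull()."""

    name = 'broken'


class StaticSource(Source):
    """Implements pull(); returns a plain list instead of a DataFrame."""

    name = 'static'

    def pull(self, df=None, **kwargs):
        return ['row-1', 'row-2']


def can_instantiate(cls):
    """Return True if cls() succeeds, False if abc rejects it with TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Here can_instantiate(StaticSource) succeeds while can_instantiate(BrokenSource) does not, so a source with a missing pull fails when it is constructed rather than on its first use.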

Currently, only the Gmail source exists, so implementation will be easy.

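class GmailSource(Sourcerer):
    name = 'gmail'

    def __init__(self, client=None):
        # Assumes the project's `gmail` module is imported. The parameter is
        # named `client` so that it does not shadow that module.
        self.gmail = client or gmail.Gmail()  # Run initialization.

    def pull(self, df=None, only_newer=True, **kwargs):
        # Do a diff against the data frame
        # Grab new emails
        raise NotImplementedError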
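The "diff against the data frame" step could look roughly like the following. This is only a sketch under assumptions: message IDs are taken as the diff key, rows are plain dicts rather than DataFrame rows so that it runs without pandas, and the helper new_rows and the 'id' field are hypothetical.

```python
def new_rows(known_rows, fetched_rows, key='id'):
    """Return the fetched rows whose key is not already among known_rows."""
    known = {row[key] for row in known_rows}
    return [row for row in fetched_rows if row[key] not in known]


# Rows we already hold, and rows just fetched from the source.
known = [{'id': 'a1', 'subject': 'Hello'}]
fetched = [
    {'id': 'a1', 'subject': 'Hello'},
    {'id': 'b2', 'subject': 'Invoice'},
]

# Only the row with an unseen id survives the diff.
fresh = new_rows(known, fetched)
```

With an empty known_rows every fetched row is returned, which lines up with the "full copy" behaviour described for a missing df.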

Usage:

>>> gs = GmailSource()
>>> print(gs.name)
gmail
>>> gmail_df = gs.pull(df)
>>> df = pd.concat([df, gmail_df])  # assumes `import pandas as pd`
>>> # Voila!

jad-b commented 8 years ago

After working on this implementation this morning, I think turning Gmail.list_messages into a generator would help here. The same goes for any other API call that has to iterate through paginated responses.
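A rough sketch of the generator idea, not tied to the real Gmail client: fetch_page is a hypothetical callable that takes a page token (None for the first page) and returns a (messages, next_token) pair, with next_token set to None on the last page. The fake backend below exists only to make the example runnable.

```python
def iter_messages(fetch_page):
    """Yield messages one at a time across paginated responses.

    fetch_page is a hypothetical callable: token -> (messages, next_token).
    """
    token = None
    while True:
        messages, token = fetch_page(token)
        yield from messages
        if token is None:
            break


# Fake two-page backend, keyed by page token.
PAGES = {None: (['m1', 'm2'], 'p2'), 'p2': (['m3'], None)}
calls = []


def fake_fetch(token):
    """Record each request, then return the canned page for that token."""
    calls.append(token)
    return PAGES[token]


# Pages are requested lazily: taking one message fetches only the first page.
first = next(iter_messages(fake_fetch))
calls_after_first = list(calls)

all_messages = list(iter_messages(fake_fetch))
```

Because the caller only sees a stream of messages, pull can stop consuming as soon as it reaches a message it already has, without requesting the remaining pages.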