intel / dffml

The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.
https://intel.github.io/dffml/main/
MIT License

source: file: Compression #15

Closed johnandersen777 closed 5 years ago

johnandersen777 commented 5 years ago

DFFML is hoping to participate in Google Summer of Code (GSoC) under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/. This issue, and any others tagged gsoc and project, are not generally available bugs; they are project ideas for GSoC.

Project Idea: File Source Compression

Project description:

DFFML's initial release includes a FileSource which saves and loads data from files using the load_fd and dump_fd methods.

JSON Example

https://github.com/intel/dffml/blob/dd8007d0c9f8c58c35c94faf148e2b5d6ce4c101/dffml/source/json.py#L19-L27

For the open method of FileSource

https://github.com/intel/dffml/blob/dd8007d0c9f8c58c35c94faf148e2b5d6ce4c101/dffml/source/file.py#L36-L44

Allow for reading and writing the following file formats, transparently (so without subclasses having to do anything) to any source which is a subclass of FileSource.

Skills: Python, git

Difficulty level: Easy

Related Readings/Links:

See https://docs.python.org/3/library/archiving.html for documentation

Potential mentors: @pdxjohnny

Getting Started: Figure out how to do one of the file types, probably gzip (as that probably is as simple as using https://docs.python.org/3/library/gzip.html#gzip.GzipFile if the filename ends in .gz) then move on to the rest. For now just make modifications directly to the FileSource class. We may have you split out the logic later, but don't worry about another class for now.

What we want to see in your application: Describe how you intend to solve the problem, and give us some "stretch goals", such as implementing a remote file source which reads from URLs. Don't forget to include some time for building appropriate tests.

yashlamba commented 5 years ago

Hey! I am Yash from Cluster Innovation Centre, University of Delhi, pursuing a BTech in Information Technology and Mathematical Innovations. I am interested in contributing to DFFML this summer. Can you suggest a good starting point for this project?

johnandersen777 commented 5 years ago

Hi Yash! Check out https://github.com/intel/dffml/wiki/DFFML-Ideas-Page-for-GSoC-2019#getting-started first. Make sure you can run the tests. Then I'd suggest looking at https://docs.python.org/3/library/archiving.html and making some tests which save and load those file types. After that, take what you've done and integrate it with FileSource

Note: All ideas are open to anyone until someone's proposal is chosen. See http://python-gsoc.org/students.html for more info

yashlamba commented 5 years ago

So I went through the steps and am able to run all the tests successfully. However, I have some questions about how to contribute. Can I email you directly, or is there a better way to ask questions?

johnandersen777 commented 5 years ago

Ya, you can email me: johnandersenpdx@gmail.com. However, ideally all discussion is kept on GitHub (maybe in #12) so that if you ask a question others might have, my response is viewable to them as well.

yashlamba commented 5 years ago

Going by the gzip module, it is basically a compression module that reads and writes either str or bytes. I wanted to ask whether we will be writing JSON objects or something pre-defined, or should I use arbitrary objects just to implement it and write tests for now?

For using dictionaries, we still need to use json for encoding and decoding to bytes (https://stackoverflow.com/questions/39450065/python-3-read-write-compressed-json-objects-from-to-gzip-file)
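A minimal sketch of that round trip, using only the standard library. The helper names `dump_json_gz` and `load_json_gz` and the sample data are made up for illustration and are not part of DFFML:

```python
import gzip
import json
import os
import tempfile


def dump_json_gz(path, obj):
    # gzip.open in text mode ("wt") handles the str -> bytes encoding,
    # so json.dump can write to it like a normal text file.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f)


def load_json_gz(path):
    # "rt" decompresses and decodes back to str before json parses it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round-trip a small dictionary through a temporary .json.gz file.
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, "repos.json.gz")
    example_data = {"repo0": {"features": {"a": 1}}}
    dump_json_gz(example_path, example_data)
    loaded = load_json_gz(example_path)
```

Opening in text mode avoids calling `.encode()` and `.decode()` by hand; opening in binary mode and encoding the JSON string yourself would work as well.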

I have referred mainly to https://www.journaldev.com/19827/python-gzip-compress-decompress#python-gzip-module and https://docs.python.org/3/library/gzip.html

For the basics, I have implemented the following. Is this what is needed for now? If so, I'll open a WIP PR:

For reading:

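    import gzip

    async def load_fd(self, fd):
        # fd is an already-open binary file object, so pass it as fileobj
        # (GzipFile's first positional argument is a filename).
        with gzip.GzipFile(fileobj=fd, mode='rb') as f:
            repos = f.read()
        # LOGGER.debug('%r loaded %d records', self, len(self.mem))

For writing:

    async def dump_fd(self, fd):
        data = b'data'  # placeholder payload
        with gzip.GzipFile(fileobj=fd, mode='wb') as f:
            f.write(data)

johnandersen777 commented 5 years ago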

https://github.com/intel/dffml/blob/dd8007d0c9f8c58c35c94faf148e2b5d6ce4c101/dffml/source/file.py#L43-L44

would be

    import gzip

    if self.filename.endswith('.gz'):
        # The filename ends with .gz, so open it through gzip
        # (text mode, to match the plain open() call below).
        opener = gzip.open(self.filename, 'rt')
    else:
        # Otherwise just open the file.
        opener = open(self.filename, 'r')

    with opener as fd:
        await self.load_fd(fd)

yashlamba commented 5 years ago

That's the part that opens the file (and it integrates it with FileSource, which was to be done later), I get it. What about reading, what exactly would we be reading? This might sound silly, but it is really confusing to me. Which load_fd function will be called, and based on what data?

johnandersen777 commented 5 years ago

Yes, my bad. I said later, but since we're figuring things out as we go like this, we will just do it now. FileSource is an abstract base class, which means that classes which inherit from it have to define those methods.

https://github.com/intel/dffml/blob/dd8007d0c9f8c58c35c94faf148e2b5d6ce4c101/dffml/source/file.py#L54-L60

https://github.com/intel/dffml/blob/dd8007d0c9f8c58c35c94faf148e2b5d6ce4c101/dffml/source/json.py#L19-L27

yashlamba commented 5 years ago

Got that! So ultimately, there would be a gzipsource.py that would have load_fd and dump_fd defined. But my issue with that is what data I would be reading. Do I write it as bytes-encoded JSON? This will be almost clear to me once I know what I would finally be reading or writing, since gzip only accepts either str or bytes.

johnandersen777 commented 5 years ago

What needs to be done for GZip to be finished is to modify FileSource (For open that's: https://github.com/intel/dffml/issues/15#issuecomment-475374068)

Then repeat for close (using the w flag for write instead of the r flag).

https://github.com/intel/dffml/blob/dd8007d0c9f8c58c35c94faf148e2b5d6ce4c101/dffml/source/file.py#L50-L52

Edit: To clarify, the end result is that there will be no more source classes added, just modifications of FileSource.
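As a rough, standalone sketch of that idea (not the actual FileSource code): a single helper can choose the opener from the filename suffix for both reading and writing. The name `open_maybe_compressed` and the `OPENERS` table are hypothetical, and only gzip and bz2 are covered here:

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping from filename suffix to an open()-compatible function.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_maybe_compressed(filename, mode="r"):
    """Return a text-mode file object, picking the opener from the suffix."""
    for suffix, opener in OPENERS.items():
        if filename.endswith(suffix):
            # gzip.open and bz2.open default to binary mode, so request
            # text mode explicitly to match the built-in open().
            return opener(filename, mode + "t")
    # No known compression suffix: fall back to a plain text file.
    return open(filename, mode)


# The same calling code works for plain and compressed files.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ("data.csv", "data.csv.gz", "data.csv.bz2"):
        path = os.path.join(tmpdir, name)
        with open_maybe_compressed(path, "w") as fd:
            fd.write("a,b\n1,2\n")
        with open_maybe_compressed(path, "r") as fd:
            assert fd.read() == "a,b\n1,2\n"
```

Because the returned object behaves like a normal text file, subclasses that implement load_fd and dump_fd would not need to know whether the file is compressed.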

yashlamba commented 5 years ago

@pdxjohnny You mentioned that gzip is the easiest to implement. If I start implementing the bz2 module, what differences might I face? I read the module documentation and it seems almost the same.

johnandersen777 commented 5 years ago

I'm not sure. I'd say that you could start by creating a testcase, and using those modules to create files of those types with JSON or CSV data in them. Then see if JSONSource and CSVSource read the correct repo data from the files. They should throw errors until you implement the correct handling for gzip, etc., in the open and close methods of FileSource.
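A sketch of what such a test case might look like. Since the constructors of JSONSource and CSVSource are not shown in this thread, this version reads the files back with the standard library's csv module instead; the names `ROWS`, `write_csv`, `read_csv` and `TestCompressedCSV` are hypothetical:

```python
import bz2
import csv
import gzip
import os
import tempfile
import unittest

# Sample records; csv reads every value back as a string.
ROWS = [{"name": "a", "value": "1"}, {"name": "b", "value": "2"}]


def write_csv(opener, path):
    # Text mode with newline="" is what the csv module expects.
    with opener(path, "wt", newline="") as fd:
        writer = csv.DictWriter(fd, fieldnames=["name", "value"])
        writer.writeheader()
        writer.writerows(ROWS)


def read_csv(opener, path):
    with opener(path, "rt", newline="") as fd:
        return list(csv.DictReader(fd))


class TestCompressedCSV(unittest.TestCase):
    def test_round_trip(self):
        # gzip.open and bz2.open take the same arguments, so one loop
        # covers both formats.
        for suffix, opener in ((".gz", gzip.open), (".bz2", bz2.open)):
            with self.subTest(suffix=suffix):
                with tempfile.TemporaryDirectory() as tmpdir:
                    path = os.path.join(tmpdir, "data.csv" + suffix)
                    write_csv(opener, path)
                    self.assertEqual(read_csv(opener, path), ROWS)
```

Saved in a test module, this can be run with `python -m unittest`. In the real test, `read_csv` would be replaced by loading the file through the source class under test.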

yashlamba commented 5 years ago

Okay, Got it! Thank you so much for your help and patience.

johnandersen777 commented 5 years ago

No problemo! Thank you for your contribution!

yashlamba commented 5 years ago

Should I start with bz2? I read the module and found that there's nothing much different.

johnandersen777 commented 5 years ago

Ya go with whatever sounds good to you

yashlamba commented 5 years ago

Hey! So would support for .tar files be useful? If we can wrap this up soon, it would be easy for me to document.

johnandersen777 commented 5 years ago

Hi Yash! Sorry, I am still working on a reply to your email. I think this is pretty much done. I don't think tar support is needed right now. If you want to document what's been implemented in relation to this, that would be awesome. Thank you!

yashlamba commented 5 years ago

Okay, I'll start working on documenting this and the other source-related classes. I have pretty much spent the past couple of days understanding the code for them. Thank you.

johnandersen777 commented 5 years ago

Sweet! Just ping me if there's anywhere you need clarification.

yashlamba commented 5 years ago

Hey! So I have a couple questions:

  1. Are we planning to finalize the zip module?
  2. How detailed should the documentation be? I have written about the FileSource class and its subclasses, along with the supported modules, but do I need to document the superclass (Source) and its functionality too?

johnandersen777 commented 5 years ago

  1. Yes! (I've been caught up with #25)
  2. Do whatever you feel like, but it would probably be good to document Source.

johnandersen777 commented 5 years ago

Closed via #38