m3dev / gokart

Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipeline.
https://gokart.readthedocs.io/en/latest/
MIT License
305 stars 57 forks source link

[Feature Request] add support for `.lmdb` #261

Open S-aiueo32 opened 2 years ago

S-aiueo32 commented 2 years ago

I would like to add .lmdb to the file formats supported by TaskOnKart.make_target(). .lmdb is the format used by several popular datasets, and is actually suitable for handling large datasets (especially images).

For those who are not familiar with .lmdb, here is a brief explanation of its usage. .lmdb is a Key-Value store whose values can be retrieved via the lmdb.Environment object:

import lmdb

# open environment
env = lmdb.open('foo')

# put item
key: bytes = ...
value: bytes = ...
with env.begin(write=True) as txn:
    txn.put(key, value)

# get item
with env.begin() as txn:
    _value = txn.put(key)
    assert value == _value

The following changes are necessary to support .lmdb:

  1. Add LmdbFileProcessor:

    
    class LmdbFileProcessor(FileProcessor):
    
    def __init__(self, file_path: str) -> None:
        self.__file_path = file_path
        super(LmdbFileProcessor, self).__init__()
    
    @property
    def env(self) -> lmdb.Environment:
        # To avoid creating `env` in the constructor for paths other than .lmdb. 
        if not hasattr(self, '__env'):
            self.__env = lmdb.open(self.__file_path, subdir=False, lock=True)
        return self.__env
    
    def load(self, file):
        with self.env.begin(write=False) as txn:
            for k, v in txn.cursor():
                yield k, v
    
    def dump(self, obj):
        key, value = obj
        with self.env.begin(write=True) as txn:
            txn.put(key, value)

...

def make_file_processor(file_path: str, store_index_in_feather: bool) -> FileProcessor: extension2processor = { ... '.lmdb': LmdbFileProcessor(file_path) }


2. Prevent `SingleFileTarget` from opening files before dumping.
```python
    def _dump(self, obj) -> None:
        if self._processor.__class__.__name__ == 'LmdbFileProcessor':
            self._processor.dump(obj)
        else:
            with self._target.open('w') as f:
                self._processor.dump(obj, f)
  1. Write tasks!

    
    class DownloadImages(GokartTask):
    def run(self):
        image_ids = self.load()
        for image_id in image_ids:
            image: bytes = imread(image_id)
            self.dump((image_id.encode(), image))
    
    def output(self):
        return self.make_target(f'{self.__class__.__name__}.lmdb')

class LoadImages(GokartTask): def requires(self): return DownloadImages()

def run(self):
    for key, value in self.load():  # load keys and values of previous db
        # do something with the images
        ...


They work perfectly in my environment, but there are a few untested concerns:

1. It may violate the design philosophy of gokart and luigi to call `dump()` only once at the end of `run()`. 
2. It may cause problems with parallelization.

I would like to commit what I have sorted out here as a Pull Request, and I would like to discuss it in this issue.
Hi-king commented 2 years ago

Cool feature :)

Mostly I agree to this proposal.

It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().

I think this is not a problem. With current implementation, we can dump multiple times with named targets as follows

class Task(GokartTask):
    def requires(self):
        return dict(
           a=self.make_target("a.pkl"),
           b=self.make_target("b.pkl"),
       )

    def run(self):
      self.dump(1, "a.pkl")
      self.dump(2, "b.pkl")

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

S-aiueo32 commented 2 years ago

Thank you for your reply.

For now, I have implemented a way to use SingleFileTarget to support lmdb, and passed the test locally. https://github.com/S-aiueo32/gokart/tree/4d65e5f359a44d4113945dd82fc2c3171fa973da In LocalTargetTest, there was an error around the locking of lmdb, so I made sure not to lock it when opening lmdb.environment (probably not a problem since it is locked in other parts of gokart).

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

I see. It is indeed awkward to have a conditional branch depending on the processor type in SingleFileTarget. It would be better to create a new LmdbTarget that inherits from SingleFileTarget, and rewrite load and dump.

If we create a new LmdbTarget, do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target?

mski-iksm commented 2 years ago

@S-aiueo32 Thank you for proposing great feature!

It would be better to create a new LmdbTarget that inherits from SingleFileTarget

I think LmdbTarget should inherite TargetOnKart just like SingleFileTarget or ModelTarget, not to make overcomplicating inheritance dependencies.

do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target

I think creating new function make_lmdb_target is better, just like existing make_model_target.

hirosassa commented 2 years ago

@S-aiueo32 Hi, This is just a friendly reminder (not urgent). What's the status of this issue?