lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

[Bug] - Data partitioner reads repeated data in stream #1191

Closed: gfinol closed this issue 10 months ago

gfinol commented 11 months ago

When using the data partitioner's obj_chunk_number or obj_chunk_size parameters and reading data from obj.data_stream, I found that the total amount of data read is larger than the size of the input file.

Here is a code snippet to reproduce the bug. I tested it with a file of 5,000,000 bytes, where each line of the file is 100 bytes long.

    import lithops

    def map_function(obj):
        # Read this mapper's partition in fixed-size chunks
        chunk_size = 10000
        data = obj.data_stream

        chunk = data.read(chunk_size)
        r = b""
        while chunk:
            r += chunk
            chunk = data.read(chunk_size)

        # Total number of bytes read by this mapper
        return len(r)

    # fexec = lithops.FunctionExecutor()
    fexec = lithops.FunctionExecutor(backend='localhost', storage='localhost')

    dataset = "/path/to/file.txt"

    fexec.map(map_function, dataset, obj_chunk_number=2)
    res = fexec.get_result()
    print(res)

The bug is also present when using aws_lambda and aws_s3 as backends.

I've tried different values for both obj_chunk_number and obj_chunk_size. The current behavior is that each of the first N-1 mapper functions reads chunk_size extra bytes; only the last function reads the correct number of bytes.

According to the lithops documentation, each obj_chunk_size must be at least 1 MiB. That condition is fulfilled in this case. Is there any other requirement that I may be missing?
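
A quick way to confirm the over-read is to compare the summed per-mapper byte counts with the size of the file on disk. This is only a sketch and assumes the same local dataset path and res result list as in the snippet above:

    import os

    total_read = sum(res)                  # bytes read across all mappers
    file_size = os.path.getsize(dataset)   # actual size of the input file
    print(total_read, file_size)
    # With the behavior described above, total_read exceeds file_size by
    # roughly (obj_chunk_number - 1) * chunk_size bytes.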

gfinol commented 11 months ago

Note: When reading data using obj.data_stream.read(), each function reads the correct number of bytes.

The problem only occurs when reading chunks from the data_stream.
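
For example, a map function that does a single unsized read returns the expected partition length (a minimal sketch of the working variant):

    def map_function_full_read(obj):
        # A single read() with no size argument returns exactly the bytes
        # of this mapper's partition, so the length is correct.
        data = obj.data_stream.read()
        return len(data)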

JosepSampe commented 10 months ago

I understand the bug appears when you use a custom chunk_size, right?

gfinol commented 10 months ago

Yes, that's right.

A custom chunk_size passed to read() on the obj.data_stream created by the data partitioner.
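
This symptom, each of the first N-1 mappers returning chunk_size extra bytes, suggests that sized reads are not being clamped to the partition boundary. Purely as an illustrative sketch, and not the actual lithops internals or the eventual fix, a partition-aware stream would need to cap every read() at the bytes remaining in its byte range:

    import io

    class CappedStream:
        """Illustrative wrapper that never reads past a partition boundary."""

        def __init__(self, raw, partition_size):
            self.raw = raw
            self.remaining = partition_size

        def read(self, size=-1):
            if self.remaining <= 0:
                return b""
            if size is None or size < 0:
                size = self.remaining
            # Clamp the requested size to what is left in this partition
            size = min(size, self.remaining)
            chunk = self.raw.read(size)
            self.remaining -= len(chunk)
            return chunk

    # A 25-byte "partition" read in 10-byte chunks yields 25 bytes, not 30,
    # because the final read is clamped to the partition boundary.
    stream = CappedStream(io.BytesIO(b"x" * 100), partition_size=25)
    total = b""
    chunk = stream.read(10)
    while chunk:
        total += chunk
        chunk = stream.read(10)
    assert len(total) == 25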

JosepSampe commented 10 months ago

@gfinol I added a potential fix for this in #1215. Can you try with the master branch?

gfinol commented 10 months ago

@JosepSampe Looks like #1215 fixed this. I'll close the issue.

Thank you very much!