scramjetorg / scramjet

Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.
https://www.scramjet.org
MIT License

Slow execution time of reading a big file #128

Open tomkeee opened 2 years ago

tomkeee commented 2 years ago

I've tested how fast the platform analyzes a big .csv file (532 MB, 3 235 282 lines). The execution time of the program (code below) is about 25 minutes.

The program should just print the current line number with a very simple comment.

main.py

from scramjet.streams import Stream

lines_number = 0

def count(x):
    # Side effect only: bump the global line counter for each chunk.
    global lines_number
    lines_number += 1
    return lines_number

def show_line_number(x):
    global lines_number
    if lines_number < 1000:
        return f"{lines_number} \n"
    elif lines_number > 2000:
        return f"{lines_number} bigger than 2000 \n"
    return None

def run(context, input):
    x = (Stream
        .read_from(input)
        .each(count)
        .map(show_line_number)
    )
    return x

package.json

{
    "name": "@scramjet/python-big-files",
    "version": "0.22.0",
    "lang": "python",
    "main": "./main.py",
    "author": "XYZ",
    "license": "GPL-3.0",
    "engines": {
        "python3": "3.9.0"
    },
    "scripts": {
        "build:refapps": "yarn build:refapps:only",
        "build:refapps:only": "mkdir -p dist/__pypackages__/ && cp *.py dist/ && pip3 install -t dist/__pypackages__/ -r requirements.txt",
        "postbuild:refapps": "yarn prepack && yarn packseq",
        "packseq": "PACKAGES_DIR=python node ../../scripts/packsequence.js",
        "prepack": "PACKAGES_DIR=python node ../../scripts/publish.js",
        "clean": "rm -rf ./dist"
    }
}

requirements.txt

scramjet-framework-py

Tatarinho commented 2 years ago

@tomkeee Printing is slow in every language; if you want to print every line of a big file, keep in mind that this will drastically slow the operation down. What is the execution time (processing the full file) without printing?
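
For example, a minimal variant (a sketch reusing only the Stream.read_from and .each calls from your main.py) that keeps the counting but drops the map/print step would isolate the read cost:

from scramjet.streams import Stream

lines_number = 0

def count(x):
    # Count chunks only; pass the data through unchanged.
    global lines_number
    lines_number += 1
    return x

def run(context, input):
    # Same pipeline as main.py, minus the map/print step.
    return Stream.read_from(input).each(count)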

tomkeee commented 2 years ago

@Tatarinho You are right that print operations are quite slow, yet I tried a similar operation on my local machine (code below) and the execution time was 55 seconds (on Scramjet it was 25 minutes).

import time

lines_number = 0
with open("/home/sirocco/Pulpit/data.csv") as file_in:
    start = time.time()
    for i in file_in:
        # Same logic as show_line_number() in main.py, printed directly.
        if lines_number < 1000:
            print(f"{lines_number} \n")
        elif lines_number > 2000:
            print(f"{lines_number} bigger than 2000 \n")
        lines_number += 1

    print(f"the line_numbers is {lines_number}\n execution time: {time.time()-start}")

Screenshot from 2022-09-29 09-20-57 (output of the local run)

MichalCz commented 2 years ago

Hi @tomkeee, we'll be looking into this next week.

MichalCz commented 2 years ago

Hmm... so I did some initial digging and ran a test over the local network: a similar program in Node works quite fast, though not as fast as reading straight from disk.

We need to take the network connection into account, but that wouldn't explain 25 minutes.

Could you follow this guide: https://docs.scramjet.org/platform/self-hosted-installation

Then, based on that, can you try your program with the data sent to 127.0.0.1? That would rule out the network and the platform configuration as the culprit...
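
In the meantime, if you want a rough baseline for loopback throughput alone, something like this could help (plain standard-library Python, nothing Scramjet-specific; FILE_PATH and the port are placeholders) to see how fast the same 532 MB moves over 127.0.0.1:

import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9999   # placeholder port
FILE_PATH = "data.csv"           # placeholder: point at the same 532 MB file

def serve_file():
    # Serve the file once over a loopback TCP connection, then close.
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn, open(FILE_PATH, "rb") as f:
            while chunk := f.read(64 * 1024):
                conn.sendall(chunk)

threading.Thread(target=serve_file, daemon=True).start()
time.sleep(0.5)  # give the listener a moment to start

start = time.time()
total = 0
with socket.create_connection((HOST, PORT)) as sock:
    while chunk := sock.recv(64 * 1024):
        total += len(chunk)
print(f"read {total} bytes over loopback in {time.time() - start:.1f} s")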