petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

Help Wanted: Bad performance using etl.dicts(self.table) #670

Open Maksymilian-Plywaczyk opened 1 week ago

Maksymilian-Plywaczyk commented 1 week ago

What happened?

Hi guys!

I need your help with a performance problem in etl.dicts(). When I consume that function with list(), performance is very poor. The call looks like this: etl.dicts(self.table)

What is the expected behavior?

Can someone give me advice or a hint on how to convert a table to a list of dictionaries in an efficient way? Or maybe suggest another solution. In the end I need to convert the Table type back to Python native data structures.

Reproducible test case

No response

In what version of petl did you find the bug?

v1.7.12

Version

python 3.13+

What OS are you seeing the problem on?

Linux

What OS version are you using?

No response

What package manager did you use to install?

pip

What are the currently installed packages?

No response

Relevant log output

No response

Additional Notes

No response

juarezr commented 1 week ago

@Maksymilian-Plywaczyk,

What source/destination are you reading/writing? Do you have any code snippet for analysis?

Maksymilian-Plywaczyk commented 1 week ago

@juarezr Yes, I am writing a service to transform a given dataset. The dataset is a response from an external API. I am building this service using method chaining. At the end, after transforming, I want to get back the same type of dataset that was given. The last method I call is convert_table_to_dict, and that is where I get very bad performance. The dataset has only ~100 rows.

from datetime import datetime
from typing import Any

import petl as etl


class DatasetTransformation:
    def __init__(self, dataset: list[dict[str, Any]]):
        self.dataset = dataset
        self.table = self._transform_dataset_to_table()

    def _transform_dataset_to_table(self) -> etl.Table:
        return etl.fromdicts(self.dataset)

    def drop_fields(self, fields: list[str]) -> "DatasetTransformation":
        self.table = etl.cutout(self.table, *fields)
        return self

    def add_date_field_from_timestamp(
            self, timestamp_field: str, date_field: str
    ) -> "DatasetTransformation":
        def format_date(row) -> str:
            # Parse the ISO-8601 timestamp and keep only the date part.
            timestamp = row[timestamp_field]
            return datetime.strptime(
                timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"
            ).strftime("%Y-%m-%d")

        self.table = etl.addfield(self.table, date_field, format_date)
        return self

    def add_field(
            self, field_name: str, value_function: Any = None
    ) -> "DatasetTransformation":
        self.table = etl.addfield(self.table, field_name, value_function)
        return self

    def convert_table_to_dict(self) -> list[dict[str, Any]]:
        # Materializes the whole lazy pipeline into memory at once.
        converted_data = list(etl.dicts(self.table))
        return converted_data

    def to_table(self) -> etl.Table:
        return self.table
juarezr commented 6 days ago

This call will make your petl object materialize:

converted_data = list(etl.dicts(self.table))

What will happen is:

  1. The list() call will traverse the entire result set that self.table retrieves.
  2. All result rows/dicts will be allocated in the Python app's memory at once.
  3. Each service call will allocate its own full list of rows, so the app's overall memory usage grows with the number of concurrent calls (a sketch of the difference follows this list).
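
As a minimal sketch of the difference (the toy table here is illustrative, not from your code):

import petl as etl

table = etl.fromdicts([{"a": 1}, {"a": 2}, {"a": 3}])

# Lazy: etl.dicts() returns an iterable view; no rows are produced yet.
rows = etl.dicts(table)

# Materialized: list() pulls every row through the whole pipeline and
# holds all of them in memory at once.
all_rows = list(rows)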

What happens if you change to something like:

from collections.abc import Iterable

def convert_table_to_dict(self) -> Iterable[dict[str, Any]]:
    # Return the lazy view; rows are produced only when iterated.
    data_generator = etl.dicts(self.table)
    return data_generator

If you could consume it this way:

  1. Each service method invocation would allocate just one object/row/dict at a time, for each petl transformation used.
  2. You could write the response through a foreach loop and stream rows as you receive them from the source/database, as in the sketch below.
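
For example, a consumer could look roughly like this (a sketch; stream_rows and the CSV output are illustrative, not part of your code):

import csv

import petl as etl

def stream_rows(table: etl.Table, path: str) -> None:
    # Iterate the lazy view; only one row dict is held in memory at a time.
    with open(path, "w", newline="") as f:
        writer = None
        for row in etl.dicts(table):
            if writer is None:
                # Initialize the CSV writer from the first row's keys.
                writer = csv.DictWriter(f, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)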

I used this trick in a custom Airflow Operator and it worked well.

But maybe I'm missing something...

Maksymilian-Plywaczyk commented 6 days ago

@juarezr thank you for your answer. Let me give you more context. I have a function that saves data to my filesystem, and its parameter is the data returned from convert_table_to_dict(self). I have checked that the moment convert_table_to_dict is called is when my app's performance collapses. I also tried returning the Table object from to_table, having save_data take the Table object as a parameter, and writing it in a try/except like the one below, but that also performs very poorly. My impression is that the last method call is always the slow one, as if everything is clogged up until then.

try: etl.tocsv(data, "filename")

The save data function:

import os

import pandas as pd
from django.conf import settings
from django.utils import timezone

def save_data(self, data: list[dict[str, Any]]) -> None:
    timestamp = timezone.now().strftime("%Y-%m-%d-%H-%M-%S")
    filename = self.FILENAME_TEMPLATE.format(timestamp=timestamp)
    filepath = os.path.join(settings.MEDIA_ROOT, self.REPORTS_PATH, filename)

    # Flatten the list of dicts one level and write it out with pandas.
    df = pd.json_normalize(data, max_level=0)
    df.to_csv(filepath, index=False)

    self.model.objects.create(filename=filename)
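
Based on the suggestion above, an alternative I could try is to let petl write the file itself. A sketch, assuming the same attributes as save_data; save_data_streaming is a hypothetical name:

def save_data_streaming(self, table: etl.Table) -> None:
    # Hypothetical variant: write the lazy petl table straight to disk
    # with etl.tocsv, which iterates rows one at a time instead of
    # materializing a pandas DataFrame first.
    timestamp = timezone.now().strftime("%Y-%m-%d-%H-%M-%S")
    filename = self.FILENAME_TEMPLATE.format(timestamp=timestamp)
    filepath = os.path.join(settings.MEDIA_ROOT, self.REPORTS_PATH, filename)
    etl.tocsv(table, filepath)
    self.model.objects.create(filename=filename)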

# Where this class and function are used.
# (api_client and get_field_from_url_field belong to a fuller version of
# DatasetTransformation than the snippet shown above.)
def transform_data(self, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
    transformation = DatasetTransformation(data, api_client=APIClient())
    transformed_data = (
        transformation.add_date_field_from_timestamp("edited", "date")
        .get_field_from_url_field("films")
        .drop_fields(["created", "edited"])
        .convert_table_to_dict()
    )
    return transformed_data

def get(self, request, *args, **kwargs):
    data = self.get_data()
    transformed_data = self.transform_data(data)
    self.save_data(transformed_data)
    return redirect("index")
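
Put together, an end-to-end lazy version might look like this (a sketch; it uses the hypothetical save_data_streaming above and omits get_field_from_url_field, which is not shown in this thread):

def get(self, request, *args, **kwargs):
    data = self.get_data()
    # Keep the pipeline lazy: to_table() returns the un-materialized
    # Table, which is only iterated once, during the CSV write.
    table = (
        DatasetTransformation(data)
        .add_date_field_from_timestamp("edited", "date")
        .drop_fields(["created", "edited"])
        .to_table()
    )
    self.save_data_streaming(table)
    return redirect("index")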