nkuehn closed this issue 8 years ago
The lookup value transformer was designed for lookup tables that fit into memory. The reason for this is that every transformed row needs random access to the lookup table.
If you have a big lookup dataset, then something like a relational database would probably be a better fit. They were designed to handle large sets of data and especially relations between different entities (joins). If your dataset is too big for modern relational databases (like Postgres), then I would suggest something like Hadoop + Spark.
One possibility would be to implement a new value transformer which is similar to the lookup one, but uses some relational or NoSQL DB to look up the values (possibly with a caching layer on top of it). But I feel that this is out of scope for this particular project.
The value transformer mechanism was designed to be open and extensible, so you can easily implement use-case-specific value transformers. I would recommend looking at https://github.com/sphereio/sphere-product-mapper as an example: it defines several application-specific transformers which use the SPHERE.IO API to produce values.
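To give an idea, a DB-backed lookup transformer with a small cache could look roughly like this. This is just a sketch: the `transform(value, done)` shape and the `db.findValue` call are placeholders, not the actual csv-mapper transformer contract or a real DB client API; the real contract is in transformer.coffee.

```coffee
# Rough sketch only. `transform(value, done)` and `db.findValue` are placeholders,
# not the actual csv-mapper transformer contract or a real DB client API.
class DbLookupTransformer
  constructor: (@db) ->
    @cache = {}                      # simple in-memory cache for repeated keys

  transform: (value, done) ->
    return done(null, @cache[value]) if value of @cache
    @db.findValue value, (err, result) =>
      return done(err) if err
      @cache[value] = result         # remember it for subsequent rows
      done(null, result)
```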
You mean this one: https://github.com/sphereio/sphere-product-mapper/blob/master/src/coffee/sphere_transformer.coffee
I had hoped to be able to skip the CoffeeScript era...
@OlegIlyenko could you point me to the place in the code where the CSVs are loaded and their in-memory representation is built? That's usually the place where the memory inefficiencies are to be found (the memory usage is multiple times higher than the size of the files loaded).
Yes, sphere_transformer.coffee defines the sphere-specific value transformers.
You can find the lookup CSV loading code here:
https://github.com/sphereio/csv-mapper/blob/master/src/coffee/transformer.coffee#L244
Thanks. One idea: instead of loading the full file into a `contents` variable and then additionally building arrays for every column in `data`, we could stream-extract just the key column and the value column into a Map object, so that neither the whole file nor the unused columns are kept in memory. As a side effect, that would also make the mapping itself faster (https://github.com/sphereio/csv-mapper/blob/master/src/coffee/transformer.coffee#L270 looks like it does a full scan on every lookup).
Am I getting the code right? CoffeeScript noob, you know.
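Roughly what I have in mind (an untested sketch: I'm assuming the `csv-parse` streaming parser here, not whatever csv-mapper actually uses internally, and `keyCol` / `valueCol` are placeholders for whatever the lookup config calls those columns):

```coffee
fs    = require 'fs'
parse = require 'csv-parse'     # assuming a csv-parse v4-style API where parse(options) returns a stream

# Hypothetical helper, not csv-mapper code: builds a key -> value Map from a lookup CSV
# without keeping the whole file (or the unused columns) in memory.
loadLookupMap = (path, keyCol, valueCol, done) ->
  map = new Map()
  fs.createReadStream(path)
    .pipe(parse(columns: true))                                  # one row object at a time
    .on('data', (row) -> map.set(row[keyCol], row[valueCol]))    # keep only the two columns we need
    .on('end', -> done(null, map))
    .on('error', (err) -> done(err))

# In the lookup transformer, the full scan per row would then become a constant-time
#   value = map.get(key)
```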
Yes, you are right. It would be a good improvement for big lookup tables.
In this mapping job https://github.com/nkuehn/sphere-icecat-importer/blob/master/mapping_categories.yaml
the memory used by csv-mapper explodes quickly because of the large number of lookups against external CSV files (which are dynamic data with thousands of lines and therefore cannot be included as inline maps). It would be cool if that were not the case, but it is probably tricky since the external lookup CSVs are all different.