juancarlospaco / faster-than-csv

Faster CSV for Python
https://juancarlospaco.github.io/faster-than-csv
MIT License
99 stars 8 forks source link

faster_than_csv.csv2list() has outdated documentation #12

Closed SekouDiaoNlp closed 3 years ago

SekouDiaoNlp commented 3 years ago

Describe the bug The documentation for csv2list says that the function has the following signature:

csv2list(csv_file_path: string; has_header: bool = true; separator: char = ',';
             quote: char = '\"'; skipInitialSpace: bool = false; verbose: bool = false): seq[
    string] {...}

To Reproduce Steps to reproduce the behavior:

  1. pip install faster_than_csv
  2. create a new python file and import faster_than_csv as csv
  3. Load a csv file following example : csv.csv2list("example.csv")
  4. See error : TypeError: csv2list() takes exactly 7 arguments (1 given)
  5. After reading the documentation for csv2list I noticed that only 6 arguments are in the function signature so I tried to cal : csv.csv2list("example.csv" **kwargs)with all the keyword-arguments from the documentation. This time I got the following error: TypeError: csv2list() takes exactly 7 arguments (6 given)

Expected behavior The expected behaviour was that just a call to csv.csv2list("example.csv") should work, as all the keyword arguments have default values. It is even more troubling csv.csv2list("example.csv" **kwargs) with all the keyword-arguments from the documentation ddes not work either.

What is the proper way to call faster_than_csv.csv2list()

Desktop (please complete the following information):

Additional context Can you please tell me what is the proper way to call faster_than_csv.csv2list() and maybe update the documentation to make it more obvious how to use the library

Thank you in advance

@SekouDiaoNlp.

SekouDiaoNlp commented 3 years ago

I digged into the code at aster-than-csv/faster_than_csv/faster_than_csv.nim

and saw that the actual function has a completely different signature than on the documentation:

proc csv2list*(csv_file_path: string; columns: Natural; rows: Natural; separator: char = ',';
    quote: char = '"'; escape: char = '\x00'; skipInitialSpace: bool = false): seq[string] {.exportpy, noinit.} =
  ## Stream Read CSV to a list of strings.
  result = newSeqOfCap[string](columns * rows)
  let parser {.noalias.} = create CsvParser
  parser[].open(csv_file_path, separator, quote, escape, skipInitialSpace)
  parser[].readHeaderRow()
  while parser[].readRow(columns):
    for column in parser[].headers.items:
      result.add parser[].rowEntry(column)
  parser[].close()
  dealloc parser

But there are no comments as to what columns: Natural; rows: Natural correspond to and what are their possible values.

I tried with all default keyword arguments but got this error:

TypeError: csv2list() got multiple values for argument columns

SekouDiaoNlp commented 3 years ago

Are we supposed to manually supply the number of rows and columns of the csv File

That would be pointless, if we have to open the file split the first line to get the column number then count the number of rows.

In that case It is way better to load the csv yourself by iterating the rows and splitting each row by the separator.

I am missing something?

Cheers,

@SekouDiaoNlp .

juancarlospaco commented 3 years ago

Documentation Bug, yes I think I forgot to update docs. :neutral_face:

juancarlospaco commented 3 years ago

Are we supposed to manually supply the number of rows and columns of the csv File

Yes, it is the rows and columns of the CSV.

But can be a guess, an approximation, is Ok if it is wrong, you can just pass 1, like columns = 1, rows = 1.

That would be pointless, if we have to open the file split the first line to get the column number then count the number of rows.

So, lets learn a little about memory and data structures in computers...

Python when you add 1 char to a string, it duplicates the string in memory, the new string has the new char, allocates memory for 2 strings, then switches to the new string.

Nim when you add 1 char to a string, it adds 1 char to the same string, does not duplicates the string in memory, allocates memory for 1 char more in the same string, growing the string capacity.

So imagine you start with a new empty string "", you add 1_000_000 char to that string, thats 1_000_000 duplicated strings and 1_000_000 allocations of memory in Python, Nim optionally allows you to allocate the total capacity of strings and lists, thats 0 duplicated strings and 1 allocations of memory in Nim, this difference gets bigger with bigger sized strings or lists.

So that "capacity" to pre-allocate is the "rows" and "columns" in the function, so if you have an approximation of the rows or columns of your CSV you can use it to speed up the logic.

If you get it wrong, Nim fallbacks to do 1 allocation for each new item in the string or list, just like Python.

Example uses a string, but is the same for lists (strings are just list of char for the computer).

SekouDiaoNlp commented 3 years ago

Are we supposed to manually supply the number of rows and columns of the csv File

Yes, it is the rows and columns of the CSV.

But can be a guess, an approximation, is Ok if it is wrong, you can just pass 1.

Thanks for the explanation @juancarlospaco ! I often have to parse arbitrary csv files with arbitrary number of columns and rows so I was under the impression that I would first have to parse the file to have the dimensions of the rows and columns, and then call csv2list(), which seems very inefficient and redundant to me. If I can just pass the value 1 for both rows and columns it makes a lot more sense :sweat_smile:

That would be pointless, if we have to open the file split the first line to get the column number then count the number of rows.

So, lets learn a little about memory and data structures in computers...

....

So imagine you start with a new empty string "", you add 1_000_000 char to that string, thats 1_000_000 duplicated strings and 1_000_000 allocations of memory in Python, Nim optionally allows you to allocate the total capacity of strings and lists, thats 0 duplicated strings and 1 allocations of memory in Nim, this difference gets bigger with bigger sized strings or lists.

I am very familiar with Python so I knew about that behaviour. However I never used NIM, (it was even the first time I heard about it this morning while I was chasing where the csv2list() method was declared and found it declared in the file /faster_than_csv/faster_than_csv.nim . I could still read and understand the source code but I was looking in *.py files at first.) so I was not aware of the difference in behaviour with python.

I kinda expected that you had faster performance because you used cython, or even wrote the csv reader in C or C++.

So that "capacity" to pre-allocate is the "rows" and "columns" in the function, so if you have an approximation of the rows or columns of your CSV you can use it to speed up the logic.

If you get it wrong, Nim fallbacks to do 1 allocation for each new item in the string or list, just like Python.

Example uses a string, but is the same for lists (strings are just list of char for the computer).

Thanks again for the explanation it makes total sense, however I have 2 suggestions to make:

Thanks again for your quick feedback. I will have a look at the NIM programming language it looks very cool.

Peace,

@SekouDiaoNlp

juancarlospaco commented 3 years ago

I am very familiar with Python so I knew about that behaviour. However I never used NIM, (it was even the first time I heard about it this morning while I was chasing where the csv2list() method was declared and found it declared in the file /faster_than_csv/faster_than_csv.nim . I could still read and understand the source code but I was looking in *.py files at first.) so I was not aware of the difference in behaviour with python.

Yes it has Python-like syntax, but compiled and static typed, imagine Rust but with Python syntax, I moved from Cython to Nim, is nice that can also run in the browser, because I dont like TypeScript.

Thanks again for the explanation it makes total sense, however I have 2 suggestions to make:

  • Will it be possible again in the future to just call csv.csv2list('file.csv') without having to supply all the other keyword arguments (assuming they have sane default values)?

Ok.

  • The fact that the function is defined in a nim file in the NIM language prevents the IDE (I tried with Pycharm and VsCode to find where the function is declared as it assumes it must be a python function. Is there a way to alter this behaviour, or do you think it is not possible as the IDEs assume all variables/objects/methods/functions etc.... are written in the same language.

PyCharm, VS Code/VS Codium support Nim:

SekouDiaoNlp commented 3 years ago

Thanks @juancarlospaco you rock!

I will install the plugins.

And I think I am going to learn NIM!

Peace!

@SekouDiaoNlp

juancarlospaco commented 3 years ago

Implemented your suggestions, and done some improvements, feel free to test it... :slightly_smiling_face:

SekouDiaoNlp commented 3 years ago

@juancarlospaco , hombre, eres asombroso!!!

You've been very quick in implementing the changes.

I will download the latest version and let you know if everything works ok.

I hope more Open Source project maintainers were as proactive as you!

Muchas gracias,

@SekouDiaoNlp

SekouDiaoNlp commented 3 years ago

By the way I started learning NIM yesterday evening.

The syntax is pretty familiar as long as you are a Pythonista or a C programmer I guess.

I like the fact that you can compile your projects to C and then use any good old C compiler to create stand-alone programs with no issues with dependencies. The CFFI is also well implemented.

I guess I will code my next project in NIM to get some more experience and learn the quirks of the language.