mcaceresb / stata-parquet

Read and write parquet files from Stata
MIT License

Import ByteArray as strl #13

Closed kylebarron closed 5 years ago

kylebarron commented 6 years ago

strL is designed as a variable-width string data type for Stata. It sounds like you currently scan the first N rows of the dataset to find the longest string? It wouldn't be a horrible default to read in ByteArray as strL. Then you should never have an overflow error.

Is there a performance penalty after import to working with strings that are stored as strL, even if they aren't really long?
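
The overflow concern raised above can be sketched in Python. This is only an illustration, assuming a reader that sizes a fixed-width string column from the first N rows; `infer_width` and the sample column are hypothetical and are not taken from the plugin's code.

```python
def infer_width(values, sample_rows):
    """Longest UTF-8 byte length among the first `sample_rows` values.

    Hypothetical helper: stands in for any reader that picks a fixed
    string width by sampling only the start of a column.
    """
    return max((len(v.encode("utf-8")) for v in values[:sample_rows]), default=0)


# Hypothetical column: the long value sits past the sampled rows.
column = ["a", "bb", "ccc", "a much longer string than the sample saw"]

sampled = infer_width(column, sample_rows=3)           # width chosen from the sample
actual = infer_width(column, sample_rows=len(column))  # width actually required

# Values that would not fit in a column sized from the sample.
overflowing = [v for v in column if len(v.encode("utf-8")) > sampled]

print(sampled, actual, overflowing)
```

A variable-width type avoids this because no width has to be chosen up front.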

mcaceresb commented 6 years ago

Not possible in the current API.

kylebarron commented 6 years ago

You can't store data as strL?

mcaceresb commented 6 years ago

The plugin interface does not support writing to strL variables.

kylebarron commented 6 years ago

Do you ever come across precision issues by having all numeric data transfer to and from Stata happen in doubles?

mcaceresb commented 6 years ago

If both types are double, then there is no meaningful loss (except in edge cases for things like quantiles). If one is a float, then yes, for sure.

kylebarron commented 6 years ago

Actually, it should be fine, since a double can hold every other numeric data type available in Stata without losing precision. Every value a float can represent is also representable as a double, and a double can hold int32 values without issue.
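
A minimal Python sketch of that claim, assuming only that a Python `float` is an IEEE 754 double, the same format as a Stata double. The helper names are made up for illustration, and `struct` is used only to emulate a 32-bit float.

```python
import struct


def through_double(x):
    """Round-trip a value through an 8-byte IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<d", x))[0]


def as_float32(x):
    """Round a value to the nearest 4-byte IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

# Every int32 fits in a double's 53-bit significand, so the extremes round-trip.
int32_ok = all(
    int(through_double(float(n))) == n
    for n in (INT32_MIN, -1, 0, 1, INT32_MAX)
)

# A float32 value passed through a double comes back unchanged.
f32 = as_float32(0.1)
float_to_double_ok = as_float32(through_double(f32)) == f32

# The reverse direction is lossy: the double 0.1 is not a float32 value.
double_to_float_lossy = as_float32(0.1) != 0.1

print(int32_ok, float_to_double_ok, double_to_float_lossy)
```

So widening to double is safe, while narrowing a double to a float is where precision can be lost.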

mcaceresb commented 6 years ago

I do think this can get messed up for some summary stats if your double is really 7.000000001 or something like that, but once it is in Stata or parquet as an int then it won't matter, yes.

kylebarron commented 6 years ago

But that would've been a double in Stata or Parquet to begin with?

mcaceresb commented 6 years ago

It might have been an integer, but internally it gets passed around as a double. I really don't think it matters for read/write and type conversion, unless it's a massive integer.

. disp %21.0f 2^64 - 2
 18446744073709551616

. disp %21.0f 2^64 - 2^8
 18446744073709551616
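
The same behaviour can be reproduced outside Stata. The sketch below is in Python, assuming only that a Python `float` is an IEEE 754 double like the values Stata displays above; the variable names are illustrative.

```python
# A double has a 53-bit significand, so integers are exact only up to 2^53.
LIMIT = 2**53

exact_below = float(LIMIT - 1) == LIMIT - 1   # still exactly representable
collides = float(LIMIT + 1) == float(LIMIT)   # 2^53 + 1 rounds back to 2^53

# Near 2^64 adjacent doubles are 2048 apart, so both expressions from the
# Stata session round to exactly 2^64.
a = float(2**64 - 2)
b = float(2**64 - 2**8)

print(exact_below, collides)
print(int(a), int(b))
```

Integers beyond 2^53 in magnitude are where passing values around as doubles can start to change them.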