s-leroux / fin

Set of tools for personal investment
MIT License

Floating point data columns are sometimes displayed as ternary values #45

Closed s-leroux closed 1 month ago

s-leroux commented 2 months ago

In some circumstances, floating-point number columns are displayed as ternary columns:

from fin.api.yf import Client
from fin.seq import fc

ticker = "^FCHI"
duration = dict(days=5)

client = Client()
data = client.historical_data(ticker, duration)

# Yahoo! Finance has dirty data. Do some clean-up
data = data.where(
    (fc.all, "Open", "High", "Low", "Close", "Adj Close"),
)

print(data)

Displays:

      Date | Open     | High     | Low      | Close    | Adj Clo… |   Volume
---------- | -------- | -------- | -------- | -------- | -------- | --------
2024-05-03 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 69643700
2024-05-06 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 43781100
2024-05-07 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 58688300
2024-05-08 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |        0

This probably happens when the same column has several cached representations: we have no mechanism to select the "most appropriate" one.

s-leroux commented 1 month ago

Two possible solutions:

1. The representation to use is derived from the column's type.

2. We remember the native type the column had when it was created.

The first option does make a lot of sense. Unfortunately, the preliminary experiments increased the coupling between ColType and Column. The problem is that the Column object caches the various representations, and we'd like to keep the internal details private here. For that reason, we also don't want to expose a "conversion" interface from Column.

To reduce the coupling, we might also return a typecode from the ColType instances.

Finally, we can remember whether the column was created using one of the Column.from_xxxx_mv() factory methods.
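For illustration only, here is a minimal sketch of the "remember the native type at creation" idea. The class `SketchColumn`, the factory names `from_floats` and `from_ternaries`, and the single-letter typecodes are all hypothetical; they are not the actual fin API and only stand in for the factory methods mentioned above.

```python
class SketchColumn:
    """Hypothetical column that records which representation it was built from."""

    def __init__(self, native, values):
        # Assumed typecodes: "f" for float, "t" for ternary.
        self._native = native
        # Other representations could be added to this cache lazily.
        self._cache = {native: list(values)}

    @classmethod
    def from_floats(cls, values):
        # Hypothetical factory: the column is natively floating-point.
        return cls("f", values)

    @classmethod
    def from_ternaries(cls, values):
        # Hypothetical factory: the column is natively ternary.
        return cls("t", values)

    @property
    def native(self):
        """Typecode of the representation used at creation time."""
        return self._native


prices = SketchColumn.from_floats([10.5, 20.25])
flags = SketchColumn.from_ternaries([1, 0, -1])
```

With this approach the display code would only need to read `native`, so the cache itself could stay private.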

s-leroux commented 1 month ago

Actually, the implementation uses the "richest" representation first by default (Python objects > float > ternary), so it is not a conversion issue. The problem was with Column.c_remap, which didn't copy all the cached data types.
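To make the failure mode concrete, here is a small sketch under stated assumptions. `CachedColumn`, `RICHNESS`, `remap` and `remap_lossy` are hypothetical names that do not come from the fin code base; the sketch only mimics the behaviour described above (richest representation first, and a remap that drops part of the cache).

```python
# Assumed ordering, richest first: Python objects > float > ternary.
RICHNESS = ("object", "float", "ternary")


class CachedColumn:
    """Hypothetical column holding several cached representations."""

    def __init__(self, **representations):
        # Maps a representation name to its list of values.
        self._cache = dict(representations)

    def best(self):
        """Return (name, values) for the richest cached representation."""
        for name in RICHNESS:
            if name in self._cache:
                return name, self._cache[name]
        raise ValueError("column has no cached representation")

    def remap(self, mapping):
        """Reorder rows, carrying over every cached representation."""
        return CachedColumn(**{
            name: [values[i] for i in mapping]
            for name, values in self._cache.items()
        })

    def remap_lossy(self, mapping):
        """Faulty variant: only the ternary cache survives the remap."""
        values = self._cache["ternary"]
        return CachedColumn(ternary=[values[i] for i in mapping])


col = CachedColumn(float=[10.5, 20.25], ternary=[1, 1])
good = col.remap([1, 0]).best()
bad = col.remap_lossy([1, 0]).best()
print(good)  # ('float', [20.25, 10.5])
print(bad)   # ('ternary', [1, 1])
```

In the lossy case the richest representation left after the remap is the ternary one, so a float column would print as a column of 1.000000 values, which matches the symptom in the report.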