Closed vviers closed 3 years ago
The following code shows that the fields `siret` and `siren` are of type `object` when fetched from Mongo, but that they are cast to `str` when necessary:
import pandas as pd
from predictsignauxfaibles.data import SFDataset
FIELDS_TO_QUERY = ["siret", "siren", "periode", "outcome", "time_til_outcome"]
dataset = SFDataset(
    date_min="2015-01-01",
    date_max="2020-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=1000,
)
dataset.fetch_data()
print(dataset.data.dtypes)
print(f"First SIRET in sample: {dataset.data.siret[0]}")
print(f"First SIREN in sample: {dataset.data.siren[0]}")
Moreover, casting those fields to `str` doesn't seem to change their dtypes or the type of their individual values:
dataset.data.siret = dataset.data.siret.astype(str)
dataset.data.siren = dataset.data.siren.astype(str)
print(dataset.data.dtypes)
print(f"First SIRET in sample: {dataset.data.siret[0]}")
print(f"First SIREN in sample: {dataset.data.siren[0]}")
So a few questions for me to understand and validate:
- Can we pinpoint where SIRET/SIREN are found as integers?
Yes, you can do:
# fetch data
...
# Which SIRETs are not strings?
dataset.data[dataset.data.siret.apply(lambda x: not isinstance(x, str))]
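A self-contained sketch of that check, using a hypothetical toy DataFrame in place of `dataset.data` (the values are made up for illustration, and the names `not_str` and `offenders` are not from the project):

```python
import pandas as pd

# Hypothetical stand-in for dataset.data: one SIRET slipped in as an integer.
data = pd.DataFrame(
    {"siret": ["01234567800012", 12345678900011, "98765432100034"]},
    dtype=object,
)

# Boolean mask: True where the value is not a Python str.
not_str = data.siret.apply(lambda x: not isinstance(x, str))

# Rows whose SIRET is not a string, with the Python type of each offender.
offenders = data[not_str]
print(offenders)
print(offenders.siret.map(type))
```

Keeping the mask in its own variable also makes it easy to count the offending rows with `not_str.sum()`.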
- Why are SIRET and SIREN fields retrieved from mongo as `object` types?
There is no `str` dtype in pandas; columns holding Python strings are reported with dtype `object`, cf.
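To illustrate with a minimal sketch on toy values (not real data): the column dtype and the Python type of each element are two different things. On pandas versions where strings live in `object` columns, `astype(str)` converts every element to `str` while the reported dtype can still read `object`. Newer versions may report a dedicated string dtype instead, so that line is printed without an expected value.

```python
import pandas as pd

# Toy column mixing a string and an integer, stored with dtype object.
siret = pd.Series(["01234567800012", 12345678900011], dtype=object)
print(siret.dtype)               # object
print(siret.map(type).tolist())  # one str, one int

# Cast every element to str; the reported dtype depends on the pandas version.
as_str = siret.astype(str)
print(as_str.dtype)
print(as_str.tolist())           # ['01234567800012', '12345678900011']
```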
- Are we sure that this will fix our issue?
I tested with and without my fix on my main use case (accessing a row by SIRET, like `data[data.siret == "siret_as_string"]`) and it works 🙂
But I am open to diving deeper into the topic if needed.
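The use case can be sketched on toy data (hypothetical values, not the actual fix from this PR): while a SIRET is stored as an integer, comparing the column to a string matches nothing; once the column is cast to `str`, the same lookup returns the row.

```python
import pandas as pd

# Toy frame: SIRETs stored as Python ints in an object column.
data = pd.DataFrame(
    {
        "siret": pd.Series([12345678900011, 98765432100034], dtype=object),
        "outcome": [True, False],
    }
)

# Integer values compared to a string: no row matches.
before = data[data.siret == "12345678900011"]

# After casting, the string comparison finds the row.
data["siret"] = data["siret"].astype(str)
after = data[data.siret == "12345678900011"]

print(len(before), len(after))  # 0 1
```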
Thanks for providing those details.
I had no idea pandas had no `str` type 😅
Approving and merging.
Closes #58