signaux-faibles / predictsignauxfaibles

Repository for the Python code that produces the Signaux Faibles prediction lists.
MIT License

fix: force siren and sirets to be strings #68

Closed vviers closed 3 years ago

vviers commented 3 years ago

Closes #58

slebastard commented 3 years ago

The following code shows that the siret and siren fields have dtype object when fetched from Mongo, but that their values are cast to str when necessary:

import pandas as pd
from predictsignauxfaibles.data import SFDataset

FIELDS_TO_QUERY =  ["siret", "siren", "periode", "outcome", "time_til_outcome"]
dataset = SFDataset(
    date_min="2015-01-01",
    date_max="2020-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=1000
)
dataset.fetch_data()

print(dataset.data.dtypes)
print(f"First SIRET in sample: {dataset.data.siret[0]}")
print(f"First SIREN in sample: {dataset.data.siren[0]}")

Moreover, casting those fields to str doesn't seem to change either their dtypes or the type of their individual values:

dataset.data.siret = dataset.data.siret.astype(str)
dataset.data.siren = dataset.data.siren.astype(str)

print(dataset.data.dtypes)
print(f"First SIRET in sample: {dataset.data.siret[0]}")
print(f"First SIREN in sample: {dataset.data.siren[0]}")
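One possible reason the dtypes output looks the same before and after the cast: an object column can hold a mix of Python ints and strs, and the column dtype does not show that. The sketch below uses a small hand-made frame with made-up SIRET values (a hypothetical stand-in, not the real SFDataset) and inspects the type of each element instead of the column dtype.

```python
import pandas as pd

# Hypothetical sample: one SIRET stored as int, one as str (made-up values).
data = pd.DataFrame(
    {"siret": pd.Series([12345678900012, "98765432100034"], dtype=object)}
)

# The column dtype is object here, which says nothing about element types,
# so look at the Python type of each stored value instead.
types_before = [type(value) for value in data.siret]

# Cast every value to str, then inspect the element types again.
data["siret"] = data.siret.astype(str)
types_after = [type(value) for value in data.siret]

print(types_before)
print(types_after)
```

Checking only the first element, as in the snippets above, would miss a non-string value that sits further down the column.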

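So a few questions for me to understand and validate:

  • Can we pinpoint where SIRET/SIREN values are found as integers?
  • Why are SIRET and SIREN fields retrieved from mongo as object types?
  • Are we sure that this will fix our issue?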

vviers commented 3 years ago

So a few questions for me to understand and validate:

  • Can we pinpoint where SIRET/SIREN values are found as integers?

Yes, you can do:

# fetch data
...

# Which SIRETs are not strings?
dataset.data[dataset.data.siret.apply(lambda x: not isinstance(x, str))]

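The check for non-string SIRETs can be tried without a Mongo connection. This is a minimal sketch on a hand-made frame: the values and the outcome column contents are made up, and the frame is only a hypothetical stand-in for dataset.data.

```python
import pandas as pd

# Hypothetical stand-in for dataset.data (made-up SIRET values).
data = pd.DataFrame(
    {
        "siret": pd.Series(
            ["12345678900012", 98765432100034, "11122233300045"], dtype=object
        ),
        "outcome": [False, True, False],
    }
)

# Boolean mask: True where the stored value is not a Python str.
not_str = data.siret.apply(lambda x: not isinstance(x, str))

# Rows whose SIRET is stored as something other than a string.
offending = data[not_str]
print(offending)
```

Only the middle row is selected, since it is the only one whose SIRET was entered as an int.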
  • Why are SIRET and SIREN fields retrieved from mongo as object types?

pandas has no dedicated str dtype by default: string columns are stored with the object dtype, cf. pandas_dtypes

  • Are we sure that this will fix our issue?

I tested with and without my fix on my main use case (accessing a row by siret like data[data.siret == "siret_as_string"]) and it works 🙂
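A sketch of how such a lookup can fail and then succeed, assuming the underlying problem is that some SIRETs are stored as ints. The frame and values are hypothetical, and the astype(str) call is only an illustration of the cast, not necessarily the exact change made in this PR.

```python
import pandas as pd

# Hypothetical data: the second SIRET was stored as an int (made-up values).
data = pd.DataFrame(
    {"siret": pd.Series(["12345678900012", 98765432100034], dtype=object)}
)

target = "98765432100034"

# Before casting, the int-typed value does not compare equal to the string.
hits_before = len(data[data.siret == target])

# After casting every value to str, the same lookup finds the row.
data["siret"] = data.siret.astype(str)
hits_after = len(data[data.siret == target])

print(hits_before, hits_after)
```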

But I am open to diving deeper into the topic if needed

slebastard commented 3 years ago

Thanks for providing those details. I had no idea pandas had no str type 😅 Approving and merging.