Open DataEnggNerd opened 1 month ago
Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252
@aclark4life I have seen the comment on the attached Jira ticket. Shall we discuss the proposed change here?
Yes! Are you able to send a PR with the proposed changes?
@aclark4life I would like to discuss the design before getting into the implementation. In Jira, I saw a suggestion to add a new data type, which I am fine with. But with such an implementation, a schema would be expected to be passed only for that field. And how would a schema be passed for nested keys?
Any help is appreciated.
No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type, StrToIntField or IntToStrField, as @ShaneHarvey suggested.
While fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key has a different data type in another document, it is inferred as null.
MongoDB documentation
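The behavior described above can be illustrated with a toy model in plain Python. This is only a sketch of the reported symptom, not PyMongoArrow's actual inference code; the helper names infer_schema and apply_schema are hypothetical and exist only for this illustration.

```python
def infer_schema(docs):
    """Infer a field -> Python type mapping from the first document only."""
    first = docs[0]
    return {key: type(value) for key, value in first.items()}


def apply_schema(docs, schema):
    """Build columns; a value whose type differs from the schema becomes None."""
    columns = {key: [] for key in schema}
    for doc in docs:
        for key, expected in schema.items():
            value = doc.get(key)
            columns[key].append(value if isinstance(value, expected) else None)
    return columns


# The second document stores "code" as an int instead of a str.
docs = [{"code": "A1"}, {"code": 42}, {"code": "B7"}]
schema = infer_schema(docs)
columns = apply_schema(docs, schema)
print(columns)  # {'code': ['A1', None, 'B7']}
```

In this model the integer 42 is lost because the schema was fixed as a string after reading the first document.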
Current implementation
In case of such known discrepancies, where the first document has pyarrow.str() and subsequent documents have pyarrow.int*(), the field can be inferred as pyarrow.str() by adding an optional parameter coerce_number_to_str to all find_* APIs.
Expected implementation
Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field
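A minimal sketch of what the requested coerce_number_to_str option might do, written in plain Python so it runs without a database. The function coerce_documents is hypothetical and is not part of pymongoarrow; it assumes that only top-level fields holding a string in the first document are coerced, and it does not handle nested keys.

```python
def coerce_documents(docs, coerce_number_to_str=False):
    """Return copies of docs, optionally turning numbers into strings.

    Only fields that hold a str in the first document are coerced,
    mirroring the "first document decides the type" behavior.
    """
    if not docs:
        return []
    str_fields = {key for key, value in docs[0].items() if isinstance(value, str)}
    result = []
    for doc in docs:
        row = dict(doc)  # shallow copy so the input is left untouched
        if coerce_number_to_str:
            for key in str_fields:
                value = row.get(key)
                # bool is a subclass of int, so exclude it explicitly
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    row[key] = str(value)
        result.append(row)
    return result


docs = [{"code": "A1", "qty": 1}, {"code": 42, "qty": 2}]
coerced = coerce_documents(docs, coerce_number_to_str=True)
unchanged = coerce_documents(docs)
print(coerced)  # [{'code': 'A1', 'qty': 1}, {'code': '42', 'qty': 2}]
```

With the flag off, documents pass through unchanged, so the option would be opt-in and would not alter existing behavior.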