mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.
https://mongo-arrow.readthedocs.io
Apache License 2.0
92 stars 14 forks source link

ARROW-176 Nested extension objects are not handled in auto schema #166

Closed NoahStapp closed 1 year ago

NoahStapp commented 1 year ago

Pre-commit hooks are failing on the PR but are succeeding locally:

(pymongoarrow) nstapp@M-THWWK302GC python % pre-commit run --all-files
check for added large files..............................................Passed
check for case conflicts.................................................Passed
check toml...............................................................Passed
check yaml...............................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
forbid new submodules................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
black....................................................................Passed
isort....................................................................Passed
flake8...................................................................Passed
shellcheck...............................................................Passed
Check GitHub Workflows...................................................Passed
blacken-docs.............................................................Passed
blink1073 commented 1 year ago

It seems odd that that your local pre-commit caused changes in those files. Can you try removing ~/.cache/pre-commit/ and then run that command again?

NoahStapp commented 1 year ago

Still the same result after uninstalling pre-commit, removing the cache, and re-installing:

(pymongoarrow) nstapp@M-THWWK302GC python % pre-commit run --all-files
check for added large files..............................................Passed
check for case conflicts.................................................Passed
check toml...............................................................Passed
check yaml...............................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
forbid new submodules................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
black....................................................................Passed
isort....................................................................Passed
flake8...................................................................Passed
shellcheck...............................................................Passed
Check GitHub Workflows...................................................Passed
blacken-docs.............................................................Passed
ShaneHarvey commented 1 year ago

Are the benchmark failures due to the changes here or are they just false positives (caused by perf noise in the runner)?

NoahStapp commented 1 year ago

Are the benchmark failures due to the changes here or are they just false positives (caused by perf noise in the runner)?

It's possible they're caused by the changes here as the benchmarks experiencing regressions are Document tests, but I'm not familiar enough with the benchmarks to conclusively say so.

ShaneHarvey commented 1 year ago

@blink1073 could it be we're copying more data around than we need to when doing this new casting?

blink1073 commented 1 year ago

Yeah, a ratio of 2.31 on benchmarks.ProfileReadDocument.time_to_arrow seems to show Arrow has to recreate the full document.

blink1073 commented 1 year ago

What we have here is more technically correct though, so I think it is worth opening a new ticket to track using custom C builder classes in a minor release as opposed to this casting after the fact.

ShaneHarvey commented 1 year ago

SGTM!

NoahStapp commented 1 year ago

What we have here is more technically correct though, so I think it is worth opening a new ticket to track using custom C builder classes in a minor release as opposed to this casting after the fact.

So this change is fine for now until we implement this same behavior with custom builder classes?

blink1073 commented 1 year ago

So this change is fine for now until we implement this same behavior with custom builder classes?

Yes

blink1073 commented 1 year ago

Can you please revert the import sorting changes that are failing the pre-commit build? You may have to do it in the GitHub UI (hit the period key to open up the editor).