tgrunzweig-cpacket opened this issue 1 year ago
Hi @Tzahi-cpacket!
Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.
@Tzahi-cpacket After looking into this more, one issue is that our write_to_file_stage.py will iteratively write to files as messages are received and keep the file stream open for the duration of the stage. With the CSV and JSON lines formats, this works well for offloading data to disk. However, the only library I found that supports appending to Parquet files is fastparquet. So this leaves a few options:
- Keep all messages in memory until pipeline shutdown and then concat them into a single DataFrame which can be written to disk
- Write to Parquet using the fastparquet engine in pandas
  - This would require converting every DataFrame from GPU to CPU before writing to disk, which will have a performance penalty
- Write to a partitioned Parquet dataset
  - Each message would be treated as a different partition
  - A rough outline of how this would work can be found here: https://docs.rapids.ai/api/cudf/legacy/api_docs/api/cudf.io.parquet.parquetdatasetwriter/
Are any of these options preferable over another?
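For reference, a minimal sketch of the partitioned-dataset option based on the linked ParquetDatasetWriter docs (the output path and the "batch_id" partition column are placeholders, not Morpheus names):

```python
# Sketch of option 3: write each message as a partition of a Parquet dataset.
import cudf
from cudf.io.parquet import ParquetDatasetWriter

df = cudf.DataFrame({"batch_id": [0, 0, 1], "score": [0.1, 0.7, 0.3]})

writer = ParquetDatasetWriter("./output_dataset", partition_cols=["batch_id"])
# In the stage, write_table() would be called once per incoming message,
# adding new files under ./output_dataset each time.
writer.write_table(df)
writer.close()
```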
Perhaps we can discuss some questions regarding this?
Aren't you concerned that the files (the current CSV or JSON implementation, or a future single Parquet file) will become huge over time? This is not just about our use case; I mean in general.
There is something to be said for a "perpetual" use case, where the pipeline runs for an extended period of time: weeks, maybe months. Once we implement a "directory watcher" as the input stage, the pipeline should never shut down, right? As far as we understand, we can even update the models using a second pipeline while the inference pipeline is working. So the pipeline will just keep on going, outputting forever.
Is it technically possible to build an output DataFrame in memory during pipeline operation, and flush it to a new file when its size reaches a threshold, or when a time interval ("every 10 minutes") has elapsed?
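Something along these lines is what I mean (a rough sketch only; the class name, thresholds, and file-naming scheme are invented, not existing Morpheus code):

```python
# Illustrative sketch of "buffer in memory, flush to a new file on threshold".
import time

import cudf


class BufferedParquetWriter:
    def __init__(self, out_dir: str, max_rows: int = 100_000, max_seconds: float = 600.0):
        self._out_dir = out_dir
        self._max_rows = max_rows
        self._max_seconds = max_seconds
        self._chunks: list[cudf.DataFrame] = []
        self._rows = 0
        self._last_flush = time.monotonic()
        self._file_index = 0

    def add(self, df: cudf.DataFrame) -> None:
        """Buffer one chunk; flush to a new file when either threshold is hit."""
        self._chunks.append(df)
        self._rows += len(df)
        if self._rows >= self._max_rows or time.monotonic() - self._last_flush >= self._max_seconds:
            self.flush()

    def flush(self) -> None:
        if not self._chunks:
            return
        out = cudf.concat(self._chunks, ignore_index=True)
        out.to_parquet(f"{self._out_dir}/part-{self._file_index:05d}.parquet")
        self._file_index += 1
        self._chunks.clear()
        self._rows = 0
        self._last_flush = time.monotonic()
```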
We don't have to write Parquet; it's just a reasonable standard for fast I/O on medium and large data. Do you have other suggestions? As long as there is an open-source library for reading the format, we could probably work around it. CSV and JSON are just not very size-efficient once you go beyond small datasets.
Cheers, Tzahi
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem this feature solves
I want Morpheus to write inference outputs in Parquet format.
The current implementation of write_to_file_stage.py has some JSON- and CSV-specific handling in _convert_to_strings(), and raises NotImplementedError for other file types.
The inference output files in DFP can end up being pretty big and are likely to be processed further, so an efficient machine-readable format (like Parquet) would make deployment more efficient.
Describe your ideal solution
Extend write_to_file_stage.py so that it can also write output in Parquet format.
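A rough sketch of what a Parquet branch could look like (illustrative only, not the actual Morpheus implementation; the helper name and arguments are assumptions):

```python
# Illustrative only: one way the write stage could append Parquet output.
import cudf


def write_chunk(df: cudf.DataFrame, file_name: str, is_first_chunk: bool) -> None:
    if file_name.endswith(".parquet"):
        # cuDF cannot append to an existing Parquet file, so convert the chunk
        # to pandas and use the fastparquet engine, which supports append=True.
        df.to_pandas().to_parquet(file_name, engine="fastparquet",
                                  append=not is_first_chunk)
    else:
        # CSV / JSON lines would keep the existing string-conversion path.
        raise NotImplementedError(f"Unsupported file type: {file_name}")
```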
Describe any alternatives you have considered
No response
Additional context
No response
Code of Conduct
I agree to follow this project's Code of Conduct