[PERF] CSV reader data type detection is slow

rapidsai / cudf

cuDF - GPU DataFrame Library

https://docs.rapids.ai/api/cudf/stable/

Apache License 2.0


[PERF] CSV reader data type detection is slow #5080

Open OlivierNV opened 4 years ago

OlivierNV commented 4 years ago

Describe the bug: Data type detection accounts for ~25% of the CSV reader's total time.

Steps/Code to reproduce bug: Profile read_csv() with nvprof.

Expected behavior: Data type detection time should be negligible.

OlivierNV commented 4 years ago

I initially suspected the heavy use of atomics. That is a big problem, but it is almost entirely masked by the cache thrashfest of having many threads each reading one byte at a time, a few hundred bytes apart from each other, when searching for the end of a column field. Fixing this will have to involve rewriting the way we read data from the row (maybe multiple threads per row and/or a small shared-memory scratch buffer per row). The good news is that fixing it will automatically benefit the data conversion stage as well.
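To make the access-pattern argument concrete, here is a rough counting model in Python. It is not cuDF code and not a profile. It only counts how many distinct cache lines one warp-wide load touches under two layouts: one thread per row (each thread scanning its own row byte by byte) versus one warp per row (32 threads reading 32 adjacent bytes). The 128-byte line size, the 32-thread warp, the fixed 300-byte row length, and all function names are assumptions chosen for illustration.

```python
CACHE_LINE = 128  # bytes per cache line (assumed for illustration)
WARP = 32         # threads per warp (assumed)


def lines_touched(addresses):
    """Number of distinct cache lines covered by one warp-wide load."""
    return len({addr // CACHE_LINE for addr in addresses})


def thread_per_row(row_bytes, step):
    """Addresses read at a given step when thread t scans row t byte by byte."""
    return [t * row_bytes + step for t in range(WARP)]


def warp_per_row(row_start, row_bytes, step):
    """Addresses read at a given step when the whole warp scans one row."""
    base = row_start + step * WARP
    row_end = row_start + row_bytes
    # Threads that would read past the end of the row stay idle.
    return [base + t for t in range(WARP) if base + t < row_end]


def total_line_touches(row_bytes, n_rows=WARP):
    """Total cache-line touches needed to scan n_rows rows under each scheme."""
    # Scheme A: one thread per row, one byte per thread per step.
    per_thread = sum(
        lines_touched(thread_per_row(row_bytes, step))
        for step in range(row_bytes)
    )
    # Scheme B: one warp per row, 32 adjacent bytes per step.
    steps = -(-row_bytes // WARP)  # ceiling division
    per_warp = sum(
        lines_touched(warp_per_row(row * row_bytes, row_bytes, step))
        for row in range(n_rows)
        for step in range(steps)
    )
    return per_thread, per_warp


if __name__ == "__main__":
    a, b = total_line_touches(row_bytes=300)
    print(f"one thread per row: {a} cache-line touches")
    print(f"one warp per row:   {b} cache-line touches")
```

In this model, with rows a few hundred bytes long, every thread in the one-thread-per-row scheme lands on a different cache line at every step, so each load pulls in a full line to use a single byte of it. The warp-per-row layout touches at most two lines per load. A real kernel would also have to handle variable row lengths, quoting, and the hand-off through shared memory, none of which this sketch attempts.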

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

GregoryKimball commented 2 years ago

@PointKernel based on your recent study of CSV type inference, do you think we should keep this issue open?