rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support read_fwf functionality in cudf #15924

Open a-hirota opened 3 months ago

a-hirota commented 3 months ago

Missing Pandas Feature Request

Support for pandas.read_fwf.

Profiler Output

N/A

Additional context

Background: COBOL remains in continuous use in the legacy enterprise space, and the reality is that a complete overhaul of these legacy systems is difficult to achieve at this time. If the processing done by legacy systems could be made to run on GPUs, it could bring significant change to this area. Since COBOL deals with fixed-width flat files, support for fixed-width files would be a natural first step toward addressing this need.

Code Example: For instance, consider the following example data:

data = '''\
abcdef123456790.1234567abc           1234
ABCDEF123456790.1234567abc           5678
'''
with open('data.txt', 'w') as f:
    f.write(data)

import pandas as pd

# Example usage of pandas read_fwf
df = pd.read_fwf('data.txt', colspecs=[(0, 6), (6, 23), (23, 37), (37, 41)], header=None)

# Ensure that the output is not in scientific notation
pd.set_option('display.float_format', lambda x: '%.7f' % x)

print(df)

Expected output:

        0                 1    2     3
0  abcdef 123456790.1234567  abc  1234
1  ABCDEF 123456790.1234567  abc  5678


brandon-b-miller commented 3 months ago

Hi @a-hirota, thanks for raising this issue. While I'm not aware of any current efforts to implement this feature, I'd like to leave this issue open for further discussion and updates in the future. If enough people express interest here, that may generate some ideas and eventually move this forward.

GregoryKimball commented 3 months ago

Hello @a-hirota, thank you for your request. I believe this reader is something that we can support by combining cudf APIs today. Would you please let me know if this works for you?

import cudf

# Read each line of the fixed-width file as a single string element
series = cudf.read_text('data.txt', delimiter='\n')
colspecs = [(0, 6), (6, 23), (23, 37), (37, 41)]

df = cudf.DataFrame()
for n, (start, stop) in enumerate(colspecs):
    # Slice out each fixed-width column and strip the padding
    df[n] = series.str.slice(start, stop).str.strip()

    # Infer numeric dtypes; these unanchored patterns are keyed to the sample data
    if df[n].str.contains(r'\d+\.\d+').all():
        df[n] = df[n].astype('float64')
    elif df[n].str.contains(r'\d+').all():
        df[n] = df[n].astype('int64')

print(df)
        0                 1    2     3
0  abcdef 123456790.1234567  abc  1234
1  ABCDEF 123456790.1234567  abc  5678

a-hirota commented 3 months ago

Hello @GregoryKimball , thank you for your prompt response! I appreciate the swift assistance.

I've conducted experiments and confirmed that it's functioning as expected.

However, because each column requires a separate string-slicing pass, the slicing step on the GPU is slower than on the CPU, particularly with a dataset of around 1 million records and up to 2,000 columns (roughly 1/50th of our usual daily processing volume). Although the total GPU processing time, including read time, is still better than the CPU's, it doesn't result in a significant speedup:

< String Slicing Time >
CPU: 0.1727 seconds
GPU: 0.5404 seconds

Experiment results: https://github.com/a-hirota/rapids_qa/blob/main/fwf_read_nvidia15924.ipynb
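
For context, here is a minimal sketch of the kind of per-column slicing comparison described above. It is only an illustration under assumed sizes (100,000 rows, 50 columns of width 10), not the actual benchmark; the real measurements are in the notebook linked above.

import time

import pandas as pd
import cudf

# Build an illustrative fixed-width file: every column is 10 characters wide.
n_rows, n_cols, width = 100_000, 50, 10  # assumed sizes, not the real workload
line = ''.join(str(c).rjust(width) for c in range(n_cols))
with open('wide.txt', 'w') as f:
    f.write('\n'.join([line] * n_rows) + '\n')

colspecs = [(i * width, (i + 1) * width) for i in range(n_cols)]

# CPU: slice each column out of the raw lines with pandas string methods
cpu_lines = pd.Series(open('wide.txt').read().splitlines())
t0 = time.time()
cpu_df = pd.DataFrame({n: cpu_lines.str.slice(a, b).str.strip()
                       for n, (a, b) in enumerate(colspecs)})
cpu_time = time.time() - t0

# GPU: same per-column slicing with cudf string methods
gpu_lines = cudf.read_text('wide.txt', delimiter='\n')
t0 = time.time()
gpu_df = cudf.DataFrame({n: gpu_lines.str.slice(a, b).str.strip()
                         for n, (a, b) in enumerate(colspecs)})
gpu_time = time.time() - t0  # rough wall-clock timing only

print(f'CPU slicing: {cpu_time:.4f} s, GPU slicing: {gpu_time:.4f} s')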

I believe that accepting the colspecs at read time, as read_fwf does, would eliminate the need to re-slice the series by position after reading. That optimization could lead to a significant speedup compared to the CPU. (Although not included in my example usage above, being able to specify dtypes would also be beneficial.)
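
To make the proposal concrete, here is a rough sketch of the kind of interface I have in mind, written as a hypothetical helper on top of today's cudf APIs. The name read_fwf_gpu and its colspecs/dtypes parameters are my own illustration, not an existing cudf function; a native reader could do the slicing in a single pass instead of column by column.

import cudf

def read_fwf_gpu(path, colspecs, dtypes=None, delimiter='\n'):
    # Hypothetical read_fwf-style helper built on cudf.read_text + str.slice
    lines = cudf.read_text(path, delimiter=delimiter)
    df = cudf.DataFrame()
    for n, (start, stop) in enumerate(colspecs):
        col = lines.str.slice(start, stop).str.strip()
        if dtypes is not None and n in dtypes:
            # Caller-specified dtype: no inference pass over the column needed
            col = col.astype(dtypes[n])
        df[n] = col
    return df

# Illustrative usage with the data.txt sample from the top of this issue
df = read_fwf_gpu('data.txt',
                  colspecs=[(0, 6), (6, 23), (23, 37), (37, 41)],
                  dtypes={1: 'float64', 3: 'int64'})
print(df)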

Additionally, legacy systems tend to have lightweight computational tasks, mainly rule-based logic, so the majority (80-90%) of processing time goes to I/O operations.