vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Join issues with lists #2040

Open Ben-Epstein opened 2 years ago

Ben-Epstein commented 2 years ago


Description

Given the following dataframe:

from random import random, choices, choice
import vaex
import numpy as np

num = 100_000

sample_ids = list(range(num//10))
labels = ["A", "B", "C", "D"]

labels_per_sample = {i: choice(labels) for i in sample_ids}

sample_ids = choices(sample_ids, k=num)

df = vaex.from_arrays(
    id=list(range(num)),
    sample_id=sample_ids,
    span_start=np.random.randint(0, 100, num),
    span_end=np.random.randint(11, 100, num),
    gold=[labels_per_sample[i] for i in sample_ids],
    is_good=choices([True, False],k=num)
)

I first keep only the good rows, then build a JSON string for each row:

df_good = df[df["is_good"]]
df_good["spans"] = "{" + '"span_start":' + df_good.span_start.astype("str") + ',"span_end":' + df_good.span_end.astype("str") + ',"gold":' + '"' + df_good.gold + '"}'
display(df_good[["spans"]])

output:

#   spans
0   {"span_start":73,"span_end":39,"gold":"A"}
1   {"span_start":13,"span_end":65,"gold":"D"}
2   {"span_start":32,"span_end":51,"gold":"D"}
3   {"span_start":71,"span_end":91,"gold":"A"}
4   {"span_start":93,"span_end":16,"gold":"D"}
... ...
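As a sanity check on the string being built, here is a plain-Python sketch (no Vaex involved) that reproduces the concatenation for a single row and confirms the result parses as JSON. The helper name `span_json` is made up for illustration, and the values are taken from the first output row above.

```python
import json

def span_json(span_start: int, span_end: int, gold: str) -> str:
    """Hypothetical helper mirroring the column expression for one row."""
    return (
        "{" + '"span_start":' + str(span_start)
        + ',"span_end":' + str(span_end)
        + ',"gold":' + '"' + gold + '"}'
    )

row = span_json(73, 39, "A")
parsed = json.loads(row)  # raises ValueError if the string is not valid JSON
# For these simple values, compact json.dumps yields the same text
same = json.dumps(parsed, separators=(",", ":")) == row
print(row, same)
```

Note that manual concatenation like this only stays valid JSON as long as `gold` contains no characters that need escaping, such as double quotes or backslashes.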

Now I use the list aggregator:

x = df_good.groupby("sample_id", agg={"spans":vaex.agg.list})
display(x[["spans"]])

output:

#   spans
0   '[\'{"span_start":65,"span_end":26,"gold":"D"}\', ...
1   '[\'{"span_start":40,"span_end":49,"gold":"B"}\', ...
2   '[\'{"span_start":12,"span_end":65,"gold":"C"}\', ...
3   '[\'{"span_start":67,"span_end":48,"gold":"D"}\', ...
4   '[\'{"span_start":78,"span_end":51,"gold":"C"}\', ...
... ...
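To make the shape of that result concrete, here is a plain-Python sketch of what a list aggregation does conceptually: it collects the string from every row into one list per `sample_id`. This is an illustration only, not how Vaex implements it, and the sample rows are invented.

```python
from collections import defaultdict

# Invented (sample_id, spans) rows standing in for df_good
rows = [
    (0, '{"span_start":65,"span_end":26,"gold":"D"}'),
    (1, '{"span_start":40,"span_end":49,"gold":"B"}'),
    (0, '{"span_start":12,"span_end":65,"gold":"D"}'),
]

grouped = defaultdict(list)
for sample_id, spans in rows:
    grouped[sample_id].append(spans)  # one list of JSON strings per sample_id

print(dict(grouped))
```

Each value is a list of strings rather than a single string, which is why the column cannot be passed to a JSON parser directly.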

Because of this odd shape, I cannot figure out how to marshal it into JSON, so I do the following:

x["spans"] = "[" + x["spans"].str.join(",") + "]"
display(x[["spans"]])

output (each value can be parsed with json.loads after writing to disk):

#   spans
0   '[{"span_start":65,"span_end":26,"gold":"D"},{"s...
1   '[{"span_start":40,"span_end":49,"gold":"B"},{"s...
2   '[{"span_start":12,"span_end":65,"gold":"C"},{"s...
3   '[{"span_start":67,"span_end":48,"gold":"D"},{"s...
4   '[{"span_start":78,"span_end":51,"gold":"C"},{"s...
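The effect of that expression on a single group can be sketched in plain Python: joining the per-row JSON objects with commas and wrapping them in brackets gives one string that `json.loads` accepts as an array. The input list here is invented for illustration.

```python
import json

# Invented list of per-row JSON strings for one sample_id
spans = [
    '{"span_start":65,"span_end":26,"gold":"D"}',
    '{"span_start":12,"span_end":65,"gold":"D"}',
]

as_text = "[" + ",".join(spans) + "]"  # same shape as the x["spans"] expression
decoded = json.loads(as_text)          # a list of dicts
print(decoded[0]["span_start"], len(decoded))
```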

Now I join it back:

df2 = df.copy()
cols = ["sample_id", "gold"]
df2 = df2[cols].groupby(cols,agg={"__tmp":"count"}).drop("__tmp")
df2.join(x, on="sample_id")

and I get the error below. It seems to be related to the string join, since the spans column is of type list, not string. But if I don't perform the final join at the end, I don't get this error.

  File "<string>", line 1, in <module>
  File "/Users/benepstein/Documents/GitHub/rungalileo/.venv/lib/python3.9/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/Users/benepstein/Documents/GitHub/rungalileo/.venv/lib/python3.9/site-packages/vaex/functions.py", line 1414, in str_join
    raise TypeError(f'join expected a list, not {x}')
TypeError: join expected a list, not [array(['{"span_start":64,"span_end":25,"gold":"B"}',
       '{"span_start":40,"span_end":88,"gold":"B"}',
       '{"span_start":18,"span_end":71,"gold":"B"}',
       '{"span_start":14,"span_end":65,"gold":"B"}',
       '{"span_start":51,"span_end":89,"gold":"B"}',
       '{"span_start":4,"span_end":42,"gold":"B"}',
       '{"span_start":37,"span_end":22,"gold":"B"}',
       '{"span_start":86,"span_end":34,"gold":"B"}',
       '{"span_start":12,"span_end":38,"gold":"B"}',
       '{"span_start":89,"span_end":44,"gold":"B"}'], dtype=object)]

For example, if I just do x.export("file.hdf5"), I get no such complaint. I only see this when joining back.

If I add x["spans"] = x.spans.evaluate() before the join, it works. But of course I don't want to do that, since it brings everything into memory.

Software information