Thank you for reaching out and helping us improve Vaex!
Before you submit a new Issue, please read through the documentation. Also, make sure you search through the Open and Closed Issues - your problem may already be discussed or addressed.
Description
Given the following dataframe:
from random import choice, choices
import vaex
import numpy as np
num = 100_000
sample_ids = list(range(num//10))
labels = ["A", "B", "C", "D"]
labels_per_sample = {i: choice(labels) for i in sample_ids}
sample_ids = choices(sample_ids, k=num)
df = vaex.from_arrays(
    id=list(range(num)),
    sample_id=sample_ids,
    span_start=np.random.randint(0, 100, num),
    span_end=np.random.randint(11, 100, num),
    gold=[labels_per_sample[i] for i in sample_ids],
    is_good=choices([True, False], k=num),
)
I first filter to keep only the good rows (is_good is True), then construct our JSON string for each row.
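As a plain-Python illustration of the shape I am after (this is not the Vaex code, and the helper name `spans_json_by_sample` is made up for this sketch): keep only the good rows, serialize each row's span to a compact JSON object, and collect the objects per sample_id into one JSON array string.

```python
import json
from collections import defaultdict


def spans_json_by_sample(rows):
    """Group good rows by sample_id and return {sample_id: JSON array string}."""
    grouped = defaultdict(list)
    for row in rows:
        if not row["is_good"]:
            continue  # mirror the "only good rows" filter
        span = {
            "span_start": row["span_start"],
            "span_end": row["span_end"],
            "gold": row["gold"],
        }
        # Compact separators give strings like {"span_start":64,"span_end":25,"gold":"B"}
        grouped[row["sample_id"]].append(json.dumps(span, separators=(",", ":")))
    # Comma-joining the per-row objects inside brackets yields a valid JSON array.
    return {sid: "[" + ",".join(spans) + "]" for sid, spans in grouped.items()}


# Tiny hand-written sample; the real dataframe has 100_000 rows.
rows = [
    {"sample_id": 0, "span_start": 64, "span_end": 25, "gold": "B", "is_good": True},
    {"sample_id": 0, "span_start": 40, "span_end": 88, "gold": "B", "is_good": True},
    {"sample_id": 1, "span_start": 5, "span_end": 42, "gold": "A", "is_good": False},
]
result = spans_json_by_sample(rows)
print(result[0])
```

Each value can be parsed with `json.loads`. The point of the issue is to get the same per-sample string out of Vaex lazily, without materializing the column in memory.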
When I join it back, I get the error below. It seems to be related to the string join, since the spans column is of type list, not string. If I don't perform the final join at the end, I don't get this error.
  File "<string>", line 1, in <module>
  File "/Users/benepstein/Documents/GitHub/rungalileo/.venv/lib/python3.9/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/Users/benepstein/Documents/GitHub/rungalileo/.venv/lib/python3.9/site-packages/vaex/functions.py", line 1414, in str_join
    raise TypeError(f'join expected a list, not {x}')
TypeError: join expected a list, not [array(['{"span_start":64,"span_end":25,"gold":"B"}',
       '{"span_start":40,"span_end":88,"gold":"B"}',
       '{"span_start":18,"span_end":71,"gold":"B"}',
       '{"span_start":14,"span_end":65,"gold":"B"}',
       '{"span_start":51,"span_end":89,"gold":"B"}',
       '{"span_start":4,"span_end":42,"gold":"B"}',
       '{"span_start":37,"span_end":22,"gold":"B"}',
       '{"span_start":86,"span_end":34,"gold":"B"}',
       '{"span_start":12,"span_end":38,"gold":"B"}',
       '{"span_start":89,"span_end":44,"gold":"B"}'], dtype=object)]
For example, if I just call x.export("file.hdf5"), I get no such complaint. I only see it when joining back.
If I add x["spans"] = x.spans.evaluate() before the join, it works. But of course I don't want to do that, since it brings everything into memory.
Software information
Vaex version (import vaex; vaex.__version__): 4.9.1
Vaex was installed via: pip
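To spell out the intermediate steps: after building the JSON strings, I use the list aggregation. Because of the odd shape of its output, I cannot figure out how to marshal it into JSON directly, so I build the string with a string join (the result can be passed to json.loads after writing to disk). Then I join it back, which is where the error is raised.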