google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

add_numeric_table_values() has wrong behavior with table shuffled with pd.DataFrame.sample() #107

Closed jeromemassot closed 3 years ago

jeromemassot commented 3 years ago

Hi Google Research team,

I am seeing very strange behavior (at least to me, a Computer Science newbie) from add_numeric_table_values() when the ingested table has been shuffled with the pd.DataFrame.sample() method.

In the following block of code, the row iterator returns corrupted rows for my table. I have checked iterrows() outside the Tapas Tokenizer and the rows returned are correct. Inside the Tokenizer, however, the rows are sometimes fine but sometimes already contain Cell objects and correspond to the wrong rows!

# Second, replace cell values by Cell objects
for row_index, row in table.iterrows():
    for col_index, cell in enumerate(row):
        table.iloc[row_index, col_index] = Cell(text=cell)

The direct result in my case is a crash in the normalize_for_match() method: AttributeError: 'Cell' object has no attribute 'lower'. The error itself is expected, since several rows in the table now hold Cell objects instead of str.

I cannot see why the row iterator suddenly returns corrupted data, in both type and value.

Thanks

Best regards

Jerome
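
One possible explanation, offered as a sketch rather than a confirmed diagnosis: iterrows() yields index labels, while iloc expects integer positions. After pd.DataFrame.sample() the labels are shuffled and no longer match the positions, so the write in the quoted loop can land on a different row than the one just read. The snippet below reproduces that mismatch on a small table. The Cell class and the function names to_cells_buggy and to_cells_positional are hypothetical stand-ins, not the tokenizer's actual code, and a fixed reordering with iloc replaces sample() so the result is deterministic.

```python
import pandas as pd


class Cell:
    """Hypothetical stand-in for the tokenizer's Cell type."""

    def __init__(self, text):
        self.text = text


def to_cells_buggy(table):
    """Mirror the quoted loop: index labels are used as positions."""
    table = table.copy()
    for row_index, row in table.iterrows():  # row_index is a LABEL
        for col_index, cell in enumerate(row):
            # iloc interprets row_index as a POSITION
            table.iloc[row_index, col_index] = Cell(text=cell)
    return table


def to_cells_positional(table):
    """Assumed fix: iterate by position over a snapshot of the values."""
    table = table.copy()
    values = table.values.tolist()  # snapshot, unaffected by later writes
    for row_pos, row in enumerate(values):
        for col_pos, cell in enumerate(row):
            table.iloc[row_pos, col_pos] = Cell(text=cell)
    return table


df = pd.DataFrame(
    {"name": ["a", "b", "c"], "val": ["1", "2", "3"]}, dtype=object
)
# Reorder rows so that labels [2, 0, 1] no longer match positions [0, 1, 2],
# which is the situation sample() leaves behind.
shuffled = df.iloc[[2, 0, 1]]

buggy = to_cells_buggy(shuffled)
fixed = to_cells_positional(shuffled)

print("original first row:", shuffled.iloc[0, 0])
print("buggy first row:   ", buggy.iloc[0, 0].text)
print("fixed first row:   ", fixed.iloc[0, 0].text)
```

If this is indeed the cause, calling table.reset_index(drop=True) on the shuffled table before passing it to the tokenizer should make labels and positions agree again and avoid the problem without touching the library code.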

ghost commented 3 years ago

Hi Jerome!

I understand that you are talking about the Tapas Tokenizer in huggingface. Would it make sense to raise the issue there?

I am closing this issue but feel free to reopen if I am missing something.

jeromemassot commented 3 years ago

Indeed, sorry for the confusion. Thanks Thomas