google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Fine-tuning on custom dataset using Huggingface. #116

Closed: AhmedMasryKU closed this issue 3 years ago

AhmedMasryKU commented 3 years ago

I have a custom dataset that is similar to WTQ: it involves aggregation, and it contains only the final answer, without annotations for the table cells that were used to derive it. I was following the example on Huggingface (https://huggingface.co/transformers/model_doc/tapas.html). However, it says that the data should be in SQA format, which requires the answer_coordinates of the relevant cells in the table. It also mentions that we can follow the logic explained in this link (https://github.com/google-research/tapas/issues/50#issuecomment-705465960) to generate the answer_coordinates for our custom data. I have been trying to figure out how to apply that logic to my custom dataset, and I realized that your code only generates the tfrecords at the very end.

So, do you have any scripts, or any advice on how I can generate such answer coordinates for my custom training data given only the final answer?
I want to produce TSV files that I can use to train the model on Huggingface.

eisenjulian commented 3 years ago

Can you give an example of the input and the script you are running? Based on the code you linked, I imagine we also generate artifacts in tfrecord format that contain Interaction protobuf messages, which should be readable with tf.data.TFRecordDataset and will contain the result of the matching process.

AhmedMasryKU commented 3 years ago

Thanks for your reply. Your preprocessing code only produces TFRecord files, which I guess are not readable by the Huggingface PyTorch model. Anyway, I found a script that can prepare a custom dataset in the input format of the Huggingface model. If anyone wants it, they can check it out at https://github.com/NielsRogge/tapas_utils
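
For anyone who only needs the core idea, below is a minimal sketch of deriving answer coordinates from a final answer by matching it against the table cells. It is not the logic of the linked script or of the tapas preprocessing code, just a simple stand-in. The function names, the column names in the TSV, and the sample table are all my own assumptions for illustration, and the matching is deliberately naive (exact text or equal number). When no cell matches and the answer is numeric, the answer probably comes from an aggregation, so the sketch keeps only the float value.

```python
import csv
import io
import math


def normalize(text):
    """Lowercase and collapse whitespace so ' Paris ' matches 'paris'."""
    return " ".join(str(text).lower().split())


def to_float(text):
    """Return text as a float, or None if it is not a plain number."""
    try:
        return float(str(text).replace(",", ""))
    except ValueError:
        return None


def find_answer_coordinates(rows, answer_text):
    """Return (row, column) pairs of cells that equal the answer.

    `rows` holds data rows only (no header), so row 0 is the first data row.
    """
    target = normalize(answer_text)
    target_number = to_float(answer_text)
    coordinates = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            cell_number = to_float(cell)
            same_number = (
                target_number is not None
                and cell_number is not None
                and math.isclose(cell_number, target_number)
            )
            if normalize(cell) == target or same_number:
                coordinates.append((r, c))
    return coordinates


def build_example(question, rows, answer_text):
    """Bundle one question with its matched cells and numeric answer, if any."""
    return {
        "question": question,
        "answer_coordinates": find_answer_coordinates(rows, answer_text),
        "answer_text": answer_text,
        # None when the answer is not a number (hypothetical convention).
        "float_answer": to_float(answer_text),
    }


def write_tsv(examples, handle):
    """Write examples as tab-separated rows; column names are illustrative."""
    fields = ["question", "answer_coordinates", "answer_text", "float_answer"]
    writer = csv.DictWriter(handle, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    for example in examples:
        writer.writerow(example)


# Made-up sample data, only to show the two cases.
table_rows = [
    ["Paris", "France", "2,100,000"],
    ["Lyon", "France", "500,000"],
]
examples = [
    build_example("Which city is listed first?", table_rows, "Paris"),
    build_example("What is the total population?", table_rows, "2600000"),
]
buffer = io.StringIO()
write_tsv(examples, buffer)
```

The first example resolves to a single cell, while the second finds no matching cell and carries only a float answer, which is the weakly supervised case my dataset mostly consists of. Real data will need fuzzier matching (dates, units, multiple answers) than this sketch does.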

eisenjulian commented 3 years ago

Just to be clearer, I was suggesting that the TFRecordDataset be read and converted into the format you need for Huggingface. We are also exploring the option of exporting a JSONL version of the interactions for easier interoperability. Stay tuned.
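
A rough sketch of that conversion, under several assumptions: TensorFlow 2 with eager execution, the tapas package installed so that `tapas.protos.interaction_pb2` is importable, and the Interaction field names as I recall them from `interaction.proto` (`table.columns`, `table.rows[].cells`, `questions[].original_text`, `answer.answer_coordinates`, `answer.answer_texts`). Please check those against the proto file before relying on them. The function names and the output keys are hypothetical.

```python
import json


def read_interactions(path):
    """Yield Interaction messages from a TFRecord file of interactions."""
    # Local imports: only this function needs TensorFlow and the tapas package.
    import tensorflow as tf
    from tapas.protos import interaction_pb2

    for raw in tf.data.TFRecordDataset(path):
        interaction = interaction_pb2.Interaction()
        interaction.ParseFromString(raw.numpy())
        yield interaction


def interaction_to_examples(interaction):
    """Flatten one interaction into one plain dict per question."""
    header = [column.text for column in interaction.table.columns]
    rows = [[cell.text for cell in row.cells] for row in interaction.table.rows]
    examples = []
    for question in interaction.questions:
        answer = question.answer
        examples.append({
            "id": question.id,
            "question": question.original_text,
            "header": header,
            "rows": rows,
            # Coordinates found by the matching step, as [row, column] pairs.
            "answer_coordinates": [
                [coord.row_index, coord.column_index]
                for coord in answer.answer_coordinates
            ],
            "answer_text": list(answer.answer_texts),
        })
    return examples


def to_jsonl(examples):
    """Serialize examples as JSON Lines, one question per line."""
    return "\n".join(json.dumps(example) for example in examples)


# Usage, with a placeholder path:
# for interaction in read_interactions("interactions/train.tfrecord"):
#     print(to_jsonl(interaction_to_examples(interaction)))
```

From there, each dict can be turned into whatever TSV or DataFrame layout the Huggingface tokenizer expects.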