megagonlabs / starmie

Resources for PVLDB 2023 submission

Pretraining Datasets #3

Open superctj opened 1 year ago

superctj commented 1 year ago

Thank you for open-sourcing the code! I didn't find a description of the pretraining datasets in the paper. Was Starmie pretrained on the benchmark datasets?

jw-megagon commented 1 year ago

Sorry for the late reply. We used the VizNet tables to pre-train the column encoder; they can be found on this page: https://github.com/megagonlabs/sato/tree/master/table_data

IbraheemTaha commented 7 months ago

Thanks a lot for open-sourcing the code and for your answer @jw-megagon. I was wondering: did you train the model on all tables of VizNet (80,000), or did you use the multi-column sets only? Moreover, could you please provide the hyperparameters (--batch_size, --lr, --lm, --n_epochs, --max_len, --size, --projector, --augment_op, --sample_meth, --table_order) you used in the training process?

Thanks in advance!
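For concreteness, an invocation exercising all of the flags listed above might look like the sketch below. Every value shown is an illustrative placeholder, not a setting confirmed by the authors, and the `--task viznet` argument is likewise an assumption:

```python
# Sketch of a pretraining invocation covering the flags asked about above.
# All values are illustrative guesses, NOT the authors' confirmed settings.
import subprocess

subprocess.run([
    "python", "pretrain.py",
    "--task", "viznet",        # assumed task name for the VizNet corpus
    "--batch_size", "64",
    "--lr", "5e-5",
    "--lm", "roberta",
    "--n_epochs", "3",
    "--max_len", "128",
    "--size", "10000",
    "--projector", "768",
    "--augment_op", "drop_col",
    "--sample_meth", "head",
    "--table_order", "column",
], check=True)
```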

Kirito-Aus commented 5 months ago

To obtain the VizNet training data, I saved all the tables from the folders under the viznet_tables/webtableX/KX_multi-col directories at https://github.com/megagonlabs/sato/tree/master/table_data. I then stored these tables in the data/viznet/tables folder of the project and simplified their names to make them more concise. Could you please confirm whether these steps are correct?
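A minimal sketch of that collection step, assuming the directory layout of the sato repo linked above (both the glob pattern and the renaming scheme here are illustrative, not prescribed by the authors):

```python
# Copy every table under viznet_tables/webtableX/KX_multi-col into
# data/viznet/tables with shortened, collision-free names.
import glob
import os
import shutil

SRC_PATTERN = "sato/table_data/viznet_tables/webtable*/K*_multi-col/*"  # assumed layout
DST_DIR = "data/viznet/tables"

os.makedirs(DST_DIR, exist_ok=True)
for i, path in enumerate(sorted(glob.glob(SRC_PATTERN))):
    ext = os.path.splitext(path)[1]
    # Simplified name: table_0.csv, table_1.csv, ...
    shutil.copy(path, os.path.join(DST_DIR, f"table_{i}{ext}"))
```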

Kirito-Aus commented 5 months ago

I have obtained data/viznet/tables through the steps mentioned above. However, during the pretraining process, the file data/viznet/test.csv is required (line 284 in pretrain.py). Can you please tell me where I can obtain this file?

Thanks in advance! : )
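One possible stopgap, under the unverified assumption that data/viznet/test.csv is just a held-out sample of the collected tables used for evaluation during pretraining (the expected schema of this file is not documented in this thread):

```python
# Build a stand-in data/viznet/test.csv from a small random sample of the
# collected tables. This is a guess at the file's purpose, not a confirmed
# reconstruction of the original artifact.
import glob
import random

import pandas as pd

tables = sorted(glob.glob("data/viznet/tables/*.csv"))
random.seed(0)
sample = random.sample(tables, k=min(50, len(tables)))  # small held-out sample

# Concatenate the sampled tables; differing column sets are aligned on
# their union by pd.concat, with missing cells filled as NaN.
pd.concat([pd.read_csv(t) for t in sample], ignore_index=True) \
  .to_csv("data/viznet/test.csv", index=False)
```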