ifo-institute / ifohack2023

Repository for all coding activities related to ifoHack 2023

Final provision of IBS #21

Closed gerwolf closed 1 year ago

gerwolf commented 1 year ago

@VFMR Please roll out the final dataset for the ifo challenge to all VMs as discussed on Friday.

  1. Test period will be December 2022; please drop that month for all entities and leave one original, complete file as it is now on the admin-vm for validation. I'll set up a Gitea repo for @muskuloes and me to finalise the leaderboard (Streamlit should be working by now).

  2. Please drop any columns, such as industry classification, firm size, etc., which are (almost) time-invariant, and provide this information in a separate .csv file that can be merged onto the main panel.

We should aim to have everything finalised by Wednesday 26 April 2023 afternoon.
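The split in point 1 could be sketched with pandas as below. The toy panel is illustrative; only the key columns (idnum, year, month) and the survey item vg_statebus reflect names used in this thread.

```python
import pandas as pd

# Toy stand-in for the IBS panel, keyed by idnum / year / month.
panel = pd.DataFrame({
    "idnum":       [1, 1, 2, 2],
    "year":        [2022, 2022, 2022, 2022],
    "month":       [11, 12, 11, 12],
    "vg_statebus": [0.1, 0.2, 0.3, 0.4],
})

# Hold out December 2022 for every entity; keep everything else for training.
is_test = (panel["year"] == 2022) & (panel["month"] == 12)
train = panel[~is_test]
test = panel[is_test]
```

The original complete file would simply stay untouched on the admin-vm, with only `train` rolled out to the participant VMs.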

gerwolf commented 1 year ago

@VFMR FYI

  1. I generated those two files; they are on admin-prod-01 and are called IBS_training_dataset.csv and IBS_test_dataset.csv. Please make sure that only the former, IBS_training_dataset.csv, is made available on source-data!
  2. Can you collect all columns which contain time-invariant information (such as sector_id) and collapse them into a single dataset which can be joined through idnum?
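The collapse requested in point 2 could look like this sketch: split the time-invariant columns into a base table keyed by idnum, drop them from the main panel, and verify they merge back cleanly. All values and column names other than idnum and sector_id are made up for illustration.

```python
import pandas as pd

panel = pd.DataFrame({
    "idnum":       [1, 1, 2, 2],
    "year":        [2021, 2022, 2021, 2022],
    "month":       [1, 1, 1, 1],
    "sector_id":   ["A", "A", "B", "B"],    # time-invariant per firm
    "vg_statebus": [0.1, 0.2, 0.3, 0.4],    # time-varying
})

# Base table: one row per idnum with the time-invariant information.
base = panel[["idnum", "sector_id"]].drop_duplicates()

# Main panel without the collapsed columns.
main = panel.drop(columns=["sector_id"])

# The base table can be joined back onto the main panel through idnum.
merged = main.merge(base, on="idnum", how="left")
```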
VFMR commented 1 year ago

@gerwolf

  1. Thank you for generating the files.
  2. I am confused as to why we would do this. A priori, it is not clear which columns are time-invariant; even the sector might change over time. I could check the data to find the columns that are actually time-invariant, but that introduces other complications and assumptions on my side, mostly due to the historical complexities in the data. In particular, the idnum may not always be available for earlier years, so I would need a way around that. Since it must be possible to join the time-invariant dataset to the time-variant data anyhow, I don't see why we would do this for the participants. They can figure the variation out themselves, right?
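The "check which columns are actually time-invariant" step mentioned here amounts to counting distinct values per firm; a minimal sketch on toy data (all values illustrative):

```python
import pandas as pd

panel = pd.DataFrame({
    "idnum":     [1, 1, 2, 2],
    "sector_id": ["A", "A", "B", "C"],  # changes over time for firm 2
    "firm_size": ["S", "S", "L", "L"],  # constant within each firm
})

# A column is time-invariant if every firm shows at most one distinct value.
per_firm_counts = panel.groupby("idnum").nunique()
invariant_cols = [c for c in per_firm_counts.columns
                  if (per_firm_counts[c] <= 1).all()]
```

On this toy data only firm_size qualifies, which illustrates VFMR's point: a column like sector_id may look invariant but is not guaranteed to be.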
VFMR commented 1 year ago

> @VFMR FYI
>
>   1. I generated those two files; they are on admin-prod-01 and are called IBS_training_dataset.csv and IBS_test_dataset.csv. Please make sure that only the former IBS_training_dataset.csv is made available on source-data!

Where did you put the test data, btw? I cannot find it.

We might need an overview of all the target names because I do not think it is very helpful to predict all 150 columns of the final month. There are things like the date and time of survey participation that need not be predicted. I would suggest having only the ~15 main question item targets. I can make a file with the target names and the structure of the test data. For automatic evaluation, the test data should then include only the idnum and this selection of target columns.

Also, the sample is unbalanced, so participants need to know which idnums they have to predict for! I suggest providing a sample_submission.csv that contains the target column names and all the idnums but leaves the value cells empty. I have prepared such a file, along with a new test data file to go with it. The IBS_sample_submission.csv will be provided in the source data buckets (it is currently only in the one for work_prod_01), and the IBS_test_data.csv is in ~/Desktop/project_directory/input/.
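A submission template of the shape described here could be built as follows. The idnum values are made up, and the three target columns assume the targets named elsewhere in the thread (vg_statebus, vg_comexp, vg_priceexp):

```python
import pandas as pd

# Firms for which predictions are expected; because the panel is
# unbalanced, participants need this list made explicit.
test_ids = pd.Series([101, 102, 105], name="idnum")
targets = ["vg_statebus", "vg_comexp", "vg_priceexp"]

# Template: all idnums present, target columns left empty.
sample_submission = pd.DataFrame({"idnum": test_ids})
for col in targets:
    sample_submission[col] = pd.NA

sample_submission.to_csv("IBS_sample_submission.csv", index=False)
```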

Another important thing to note about validation: the survey items are not mandatory! They may be missing in some cases.

muskuloes commented 1 year ago

> We might need an overview of all the target names because I do not think it is very helpful to predict all 150 columns of the final month. There are things like the date and time of survey participation that need not be predicted. I would suggest to have only the ~ 15 main question item targets.

The three targets of interest are: vg_statebus, vg_comexp and vg_priceexp. Those are the only columns we care to predict.

VFMR commented 1 year ago

Okay. I created a new sample_submission file containing only these targets.

gerwolf commented 1 year ago

Sorry for the confusion, here's how to go about it:

  1. I split the data into two parts: the training dataset will comprise all year-months up to and including December 2021, and the test dataset for leaderboard evaluation will be the entire year 2022. The exercise therefore becomes a multi-step forecasting problem. Each firm's prediction in each month will be weighted equally in the evaluation score.
  2. I removed (close to) time-invariant features such as vg_weight and sector_id. I confirmed that they are almost time-invariant: dropping duplicate rows (ignoring year-month) reduces the panel from 1,808,565 rows to 58,247, i.e. by 97%, which is slightly less of a reduction than keeping only unique idnum values would give (unique idnum make up about 0.01% of the entire panel), meaning there is some time variation in these features, but not much. The idea is that this information will be provided to the participants after they have done some feature engineering, so they can confirm any clusters they have identified and interpret them qualitatively.
  3. In the directory source-data/work-vm-prod01/IBS Rollout Master you will find the following files: IBS_paneldata_train.csv (the main dataset for model development), IBS_basedata_train.csv (provided separately at a later stage), IBS_paneldata_test.csv (the main dataset for prediction evaluation; we use only the three targets vg_statebus, vg_comexp and vg_priceexp, as correctly pointed out by @muskuloes above), and lastly IBS_basedata_test.csv (which isn't strictly required).
  4. All files can be merged through the keys year, month and idnum.
  5. Considering only the three target variables for evaluation, dropping missing values leaves 24,877 complete firm-month pairs; submissions will be evaluated against these 24,877 remaining values, which cover 2,963 unique firm IDs.
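Points 4 and 5 above (merging through the keys year, month and idnum, then keeping only the firm-month pairs where all three targets are observed) can be sketched on toy data as follows; all values are illustrative:

```python
import pandas as pd

targets = ["vg_statebus", "vg_comexp", "vg_priceexp"]

panel_test = pd.DataFrame({
    "idnum":       [1, 1, 2],
    "year":        [2022, 2022, 2022],
    "month":       [1, 2, 1],
    "vg_statebus": [0.1, None, 0.3],   # survey items are not mandatory,
    "vg_comexp":   [0.2, 0.5, 0.4],    # so some values are missing
    "vg_priceexp": [0.3, 0.6, None],
})
base_test = pd.DataFrame({"idnum": [1, 2], "sector_id": ["A", "B"]})

# The base data varies only by idnum, so merging on idnum alone suffices;
# panel files would merge through all three keys (year, month, idnum).
merged = panel_test.merge(base_test, on="idnum", how="left")

# Evaluation set: firm-month pairs with all three targets observed.
complete = merged.dropna(subset=targets)
```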

@VFMR please roll out the file IBS_paneldata_train.csv in directory source-data/work-vm-prod01/IBS Rollout Master for now only. Thanks!

VFMR commented 1 year ago

The file IBS/IBS_paneldata_train.csv is now copied to all source-data folders. I checked VMs 2, 16, and 34; all could see the file as planned. If there are no further requirements in this regard, the issue can be closed.

gerwolf commented 1 year ago

Great, @VFMR thanks a lot. Where did you move the files IBS_basedata_train.csv, IBS_paneldata_test.csv and IBS_basedata_test.csv?

VFMR commented 1 year ago

@gerwolf These files are in project-directory/input/... of the admin VM.

VFMR commented 1 year ago

Test files are now in shared-data/IBS.